Interpreting single-step genomic evaluation as a neural network of three layers: pedigree, genotypes, and phenotypes

The single-step approach has become the most widely-used methodology for genomic evaluations when only a subset of phenotyped individuals in the pedigree are genotyped, where the genotypes for non-genotyped individuals are imputed based on gene contents (i.e., genotypes) of genotyped individuals through their pedigree relationships. We proposed a new method named single-step neural network with mixed models (NNMM) to represent single-step genomic evaluations as a neural network of three sequential layers: pedigree, genotypes, and phenotypes. These three sequential layers of information create a unified network instead of two separate steps, allowing the unobserved gene contents of non-genotyped individuals to be sampled based on pedigree, observed genotypes of genotyped individuals, and phenotypes. In addition to imputation of genotypes using all three sources of information, including phenotypes, genotypes, and pedigree, single-step NNMM provides a more flexible framework to allow nonlinear relationships between genotypes and phenotypes, and for individuals to be genotyped with different single-nucleotide polymorphism (SNP) panels. The single-step NNMM has been implemented in the software package “JWAS’.


Background
The single-step approach [1][2][3] has been successfully adopted in genomic evaluations when only a subset of phenotyped individuals in the pedigree are genotyped.The single-step approach uses information from genotyped and non-genotyped relatives in two equivalent ways: (a) calculating an improved relationship matrix from pedigree and observed genotypes of genotyped individuals to model the covariances of breeding values for all relatives [1]; or equivalently, (b) imputing genotypes for non-genotyped individuals linearly based on gene contents (i.e., genotypes) of genotyped individuals and the pedigree, then propagating the uncertainty from the imputation by fitting additional random effects accounting for imputation errors in genomic evaluations [3] (see "Appendix").In practice, the linear imputation in (b) can be obtained by modeling the gene content of each marker as a quantitative trait with a very high heritability and fitting the "expected" gene content as random effects based on covariances defined by the pedigree [4].Thus, the latter interpretation (b) of the single-step approach, involves three sequential layers of information: pedigree, genotypes, and phenotypes.This leads to our new representation of the single-step approach as a neural network of three fully-connected sequential layers of information: pedigree (input layer), genotypes (middle layer), and phenotypes (output layer), as demonstrated in Fig. 1.
In previous work, we have proposed a method named "NNMM" (neural network with mixed models) for quantitative genetics, to extend mixed models ("MM") to neural networks ("NN") by adding intermediate layers of data (e.g., gene expression levels) between genotype and phenotype layers [5,6].Better prediction accuracies were observed when intermediate omics data were incorporated into genomic prediction using NNMM.In this paper, we show that NNMM can be adopted to incorporate pedigree, genotype, and phenotype information as a unified network named "single-step NNMM", thus providing a new representation of the single-step approach, and yielding equivalent or higher prediction accuracies, due to the advantages described below.
Single-step NNMM has several advantages over the conventional single-step approach [1][2][3].First, in the conventional single-step approach, gene contents of non-genotyped individuals are imputed based on the genotypes of genotyped individuals only through pedigree relationships.This can be considered as pre-analysis processing using Gengler's method [4], and phenotypes are not included in this pre-analysis.We will show that, in single-step NNMM, such pre-analysis is not needed, and gene contents of non-genotyped individuals can be "imputed" based on pedigree, genotypes, and phenotypes in the Bayesian neural networks using Markov chain Monte Carlo (MCMC).Second, in single-step NNMM, the relationships between genotypes and phenotypes can be approximated by nonlinear activation functions of the neural network to introduce non-linearity between genotypes and phenotypes.Lastly, the conventional singlestep approach requires individuals to be genotyped using the same single nucleotide polymorphism (SNP) panel (i.e., same markers for all genotyped individuals), while single-step NNMM can include individuals genotyped by different SNP panels (i.e., different markers for genotyped individuals) without pre-analysis.
In this paper, we present single-step NNMM for genomic evaluation, study its performance, and compare it to the conventional single-step approach [1][2][3].Here, we focus on studying the effect of fitting pedigree, genotypes, and phenotypes jointly as three unified fullyconnected sequential layers, in which gene contents of non-genotyped individuals are sampled conditional on all three layers of data.The same assumptions of linearity and of individuals being genotyped using the same SNP panel, as in the conventional single-step approach, were used in singe-step NNMM (i.e., a linear activation function, and individuals genotyped with the same SNP panel).

Methods
In single-step NNMM, three sequential layers of information, i.e., pedigree, genotypes and phenotypes, form a unified neural network (instead of two separate steps) as demonstrated in Fig. 1.Mixed models were used to infer unknowns, including missing gene content of nongenotyped individuals and marker effects.In detail, at each iteration of the MCMC, unknowns will be sampled using Gibbs sampling from their full conditional posterior distributions at three levels: (1) from the input layer (pedigree) to the middle layer (gene contents): pedigreebased best linear unbiased prediction (PBLUP); (2) from the middle layer (gene contents) to the output layer (phenotypes): genomic BLUP (GBLUP) or Bayesian Alphabet; and (3) sampling missing values in the middle layer (genotypes for non-genotyped individuals) based on three layers of information, including pedigree, observed genotypes of genotyped individuals, and phenotypes.

From input layer (pedigree) to middle layer (gene contents): Pedigree-based BLUP
Assuming there are m markers (i.e., m nodes in the middle layer), for the jth marker, the observed gene content (i.e., genotypes) of genotyped individuals can be modeled as: where z g,j is a vector of observed gene contents (i.e., gen- otypes coded as 0/1/2) of marker j for genotyped individuals, and µ j is its overall mean with a flat prior; u j is the vector of gene content deviations (i.e., centered genotypes) for individuals in the pedigree with a prior u j ∼ MVN (0, Aσ 2 u j ) , where the covariance matrix is the numerator relationship matrix of individuals in the (1) z g,j = 1µ j + Wu j + ε j , Fig. 1 Framework of single-step NNMM with three fully-connected sequential layers of data: pedigree, genotypes, and phenotypes.Between the layer of pedigree and the layer of genotypes, the gene content of each marker is treated as a quantitative trait, and the pedigree is used to define the random effects covariance matrix.Each node in the middle layer represents the gene content of one marker."NA" denotes missing values.For example, the nodes in the middle layer may be 2,2,0,1,0 for a genotyped individual or all missing ("NA") for a non-genotyped individual.For non-genotyped individuals, all gene contents are missing and will be sampled conditional on pedigree, genotypes, and phenotypes in MCMC pedigree ( A ), scaled by variance component σ 2 u j ; and W is the incidence matrix associating u j with z g,j .The vector of random residuals, ε j , is included to allow the use of mixed model equations, and to account for genotype or pedigree errors [4].The prior of ε j is ε j ∼ N (0, Iσ 2 ǫ j ) .In principle, the heritability of gene content of each SNP should be 1 if the genotypes and pedigree information are perfectly correct and, thus, a small value of the estimated heritability indicates that there are errors in either genotypes or pedigree.Variance components are treated as unknowns in single-step NNMM, and scaled inverse chi-square distributions are assigned as prior distributions for variance components.

From middle layer (gene contents) to output layer (phenotypes): GBLUP or Bayesian Alphabet
The phenotypes can be modeled as: where y is the vector of phenotypes, µ is the overall mean with a flat prior, z j is a vector of (observed and sampled) gene contents for the jth marker ( j = 1, . . ., m ), and α j is the corresponding marker effect.Priors from GBLUP [7][8][9] or the Bayesian Alphabet [10][11][12][13][14][15][16][17][18], such as BayesCπ , can be used for sampling marker effects or breeding values.The vector e represents the residuals of phenotypes, with prior e ∼ N (0, Iσ 2 e ) .The prior distribution of σ 2 e itself follows a scaled inverse chi-square distribution.

Sampling missing values in the middle layer (gene contents)
Here we label the matrices related to non-genotyped and genotyped individuals with subscripts "n" and "g", respectively.For the jth marker, the full conditional posterior distribution of the missing gene content is proportional to the product of its prior and the likelihood: where ELSE includes µ , α j , σ 2 e , µ j , σ 2 u j , and σ 2 ǫ j for j = 1, . . ., m , denoting the current values of all other unknowns except When a nonlinear relationship is assumed between the middle layer (gene contents) and the output layer (phenotypes), Hamiltonian Monte Carlo (HMC) [19] may be employed for sampling missing genotypes.Note that if (2) a linear relationship is assumed, missing genotypes can be sampled directly from a normal distribution at each iteration.

Data analysis
Assuming linear relationships between genotypes (middle layer) and phenotypes (output layer), and that the same SNP panel is used for all genotyped individuals in the conventional single-step approach, we applied the same assumptions in the single-step NNMM (i.e., a linear activation function and individuals genotyped with the same SNP panel) to compare the prediction performance of these two methods.Thus, GBLUP was employed between the middle layer (gene contents) and the output layer (phenotypes) in the single-step NNMM (i.e., SS-NN-GBLUP), and its performance was compared to the conventional single-step GBLUP approach (i.e., SS-GBLUP).
The pig dataset from [20] was used, which includes 3534 genotyped individuals, and a pedigree of 6473 individuals including parents and grandparents of the genotyped animals.Estimates of heritability of gene content for each marker were close to 1 [21].In our analysis, we used 10,000 randomly-selected SNPs as the genotype data.A random sample of 0.5%, i.e. 50, of these markers was selected as quantitative trait loci (QTL), and they were included in the genotypes.Phenotypes were simulated with a heritability of 0.7 and a phenotypic variance of 1.The 100 youngest individuals, whose genotypes were observed but phenotypes were unknown, were used for testing, while the remaining individuals (i.e., 3434 individuals) with known phenotypes were used for training.
To compare the single-step NNMM with the conventional single-step method in this study, different proportions of non-genotyped individuals in the training dataset were considered, including 30, 50, 70, and 90%, and there were 10 replicates for each scenario.For each replicate, individuals were randomly selected to be non-genotyped individuals.The prediction accuracy was calculated as the Pearson correlation between the true breeding values and the estimated breeding values for individuals in the testing dataset.In single-step NNMM, at least 2000 MCMC iterations were applied to ensure convergence.
In single-step NNMM, the heritability ( h 2 ) of gene content in Eq. 1 can be considered as known to be 1 or unknown.When the heritability is considered known, a value close to 1 (i.e., h 2 = 0.999 ) is used to facilitate the use of mixed model equations.In single-step NNMM, two strategies were used to sample missing genotypes of non-genotyped individuals, i.e., missing genotypes were sampled conditionally on or unconditionally on phenotypes.
Unlike the conventional single-step approach, which requires individuals to be genotyped using the same SNP panel (i.e., identical markers for all genotyped individuals), the single-step NNMM can accommodate individuals genotyped with different SNP panels (i.e., varying markers for genotyped individuals).Thus, we also tested scenarios where SNP sets differed among individuals.

Results
In single-step NNMM (SS-NN-GBLUP, i.e., single-step NNMM with GBLUP between middle and output layer), when the heritability ( h 2 ) in Eq. 1 is considered unknown, the estimated heritability for each SNP was very close to 1.0, and similar results for SS-NN-GBLUP were observed regardless of whether the heritability ( h 2 ) in Eq. 1 was assumed known (i.e., h 2 = 1 ) or unknown.Thus, only the results obtained with SS-NN-GBLUP with h 2 = 1 in Eq. 1 are presented.We compared the results of the conventional single-step method (SS-GBLUP, i.e., conventional single-step GBULP) and single-step NNMM (SS-NN-GBLUP) when missing genotypes of non-genotyped individuals were sampled conditionally on phenotypes, as described in Eq. 3. The results, presented in Table 1, demonstrate the prediction accuracy when various proportions of phenotyped individuals were genotyped.In general, the prediction accuracy of both methods decreased as the proportion of non-genotyped individuals increased.Overall, the SS-NN-GBLUP displayed a similar prediction accuracy to the SS-GBLUP approach, with no significant differences observed (pairwise t-test at a significance level of p < 0.01).The correlation between estimated marker effects from these two methods was high.Both the conventional single-step approach and single-step NNMM had significantly higher prediction accuracies compared to GBLUP using genotyped individuals only.The running time of SS-NN-GBLUP was less than 2 h using 20 central processing units (CPUs), while the conventional SS-GBLUP only took a few minutes (see "Discussion").In addition, similar results were observed for SS-NN-GBLUP whether missing genotypes were sampled conditional on or unconditional on phenotypes.
We also tested scenarios where SNP sets differed among individuals, and the prediction accuracies aligned with our expectations.For example, when we randomly introduced 50% missing values in the genotype covariate matrix of the training dataset, the prediction accuracy was 0.767, with a standard deviation of 0.011 across 10 replications.This result is reasonable when compared to our previous findings.Note that when all individuals were genotyped, the prediction accuracy of GBLUP was 0.849.

Discussion
In this paper, we propose a new method named singlestep NNMM, which presents a novel framework for single-step methods by treating gene content (i.e., genotypes) as a middle layer of data between pedigree and phenotypes.Single-step NNMM represents single-step genomic evaluations as a neural network of three sequential layers: pedigree, genotypes, and phenotypes.Singlestep NNMM is based on linear mixed models, i.e.PBLUP between the input layer (pedigree) and middle layer (gene content) and GBLUP/Bayesian Alphabet between the middle layer and the output layer (phenotype).This approach allows us to benefit from the implementation and optimization of well-studied linear mixed models for genomic prediction.Using the pedigree-based relationship matrix as an input of a neural network is not new.Gianola et al. [22] have shown that PBLUP is equivalent to a single (middle) layer neural network with a linear activation function, when the input is a pedigreebased relationship matrix.However, single-step NNMM extends conventional mixed models to a neural network with heterogeneous input data across multiple layers (more than two, i.e., pedigree, genotypes, phenotypes), whereas conventional mixed models or neural networks only consider two layers of data (input and output layers).
Compared to the conventional single-step method, the three sequential layers of information in single-step NNMM form a unified network, rather than two separate steps.Thus, the unobserved gene contents of non-genotyped individuals can be sampled based on information from all three layers: pedigree, observed genotypes of genotyped individuals, and phenotypes.Single-step NNMM offers a highly flexible framework for single-step methods, which allows nonlinear relationships between gene contents and phenotypes, as well as the genotyping of different individuals using distinct SNP panels (i.e., various patterns of missing genotypes).The single-step NNMM has been implemented in the software package "JWAS" [23,24].
In our comparison, the same assumptions of linearity and identical SNP panels, as in conventional single-step approach, were used in singe-step NNMM.Overall,

Table 1 Comparison of prediction performances between conventional single-step (SS-GBLUP) and single-step NNMM (SS-NN-GBLUP)
The average prediction accuracies from 10 replications with the standard deviation in brackets when some individuals were not genotyped, singlestep NNMM had similar prediction accuracy as the conventional single-step approach, and the correlation between estimated marker effects from these two methods was high.Both conventional single-step approach and single-step NNMM had significantly higher prediction accuracies compared to GBLUP using genotyped individuals only.
As we have described, in addition to allowing nonlinearity and individuals being genotyped with different SNP panels, a difference between single-step NNMM and the conventional single-step approach is in genotype imputation.Besides genotypes and pedigree, phenotypic information can also be used in the sampling of missing genotypes for non-genotyped individuals in single-step NNMM.However, similar prediction accuracies were observed regardless of whether missing genotypes were sampled conditional on or unconditional on phenotypes in SS-NN-GBLUP.For polygenic traits, this observation may be attributed to at least two reasons.First, a single SNP contributes only a small proportion of heritability and the correlation between the gene content of one SNP and phenotypes is generally low.As a result, incorporating phenotypic information into genotype imputation may introduce more noise than useful information.Second, phenotypes aid only in the imputation of causal variants, and variants in high linkage disequilibrium with causal variants.However, phenotypic information is employed in the imputation of all SNPs and can potentially introduce errors in marker imputation.However, when genotypes of relatives provide limited information (e.g., most individuals are not genotyped), the additional benefits in genotype imputation by including phenotypic information may not be negligible.
To enhance the applicability of our method to more realistic datasets, we implemented parallel computing using Message Passing Interface (MPI) [25], taking advantage of multiple computer processors' capabilities.Ideally, with a sufficient number of computer processors, the computation time from the input layer (pedigree) to the middle layer (gene content) would be equal to the time required for one PBLUP, which should be relatively fast.The speed improvement from parallel computing is limited, however, by the hardware used.In our analysis of the pig dataset (i.e., 6473 individuals in the pedigree, 10,000 SNPs, and 3534 individuals with genotypes), running 2000 MCMC iterations on this dataset using 20 central processing units (CPUs) took less than 2 h for single-step NNMM, while the conventional single-step approach only took a few minutes.In future research, we plan to explore the use of graphics processing units (GPUs), which are commonly employed in neural networks, and more advanced parallel computing strategies (e.g., [26,27]).

Conventional single-step approach as linear imputations
In a conventional single-step approach, a pre-analysis processing is used to impute the gene contents of non-genotyped individuals from the genotypes of genotyped individuals through pedigree relationships.In detail, the centered gene content of each marker is treated as a normallydistributed quantitative trait as z n,j z g,j ∼ MVN (0, A2p j q j ) , where p j is the allele frequency and q j = 1 − p j .Under Hardy-Weinberg equilibrium, the variance of marker is 2p j q j .Thus, the distribution of z n,j conditional on z g,j is a multivariate normal distribution: This can be written as: where ẑn,j = A ng A −1 gg z g,j is the imputed genotypes for non-genotyped individuals, and r n,j is the imputation uncertainty with var(r n,j ) = (A nn − A ng A −1 gg A gn )2p j q j .Extending the above equation to multiple markers, we have Ẑn = A ng A −1 gg Z g .After the pre-analysis processing, both imputed and observed genotypes will be used for genomic evaluation.An additional random effect will be used to account for the uncertainty from the genotype imputation.In detail, the phenotypes are written as: where µ is the overall mean, Ẑn is a matrix of imputed genotypes for non-genotyped individuals, Z g is a matrix of observed genotypes for genotyped individuals, α is a vector of marker effects, and e is the vector of random residuals with e ∼ MVN (0, Iσ 2 e ) .r * = m j=1 r n,j α j is a random vector to account for the sum of weighted imputation uncertainty across all markers with variance: (4) z n,j |z g,j ∼ N (A ng A −1 gg z g,j , (A nn − A ng A −1 gg A gn )2p j q j ).
The conventional single-step GBLUP approach has been extended for Bayesian regression models to accommodate more flexible assumptions of the marker effects [3].It has been shown that the single-step Bayesian regression model with a normal prior for marker effects (7) var(r * ) = m j=1 var(r n,j α j ) is equivalent to the single-step GBLUP approach in terms of predicting genetic values.

Sampling missing genotypes
Assuming a linear relationship between the middle layer (gene contents) and the output layer (phenotypes), the unobserved genotypes of non-genotyped individuals can be sampled from a normal distribution.Here we label the matrices related to non-genotyped and genotyped individuals with subscripts "n" and "g", respectively.For the jth marker ( j = 1, . . ., m ), let n n denote the number of non-genotyped individuals.The full conditional posterior distribution of unobserved genotypes for these non-genotyped individuals ( z n,j ) is as follows: (8)  From the full conditional posterior distribution of z n,j , it is evident that elements within z n,j can be indepen- dently sampled from a univariate normal distribution.However, the full conditional posterior distribution of z n,j is dependent on all other markers z n,j ′ through the term c n,j = y n − 1µ − j ′ � =j z n,j ′ α j ′ .Consequently, missing genotypes for all markers cannot be sampled simultaneously.

Sampling missing genotypes (unconditional on phenotypes)
Our results have indicated that phenotypic information provides limited improvement for genotype imputation.From a computational perspective, when sampling missing genotypes conditional on phenotypes, the full conditional posterior distribution of z n,j depends on all other markers z n,j ′ through the term y n − 1µ − j ′ � =j z n,j ′ α j ′ .This dependency makes the imputation of each marker dependent, which needs the use of approximations to enable parallel computing.
Therefore, below we present the sampling of missing gene content without conditioning on phenotypes.In this scenario, the single-step NNMM is identical to conventional single-step methods.The full conditional posterior distribution of z n,j is as follows: where d n,j = 1µ j + u n,j .
The full conditional posterior distribution of z n,j is independent of z n,j ′ , allowing for simultaneous sampling of missing genotypes (i.e., z n,1 , z n,2 , . . ., z n,m ).