 Research article
 Open Access
 Published:
Locally epistatic models for genomewide prediction and association by importance sampling
Genetics Selection Evolutionvolume 49, Article number: 74 (2017)
Abstract
Background
In statistical genetics, an important task involves building predictive models of the genotype–phenotype relationship to attribute a proportion of the total phenotypic variance to the variation in genotypes. Many models have been proposed to incorporate additive genetic effects into prediction or association models. Currently, there is a scarcity of models that can adequately account for gene by gene or other forms of genetic interactions, and there is an increased interest in using marker annotations in genomewide prediction and association analyses. In this paper, we discuss a hybrid modeling method which combines parametric mixed modeling and nonparametric rule ensembles.
Results
This approach gives us a flexible class of models that can be used to capture additive, locally epistatic genetic effects, genebybackground interactions and allows us to incorporate one or more annotations into the genomic selection or association models. We use benchmark datasets that cover a range of organisms and traits in addition to simulated datasets to illustrate the strengths of this approach.
Conclusions
In this paper, we describe a new strategy for incorporating genetic interactions into genomic prediction and association models. This strategy results in accurate models, with sometimes significantly higher accuracies than that of a standard additive model.
Background
The genetic basis and evolutionary causes of quantitative variation were first proposed at the end of the nineteenth century [1, 2]. The statistical tools developed (correlation and regression) were the foundation of biometry science. Considerable efforts were made to identify the genetic architecture of traits by mapping quantitative trait loci (QTL) in humans, animals and plants [3,4,5]. Quantitative genetic theory focuses on finding the underlying genetic variation in genes by applying the classical infinitesimal (polygenic) model [2, 6].
In the infinitesimal model, the genetic values of individuals are assumed to be generated by an infinite number of unlinked and nonepistatic genes, each with an independent infinitesimal effect. In this model, quantitative genetics focuses on the additive effects of individual alleles. The rate of change of a trait and the genotypic variance depend primarily on additive effects, hence interaction terms are often neglected. However, while for many QTL, thousands of studies have been carried out, few examples that have successfully exploited mapped QTL have been reported in the literature [7]. Indeed, although genomewide association studies (GWAS) have discovered hundreds of single nucleotide polymorphism (SNPs) significantly associated with complex traits [8,9,10,11,12], they have explained only a small proportion of the estimated genetic variation, a term coined “missing heritability” [13]. Using SNP data to detect loci with a large effect by associating common phenotypes with common genotypes, provides a way to capture the infinitesimal effects. In addition, the genomewide predictive models, which are mainly used in genomic selection of animals or plants showed that the results from models that assume additive infinitesimal effects can be accurate and informative [14]. The infinitesimal model has very powerful simplifying statistical properties and avoids the need to specify individual gene effects [15]. Association studies that involve complex interactions between loci are complicated by the large number of effects that need to be tested simultaneously. Most GWAS studies and prediction models focus on estimating the effects of each marker and lower level interactions [16]. For a dataset of m markers, a genomewide analysis with two loci will involve evaluating a number of possibilities of the order of \(m^2\) and m can easily exceed millions. Because of the consequent multiple testing burden, the methods used to identify and model epistasis lack statistical power, and they are computationally exhaustive or even unfeasible. Several factors make it difficult to estimate the true numbers and the effects of loci that influence a QTL. The detection of epistasis is a key factor for explaining the “missing heritability” [17]. The form and strength of epistasis that we expect for QTL depend crucially on the specific details of gene action. Gene interactions are important because (1) they cause the additive effects of alleles to change as the genetic composition of the population changes [18] and (2) they might also slow down selection response, because alleles might only become favorable as the genetic background changes during selection [19]. A common approach to identify interactions is to test SNPs with the most significant additive effects. This approach can be problematic since the absence of additive effects might be generated by interacting loci. In this article, we propose a new hybrid (machine learning + mixed models) approach that (1) results in a flexible class of models that can be used to capture additive, locally epistatic genetic effects, and gene \(\times\) background interactions, and (2) allows to incorporate one or more annotations into the genomic prediction and association models. Thus, the main aim of this study was to measure and incorporate additive and local epistatic genetic contributions to complex traits.
Methods
Materials
The genetic material used in this study to compare our novel prediction and association models to the standard linear genomic BLUP (GBLUP) model consists of four different datasets on maize, rice, wheat and mouse (Table 1). The maize dataset which was used in previous studies [22, 23] was downloaded from panzea.org. The rice dataset can be downloaded from www.ricediversity.org and was used in [24,25,26]. The wheat dataset was downloaded from the triticale toolbox dataset www.triticaletoolbox.org and the mouse dataset, published in [27], was accessed from the “synbreeddata” package [28] available in R [29]. To determine if the locally epistatic rules (LER) model can be used to locate interacting loci and to compare its results to those from a standard additive mixed modeling approach, we devised the following experiment. We simulated 1000 independent SNPs that were coded 0, 1 and 2 for 2000 individuals. Five genetic effects \(g_i, i=1,\ldots , 5\) at five loci were generated according to the formulas in Table 2. Each effect was standardized to have a variance of 1 across the genotypes and the total genotypic value of a genotype was calculated as the sum of these effects. Each of these quantitative trait loci involved three closely located SNPs. Effect \(g_1\) was completely additive, while the other effects contained SNP by SNP interactions, SNP by background interactions or both. The formula for each of these effects are in Table 2. The individuals were evenly assigned to one of the two sexes at random, which in turn was reflected in the genetic values as a fixed difference of 5 units. The final phenotypes for the individuals were obtained by adding independent and identically distributed, zero centered normal random variables to genetic values to obtain a broadsense heritability of 2/3. In this simulation study, we have partitioned the genome into 10 segments for the LER model. The whole experiment was replicated 100 times to obtain the results in Table 3.
Methods
In statistical genetics, an important task involves building predictive and association models of the genotype–phenotype relationship to attribute a proportion of the total phenotypic variance to the variation in genotypes. There are numerous statistical models used in genomic prediction and association (see Fig. 1). An evaluation of these methods for predicting quantitative traits is in [30]. Many of the models used for genomic prediction and association are additive. These include ridge regressionbest linear unbiased prediction (rrBLUP) [31, 32], Lasso [33], Bayesian–Lasso [34], Bayesian ridge regression, Bayesian alphabet [35, 36]), GBLUP. Some methods for capturing genomewide epistasis include the reproducing kernel Hilbert spaces regression (RKHS) approach [37, 38] and related support vector machine regression or partitioning based on random forest [39]. These models can be used to predict genetic values but do not provide satisfactory information about the genetic architecture of traits. An alternative approach when studying epistasis is to consider only local epistasis [40], i.e., only epistatic interactions between closely located loci. It is reasonable to assume that only the epistatic effects that arise from alleles in gametic disequilibrium, between closely located loci can contribute to longterm response since recombination disrupts allelic combinations that have specific epistatic effects. In a recent article [40], we proposed a modeling approach that uses RKHSbased approaches to extract locally epistatic effects, which we referred to as the locally epistatic kernels (LEK) model. It was shown in [40] that LEK models could be used to improve prediction accuracies and provide useful information about genetic architecture.
Importance sampling learning ensembles and rule ensembles
The LER models introduced here uses the importance sampling learning ensembles (ISLE) [41] based rule extraction procedures for genomic prediction and GWAS. We described and illustrated the use of ISLEbased approaches with genomic data in [42]. Nevertheless, for completeness, we include a description of ISLEbased ensemble model generation procedure in this section. Given a learning task and a relevant dataset, we can generate an ensemble of models from a predetermined model family; an ensemble of models is a single model that combines this ensemble of models. The main discovery is that ensemble models are much more accurate than the individual models that make them up when the individual members of the ensemble are accurate (low bias) and diverse (high variance). Ensemble models are shown to perform extremely well in a variety of scenarios and to have desirable statistical properties. Members of the ensemble are generated by fitting models from the chosen family to perturbed data. For instance, bagging [43] bootstraps the training dataset and produces a model for each bootstrap sample. Random forest [39, 44] creates models by randomly selecting a subset of observations and / or explanatory variables while generating each model. Boosting is a biasreduction technique, AdaBoost iteratively builds models by varying case weights and using the weighted sum of the estimates of the sequence of models. There have been attempts to unify these ensemble learning methods. One such framework is the ISLE. Let us assume that we want to generate an ensemble of models for predicting the continuous outcome variable y from vector p of input variables \(\varvec{x}\) from a model family \(\mathcal {F}=\{f(\varvec{x}, {{\varvec{\theta }}}): {{\varvec{\theta }}}\in \ominus\}\) indexed by the parameter \({{\varvec{\theta }}}.\) The final ensemble model produced by the ISLE framework has an additive form:
where \(\{f(\varvec{x}, {{\varvec{\theta }}}_j)\}_{j=1}^{\textit{M}}\) are base learners selected from \(\mathcal {F}.\) A twostep approach is used to produce \(F(\varvec{x})\). The first step involves sampling the space of possible models to obtain \(\{\widehat{{{\varvec{\theta }}}}_j\}_{j=1}^{\textit{M}}\). The second step involves combining the base learners by choosing weights \(\{w_j\}_{j=0}^{\textit{M}}\) in Eq. (1). The pseudo code to produce \(\textit{M}\) models \(\{f(\varvec{x}, \widehat{{{\varvec{\theta }}}}_j)\}_{j=1}^{\textit{M}}\) under the ISLE framework is given in Algorithm 1:
Here, L(., .) is a loss function; \(S_j(\eta )\) is a subset of the indices \(\{1,2,\ldots , n\}\) chosen by a sampling scheme \(\eta ,\) and \(0\le \nu \le 1\) is a memory parameter. The classic ensemble methods of bagging, random forest, AdaBoost, and gradient boosting are special cases of the ISLE ensemble model generation procedure [45]. In bagging and random forests, the weights in Eq. (1) are set to predetermined values, i.e. \(w_0=0\) and \(w_j=\frac{1}{\textit{M}}\) for \(j=1,2,\ldots ,\textit{M}.\) Boosting calculates these weights in a sequential fashion at each step by having positive memory \(\nu ,\) estimating \(c_j\) and takes \(F_\textit{M}(\varvec{x})\) as the final prediction model. Friedman and Popescu [41] recommend learning the weights \(\{w_j\}_{j=0}^{\textit{M}}\) using Lasso [33]. Let \(\mathbf T ={\left( T_j(\varvec{x}_i) \right) _{i=1}^n}_{j=1}^{\textit{M}}\) be the \(n\times \textit{M}\) matrix of predictions for the n observations by the \(\textit{M}\) models in an ensemble. The weights \((w_0,\varvec{w}=\{w_m\}_{m=0}^{\textit{M}})\) are obtained from:
\(\lambda >0\) is the shrinkage operator, larger values of \(\lambda\) decrease the number of models included in the final prediction model. The final ensemble model is given by:
The ISLE approach produces a generalized additive model (GAM) [46]. A few other postprocessing approaches such as partial least squares regression, multivariate kernel smoothing, and weighting, as well as the use of rules in semisupervised and unsupervised learning, are described in [42]. There is no restriction on the choice of the family of base learners, \(\mathcal {F},\) in the ISLE procedure. The most popular choice for the base learners is the class of regression and classification trees. Treebased methods have the advantage of being virtually assumption free, they are simple to fit and interpret. They can capture interactions and handle missing values by using surrogate splits. In addition, trees can naturally handle all types of input variables, i.e., numeric, binary, categorical. They are invariant under monotone transformations and scaling of the variables. Trees have a high variance on these data due to the correlation in the predictors. An ensemble of tree models succeeds in smoothing out this variance and hence reduces test error.
A tree with K terminal nodes defines a K partition of the input space where the membership to a specific node, say node k, can be determined by applying the conjunctive rule \(r_k(\varvec{x})=\prod\nolimits_{l=1}^{p}I(x_l\in s_{lk}),\) where I(.) is the indicator function, \(\varvec{x}=(x_1,x_2,\ldots , x_p)\) are the input variables. The regions \(s_{lk}\) are intervals for continuous variables and a subset of the possible values for a categorical variables. The easiest way to create an ensemble of rules is to extract it from an ensemble of decision trees. In a tree, each path from the root node to a leaf defines a rule. An example of a regression tree and the corresponding rules extracted from this tree are displayed in Fig. 2. The complexity of trees or rules (the degree of interactions between the input variables) in the ensemble increases as the number of nodes increases from the root to the final node (depth). Individual trees can be pruned using a cost complexity criterion. For example, a popular cost complexity criterion [48] that balances the residual sum of squares and the complexity of the tree can be written as:
where T is a regression tree, \(cp \ge 0\) the complexity parameter/regularization parameter, \(nodes(\widehat{T})\) denotes the number of nodes of tree \(\widehat{y},\) and RSS(T) is the residual sum of squares of the tree. In addition, the parameters minbucket, minsplit, and maxdepth constrain the solution to a minimum number of observations in each terminal node, a minimum number of observations in each internal node, and a maximum treedepth. There are numerous options for building tree models: these include iterative dichotomiser 3 (ID3) [49], C4.5 [50], classification and regression trees (CART) [48], etc. In this paper, the model \(f(\mathbf X )\) in Eq. (6) at each iteration of the EM algorithm was extracted by the CART approach using the R package rpart [51]. Suppose an ensemble of tree models was generated by the ISLE algorithm in Algorithm 1 and let \(\mathbf{R}={\left( r_k(\varvec{x}_i) \right) _{i=1}^n}_{k=1}^K\) be the \(n\times K\) matrix of rules derived from this ensemble of trees. The rulefit algorithm of Friedman and Popescu [52] uses the weights \((w_0,\varvec{w}=\{ {w_k}\}_{k=0}^{K})\) that are estimated from:
in the final prediction model:
Locally epistatic models via rules (LER)
When applying the ISLE approach to genomic data special considerations need to be taken into account because of the dependencies among features such as the arrangement (localization, spacing and number) of SNPs on chromosomes, the linkage between SNPs, or annotations that put certain SNPs in the same groups. Joint consideration of linkage and epistasis is a necessary step for the models that incorporate the interactions for more than one locus. Complex systems that use evolutionary mechanisms such as selection proportional to fitness, recombination and mutation tend to generate short adapted and specialized structures the number of which will increase exponentially in successive generations. For instance, the scheme theorem of Holland [53] can be stated as:
where N(H, t) is the frequency of a haplotype (schema) H at time t, and c the relative fitness of H compared to all other haplotypes in the population. It can be argued that \(Pr(\text {H is lost due to recombination})\) is an increasing function of the linkage length of H and \(Pr(\text {H is lost to mutation})\) is an increasing function of the order of H, i.e., number of loci in H that affect its fitness. The epistatic effects involving unlinked loci have a high probability of being lost due to recombination and will not contribute to the subsequent response. This argument forms the basis of the “building blocks” hypothesis in the evolutionary theory. Because of these considerations that are unique to genomic data, we propose a modification of the ISLE algorithm so that the interactions among genes are restricted to local genomic regions. Locally epistatic rulebased model fitting starts with a definition of genomic regions; suppose we defined k such regions. Region definition is followed by the extraction of local rules from each genomic region \(j=1,2,\ldots , k.\) using the ISLE algorithm. The rules are extracted from trees that predict the estimated genetic value from SNPs in the region. Since the rules are independently generated for each region, this step can be computationally accomplished in parallel without loading all the genetic data to computer memory. The values of the rules from all regions are calculated for the n training individuals, they are standardized with respect to their sample standard deviation and combined in a matrix \(n\times r\) matrix \(\mathbf R .\) The model behind the proposed mixed effects regression tree method is:
\(\mathbf{g}\sim N(\varvec{0},\sigma ^{2}_{g} \mathbf G ),\) \(\mathbf{e}\sim N(\varvec{0},\sigma ^2\mathbf I _{m}),\) where all quantities are defined as in a classical linear mixed effects model except that a more general and unspecified fixed part, \(f(\mathbf X ),\) which describes the vector obtained by applying the function f to each row of \(\mathbf X ,\) now replaces the usual linear part \(\mathbf X {{\varvec{\beta }}}\) which will be estimated with a single tree. The random part, \(\mathbf{Z }\mathbf{g},\) is still assumed linear with a covariance structure given by \(\sigma ^{2}_{g} \mathbf G ,\) where \(\mathbf G\) is an additive genetic similarity matrix. Given \(\mathbf M ,\) the marker allele frequency centered incidence matrix, the matrix \(\mathbf G\) can be calculated as \(\mathbf G = \mathbf M {} \mathbf M '/k\) where k is the sum of the variances of the SNPs [54]. The MLbased EMalgorithm to fit this model is described in [55] and [56]. As with the mixed model association methods, the aim of including a random term that accounts for the genetic structural effects is to correct for confounding between the familial effects and the effects of the loci. The second step in locally epistatic rule based model fitting is the postprocessing step where we obtain a final prediction model using the extracted rules as input variables. The mixed models (MM) methodology has a special place in quantitative genetics because it provides a formal way of partitioning the variability observed in traits into heritable and environmental components, it is also useful to control for population structure and relatedness for GWAS. In a mixed model, genetic information in the form of a pedigree or SNP data can be used in the form of an additive genetic similarity matrix that describes the similarity based on additive genetic effects (GBLUP). For the \(n\times 1\) response vector \(\mathbf{y},\) the GBLUP model can be expressed as:
where \(\mathbf X\) is the \(n\times p\) design matrix for the fixed effects, \({{{\varvec{\beta }}}}\) is a \(p\times 1\) vector of fixed effects, \(\mathbf Z\) is the \(n\times q\) design matrix for the random effects; the vectors of random effects \(\mathbf{g}\) and \(\mathbf{e}\) are assumed to be independent multivariate normal (MVN) random variables with means \(0\) and corresponding covariances \(\sigma ^{2}_{g} \mathbf G\) and \(\sigma ^{2}_{e} \mathbf I _{n}\) where \(\mathbf G\) is the \(q\times q\) additive genetic similarity matrix. It is known that model in Eq. (7) is equivalent to a MM in which the additive marker effects are estimated via the following model (rrBLUP):
where \(\mathbf X\) is the \(n\times p\) design matrix for the fixed effects, \({{\varvec{\beta }}}\) is a \(p\times 1\) vector of fixed effect coefficients, \(\mathbf Z\) is the \(n\times q\) design matrix for the random effects \(\mathbf M\) is \(q\times m\) marker allele frequency centered incidence matrix; \(\mathbf{u}\) and \(\mathbf{e}\) are assumed to be independent MVN random variables with means \(0\) and corresponding covariances \(\sigma ^{2}_{u} \mathbf{I}_m\) and \(\sigma ^{2}_{e} \mathbf I _{n}.\) The conversion between the predicted genotypic values \(\widehat{\varvec{g}}\) in Eq. (7) and the predictions for marker effects \(\widehat{\varvec{u}}\) in Eq. (8) are given by:
In this article, we use the rrBLUP model for postprocessing the rules:
where \(\mathbf Z\) is a \(n\times q\) design matrix for the random effects, \(\mathbf R\) is a \(q\times r\) design matrix for the centered and scaled rules, and \((\varvec{\alpha }',\varvec{e}')'\) follows a MVN distribution with mean \(0\) and covariance ,
Note that each rule is a function of the SNPs. Using estimated coefficients, \(\widehat{\varvec{\alpha }},\) we calculate the estimated genotypic value for an individual with SNPs \(\varvec{m}\) as \(\widehat{R(\varvec{m})}\widehat{\varvec{\alpha }}\) where \(R(\varvec{m})=(R_1(\varvec{m}),R_2(\varvec{m}),\ldots , R_r(\varvec{m})).\)
Importance and interaction measures
In addition to having a good prediction performance, a good model should also provide a description of the relationship between the input variables and the response. The rules and the estimated coefficients of the LER model can be used extract several importance and interaction measures. Let \(I(m_\ell \in R_j)\) denote the indicator function for the inclusion of SNP \(\mathbf M _\ell\) in rule \(R_j.\)

Since \(\mathbf R\) has standardized columns, \(\widehat{\varvec{\alpha }}\) can be used as importance scores for the rules in the model.

A measure of importance for a SNP \(\ell\) is obtained by \(I_j=\sum\nolimits_{j=1}^{r}\widehat{\alpha _j} I(m_\ell \in R_j).\)

A measure of the interaction strength between two SNPs \(\ell\) and \(\ell '\) is obtained by: \(I_{\ell \ell '}=\sum\nolimits_{j=1}^{r}\widehat{\alpha _j} I(m_\ell \in R_j)I(m_{\ell '}\in R_j).\)

A measure of the interaction strength between SNPs \(\ell _1, \ell _2,\ldots , \ell _l\) is given by \(I_{\ell _1 \ell _2\ldots ,\ell _l}=\sum\nolimits_{j=1}^{r}\widehat{\alpha _j} \prod\nolimits _{k=\ell _1}^{l}I(m_{\ell _k}\in R_j).\)

Importance of a region: Sum of the rule or marker importances within a region.
The variable importance and interaction measures are in line with the stability selection methods [57, 58]. With stability selection, the data are perturbed (for example by subsampling) many times and the structures or variables that occur in a large fraction of the resulting selection sets are deemed important.
“Tuning” the LER algorithm
While fitting the model in Eq. (1), we need to decide on the values of a number of arguments (hyperparameters) to control the fitting behavior. Hyperparameter settings can have a strong impact on the prediction accuracy of the trained model. Optimal hyperparameter settings often differ for different datasets. Therefore, they should be tuned for each dataset. Since the model training process does not set the hyperparameters, a meta process for tuning the hyperparameters is needed. Conceptually, hyperparameter tuning is an optimization task, just like model training. The hyperparameters in LER models may be selected by comparing the crossvalidated accuracies within the training dataset for several reasonable choices. For each proposed hyperparameter setting, the inner model training process comes up with a model for the dataset and outputs evaluation results on holdout or crossvalidation datasets. After evaluating a number of hyperparameter settings by a method like grid or random search, the hyperparameter tuner settings that yield the best performing model are used. The choice of the hyperparameter for the LER models should also reflect the available resources and the needs. For instance, the number of regions that we can define depends on the number of SNPs and on the resolution that the dataset allows, and a more detailed analysis might only be suitable when the number of SNPs and the number of genotypes in the training dataset are large. The LER methodology provides the user with a range of models with different levels of detail, sparsity, and interactions. The depth of a rule is a hyperparameter of the LER models since it controls the degree of interaction. A term involving the interaction of a set of variables can only enter the model if there is a rule that splits the input space based on those variables. One way to control the amount of interactions is to grow the trees to a certain depth. We can call this parameter the “maxdepth” parameter. In this article, we allowed different rules to enter the model by setting the “maxdepth” of each tree independently to a random variable generated from a truncated Poisson distribution that turned the parameter into a continuous one which controls the “mean depth” of rules. This allows a diverse set of rules with different depths. The “mean depth” parameter controls the distribution of the complexity of the rules that comprise the ensemble. A choice can be based on the apriori suspicions about the nature of the target. We also note that the number of rules in a tree is increased by the order of the square of the “mean depth” parameter. The effect of increasing this parameter is a decrease of the “mean depth” of the trees and “mean depth” and the number of rules extracted from each tree. The trees and the associated rules can be pruned during extraction with heuristics such as complexity cost pruning, or reduced error pruning [59]. The hyperparameters “proprow” and “propcol” control the number of sampled rows and columns from the full data for training an individual tree. Precision of the resulting trees and rules increases whereas their accuracy decreases as either of these parameters decrease. Improving the prediction accuracy of a tree and the precision of its splits is a balancing act; in general, we should aim at having enough examples and a manageable number of SNPs for each run of the tree extraction. “nrules” is a related hyperparameter that controls the number of rules to be extracted from each genomic region. In order to use all of the training data and to have reliable importance statistics, each genotype and each marker should be sampled several times during the extraction of rules. The detailed steps taken during the model fitting process are provided in an algorithmic form in Algorithm 2:
For the maize dataset, we used two settings to split the SNPs into contiguous and nonintersecting regions. In the first setting, each chromosome was split into 10 segments by dividing the chromosome into blocks with approximately the same number of SNPs. In the second setting, we used maize recombination hotspots [60] to split each maize chromosome into 40 segments. The rules were extracted using the SNP in each region along with the first three principal components (PC) of the genomewide SNPs. The rice, wheat, and mouse datasets were treated similarly. The details of the settings of the LER algorithm for each dataset are in Table 1. To show that the model is robust over reasonable choices of the hyperparameter values, we included the results for several hyperparameter settings in Additional file 1. Additional file 2 provides sample code for replicating the simulated experiment.
Results
Figure 3 shows the accuracies obtained with the LER and GBLUP models for each of the 30 replicates for each trait in all datasets. The red colored data points show the cases for which the LER models performed better than the GBLUP models. The number of times that each model performed better than the other is shown on the top left side for each dataset. The performance for each trait can be evaluated also from the same figure using the legends. In general, the LER models performed better than GBLUP for all datasets and particularly well for traits with a complex genetic architecture, i.e., generated by a large number of genes with small effects, e.g. in the maize dataset, for growing degree days to anthesis (GDD_DTA) and to silking (GDD_DTA), anthesissilking interval and plant height, yield components and in the mouse dataset for body weight. The results in Fig. 3 correspond to the hyperparameter settings provided in Table 1. In Additional file 1: Figures S1 and S2, we provided the accuracies obtained for several other hyperparameters settings. These results show that the hyperparameter settings have a strong influence on the performance of LER models. Nevertheless, the gains in accuracies across the traits mentioned above are persistent for a wide range of hyperparameter values. Figures 4 and 5 and Table 4 show the results from the simulation experiment that was described in the Methods section and in Table 4. In Fig. 4, we present one example of the association results for the ordinary GWAS approach using a mixed model and the importance scores obtained from the LER model. In this figure, the vertical blue lines show the simulated QTL. A comparison indicates that the LER model can identify QTL that are missed in the additive GWAS (Fig. 4). Figure 5 also displays one example of the importance and interaction statistics for the first three PC and 27 SNPs that are deemed important by the importance statistic. The main effects of the SNPs are shown on the diagonal and the offdiagonal shows twoway interactions. The darker the colors, the more important are the effects. According to Table 3, this figure shows that the LER model captured the simulated additive and interaction effects. For example, ×8, ×11 and ×14 SNPs have additive simulated effects (Table 3) and they do not interact. The second genetic effects 208, 211 and 214 interact with the background (PC1) but have additive effects. In both cases, the LER model captured the simulated effects. In addition, for 100 independent replications of the same simulation experiment, Table 3 provides the counts of the number of times each of the 15 SNPs that generate a genetic value appears in the top 20 SNPs selected by LER versus by additive GWAS. The results also showed that LER is superior in identifying QTL, especially when interaction effects are involved (Additional file 2).
Discussion
This paper is an extension of a previous paper [40], in which we had defined a RKHS approach to capture epistasis between closelylinked markers (local epistasis) and we weighted the local estimates using an elastic net algorithm. Here, we propose a different approach that uses ensemble rules with the aim of capturing more complex interactions between predefined subsets of SNPs. First, the procedure extracts the locally epistatic effects using an ISLE ensemble rule and, then they are included into a standard random regression BLUP approach. The procedure is applied to simulated data and to four different datasets with satisfactory results when compared with GWAS or GBLUP. The focus of this article is on building locally epistatic models using rule ensembles. However, LER model building is a general methodology that includes three stages:

1.
Divide the genome into regions.

2.
Extract locally epistatic effects: Use the training data to obtain a model to estimate the locally epistatic effects.

3.
Process locally epistatic effects by combining them using an additive model.
At each of these model building stages, the researcher needs to make a number of decisions. For example, in all of our implementations of the LER models, we have used nonoverlapping contiguous regions. Nevertheless, the regions used in locally epistatic models can be overlapping or hierarchical. If some SNPs are associated with each other in terms of linkage or function, as for example through a known biochemical pathway, it might be useful to combine them together. For instance, the whole genome can be divided physically into chromosomes, chromosome arms or linkage groups. Further divisions could be based on recombination hotspots or just merely based on local proximity. We can also group SNPs based on their effects on intermediate traits such as lipids, metabolites, or gene expression. With the development of nextgeneration sequencing and genotyping approaches, large haplotype datasets are becoming available in many species. These haplotype frameworks provide substantial statistical power in association studies of common genetic variation across each region. The locally epistatic framework can be used to take advantage of annotations of the variants relative to the genes they are in or their predicted impact on protein function. It is possible to build LER models where each SNP defines a region by its neighborhood. This definition would give overlapping regions. We can supplement the rules with the SNP scores, and then the model fitting procedure will provide the appropriate coefficients for the rules and the linear terms. After extracting rules from a region, a variable selection procedure can be applied to pick the most relevant rules from that region. A regression of the response variable on the set of rules from a region using the elasticnet loss function allows us to control the number of rules selected as relevant for that region. In particular, the elasticnet algorithm uses a loss function that is a weighted version of lasso and ridgeregression penalties. If all the weight is put on the ridgeregression penalty, no selection will be applied to the input variables. On the other extreme, if all the weight is put on the lasso penalty this will give maximal sparsity. We have treated this parameter as a hyperparameter. The remaining parameters of the elasticnet regression were selected using crossvalidation. In some cases, a very large rule ensemble is required to obtain a competitive discrimination between signal and background and to obtain reliable importance statistics. When the number of rules extracted from the data is too large to handle then the relationship in Eq. (9) can be used to obtain the rule effects. If environmental covariates are observed along with the trait values then it is possible to include these variables with the SNPs in each region while extracting the rules. This will allow environmental main effects + genebyenvironment interaction terms to enter the model. Variables that measure background genetic variability related to the structure of the population can be incorporated into the model in the same way. In the examples below, we used the first three principal components of the marker matrix along with the marker matrix to account for the genomewide structural effects and the gene interactions. As mentioned previously, the best settings for the model, as determined by the best generalization performance, can be estimated via crossvalidation or other model selection criteria for each model fitting instance. These settings, in turn, might be indicative of the trait architecture. For example, increasing the “mean depth” parameter in the wheat dataset to allow higher order interactions deteriorated the model performance and this can be taken as an indication that for this dataset genetic effects are additive or interactions are of low order. Whereas, for the rice dataset, the best settings for the model have relatively high “mean depths”, possibly indicating that in addition to additive effects, there are high levels of genebygene, and genebybackground interactions in this dataset. We also presented accuracy results for some other settings of the hyperparameters of the LER algorithm in Additional file 1. The results of the simulated association experiment show that the importance and the interaction scores can be used to identify interesting loci. The comparisons with the standard additive mixed model GWAS showed that the LER methodology was superior: it detected loci that were not detected by the mixed model and at the same time provided a measure of the interactions between different types of input variables. We were able to recover most genebygene and genebybackground interactions with the LER model. We also described how this methodology can be used to study other forms of interaction. Finally, we highlighted some other strengths that are specific to the LER models:

The method can incorporate SNP annotations if they are used to partition predictors into “regions”.

Importance scores for regions, SNPs, and rules are available as a model output.

The need to impute the SNP data is reduced: the model is robust to missing observations in the dataset.

Markerbymarker interactions and even higher order interactions can be captured and interaction statistics are also available as a model output.

The model can be used to capture genebygenetic background or genebyenvironment interactions.
Conclusions
In this paper, we analyzed four real datasets over many traits, and also provided results from a simulation study. For most traits, accuracy gains using the LER model were consistent regardless of the hyperparameter values, e.g. several traits for the rice dataset, body weight for the mouse dataset, and days to anthesis for the maize dataset. We hypothesize that the LER model outperforms the additive model when the trait architecture involves local epistasis and gene–background interactions. For instance, the rice dataset has more family structure than the wheat dataset and it is reasonable to expect more genebygenetic background interactions in the former case. This could explain the differences in accuracy results between the additive GBLUP model and the LER model. We describe a new strategy for incorporating genetic interactions into genomic prediction and association models. This strategy results in accurate models, sometimes doubling the accuracies that can be obtained by a standard additive model.
References
 1.
Provine WB. The origins of theoretical population genetics: with a new afterword. Chicago: University of Chicago Press; 2001.
 2.
Fisher RA. The correlation between relatives on the supposition of mendelian inheritance. Tran R Soc Edinb. 1918;52:399–433.
 3.
Mackay TF. The genetic architecture of quantitative traits. Ann Rev Genet. 2001;35:303–39.
 4.
Holland JB. Genetic architecture of complex traits in plants. Curr Opin Plant Biol. 2007;10:156–61.
 5.
Flint J, Mackay TF. Genetic architecture of quantitative traits in mice, flies, and humans. Genome Res. 2009;19:723–33.
 6.
Barton NH, Turelli M. Evolutionary quantitative genetics: how little do we know? Annu Rev Genet. 1989;23:337–70.
 7.
Bernardo R. Molecular markers and selection for complex traits in plants: learning from the last 20 years. Crop Sci. 2008;48:1649–64.
 8.
Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, et al. Potential etiologic and functional implications of genomewide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009;106:9362–7.
 9.
Donnelly P. Progress and challenges in genomewide association studies in humans. Nature. 2008;456:728–31.
 10.
Bush WS, Moore JH. Genomewide association studies. PLoS Comput Biol. 2012;8:e1002822.
 11.
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, et al. Genomewide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008;9:356–69.
 12.
MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new NHGRIEBI Catalog of published genomewide association studies (GWAS Catalog). Nucleic Acids Res. 2016;45(D):896–901.
 13.
Maher B. The case of the missing heritability. Nature. 2008;456:18–21.
 14.
Cloney R. Complex traits: integrating gene variation and expression to understand complex traits. Nature Rev Genet. 2016;17:194.
 15.
Fisher RA. The genetical theory of natural selection: a complete. variorum ed. Oxford: Oxford University Press; 1930.
 16.
Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet. 2010;86:6–22.
 17.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–53.
 18.
Routman EJ, Cheverud JM. Gene effects on a quantitative trait: twolocus epistatic effects measured at microsatellite markers and at estimated QTL. Evolution. 1997;51:1654–62.
 19.
Kondrashov AS. Deleterious mutations and the evolution of sexual reproduction. Nature. 1988;336:435–40.
 20.
Anderson VL, Kempthorne O. A model for the study of quantitative inheritance. Genetics. 1954;39:883.
 21.
Kempthorne O. The correlation between relatives in a random mating population. Proc R Soc Lond B Biol Sci. 1954;143:103–13.
 22.
Peiffer JA, Romay MC, Gore MA, FlintGarcia SA, Zhang Z, Millard MJ, et al. The genetic architecture of maize height. Genetics. 2014;196:1337–56.
 23.
Glaubitz JC, Casstevens TM, Lu F, Harriman J, Elshire RJ, Sun Q, et al. TASSELGBS: a high capacity genotyping by sequencing analysis pipeline. PLoS One. 2014;9:e90346.
 24.
Isidro J, Jannink JL, Akdemir D, Poland J, Heslot N, Sorrells ME. Training set optimization under population structure in genomic selection. Theor Appl Genet. 2015;128:145–58.
 25.
Spindel J, Begum H, Akdemir D, Virk P, Collard B, Redoña E, et al. Genomic selection and association mapping in rice (Oryza sativa): effect of trait genetic architecture, training population composition, marker number and statistical model on accuracy of rice genomic selection in elite, tropical rice breeding lines. PLoS Genet. 2015;11:e1004982.
 26.
Begum H, Spindel JE, Lalusin A, Borromeo T, Gregorio G, Hernandez J, et al. Genomewide association mapping for yield and other agronomic traits in an elite breeding population of tropical rice (Oryza sativa). PLoS One. 2015;10:e0119873.
 27.
Valdar W, Solberg LC, Gauguier D, Cookson WO, Rawlins JNP, Mott R, et al. Genetic and environmental effects on complex traits in mice. Genetics. 2006;174:95984.
 28.
Wimmer V, Albrecht T, Auinger HJ, Wimmer MV. Package synbreedData; 2015.
 29.
Team RC. R: A language and environment for statistical computing. R Foundation for Statistical Computing. Austria: Vienna; 2013. p. 2014.
 30.
Heslot N, Yang HP, Sorrells ME, Jannink JL. Genomic selection in plant breeding: a comparison of models. Crop Sci. 2012;52:146–60.
 31.
Whittaker JC, Thompson R, Denham MC. Markerassisted selection using ridge regression. Genet Res. 2000;75:249–52.
 32.
Meuwissen T, Hayes B, Goddard ME. Prediction of total genetic value using genomewide dense marker maps. Genetics. 2001;157:1819–29.
 33.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Stat Methodol. 1996;58:267–88.
 34.
Park T, Casella G. The bayesian lasso. J Am Stat Assoc. 2008;103:681–6.
 35.
Gianola D, de los Campos G, Hill WG, Manfredi E, Fernando R. Additive genetic variability and the Bayesian alphabet. Genetics. 2009;183:34763.
 36.
Sorensen D, Gianola D. Likelihood, Bayesian, and MCMC methods in quantitative genetics. New York: Springer; 2007.
 37.
Gianola D, Van Kaam JB. Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics. 2008;178:2289–303.
 38.
De Los Campos G, Gianola D, Rosa G. Reproducing kernel Hilbert spaces regression: a general framework for genetic evaluation. J Anim Sci. 2009;87:1883–7.
 39.
Breiman L. Random forests. Mach Learn. 2001;45:5–32.
 40.
Akdemir D, Jannink JL. Locally epistatic genomic relationship matrices for genomic association and prediction. Genetics. 2015;199(3):857–71.
 41.
Friedman JH, Popescu BE. Importance sampled learning ensembles. J Mach Learn Res. 2003;9:4305.
 42.
Akdemir D, Jannink JL. Ensemble learning with trees and rules: supervised, semisupervised, unsupervised. Intell Data Anal. 2014;18(5):857–72.
 43.
Breiman L. Bagging predictors. Mach Learn. 1996;24:123–40.
 44.
Ho TK. Random decision forests. In: Proceedings of the third international conference on document analysis and recognition, 1995, 14–16 August 1995; Montreal. IEEE; 1995. p. 278–82
 45.
Seni G, Elder JF. Ensemble methods in data mining: improving accuracy through combining predictions. Synth Lect Data Min Knowl Discov. 2010;2:1–126.
 46.
Hastie T, Tibshirani R. Generalized additive models. Stat Sci. 1986;1:297–310.
 47.
Zhou X, Stephens M. Genomewide efficient mixedmodel analysis for association studies. Nat Genet. 2012;44:821–4.
 48.
Breiman L. Classification and regression trees. London: Chapman and Hall/CRC; 1984.
 49.
Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.
 50.
Quinlan JR. C4. 5: Programs for empirical learning; 1994.
 51.
Therneau T, Atkinson B, Ripley B. rpart: Recursive partitioning and regression trees; 2015. R package version 4.110. https://CRAN.Rproject.org/package=rpart.
 52.
Friedman JH, Popescu BE. Predictive learning via rule ensembles. Ann Appl Stat. 2008;2:916–54.
 53.
Holland JH. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. Ann Arbor: The University of Michigan Press; 1975.
 54.
VanRaden P. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91:4414–23.
 55.
Wu H, Zhang JT. Nonparametric regression methods for longitudinal data analysis: mixedeffects modeling approaches. Hokoben: Wiley; 2006.
 56.
Hajjem A, Bellavance F, Larocque D. Mixed effects regression trees for clustered data. Stat Probab Lett. 2011;81:451–9.
 57.
Meinshausen N, Bühlmann P. Highdimensional graphs and variable selection with the Lasso. Ann Statist. 2006;34:1436–62.
 58.
Meinshausen N, Bühlmann P. Stability selection. J R Stat Soc Ser B. 2010;72:417–73.
 59.
Mingers J. An empirical comparison of pruning methods for decision tree induction. Mach Learn. 1989;4:227–43.
 60.
RodgersMelnick E, Bradbury PJ, Elshire RJ, Glaubitz JC, Acharya CB, Mitchell SE, et al. Recombination in diverse maize is stable, predictable, and associated with genetic load. Proc Nat Acad Sci USA. 2015;112:3823–8.
Authors' contributions
DA derived the statistics, prepared the programs and wrote the manuscript. JLJ and JIS helped write the article, and during the review process. All authors read and approved the final manuscript
Acknowledgements
DA: I want to acknowledge my late father, Mehmet Ali Akdemir, who always stood by me and supported me in my endeavors. His memory will always be beside me.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The data and compute programs used in this manuscript are available from the corresponding author on request.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Funding
This research was supported by the USDANIFAAFRI Triticeae Coordinated Agricultural Project, Award No. 20116800230029.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Additional files
12711_2017_348_MOESM1_ESM.pdf
12711_2017_348_MOESM2_ESM.txt
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI