Genetic analysis of infectious diseases: estimating gene effects for susceptibility and infectivity

Background Genetic selection of livestock against infectious diseases can complement existing interventions to control infectious diseases. Most genetic approaches that aim at reducing disease prevalence assume that individual disease status (infected/not-infected) is solely a function of its susceptibility to a particular pathogen. However, individual infectivity also affects the risk and prevalence of an infection in a population. Variation in susceptibility and infectivity between hosts affects transmission of an infection in the population, which is usually measured by the value of the basic reproduction ratio R0. R0 is an important epidemiological parameter that determines the risk and prevalence of infectious diseases. An individual’s breeding value for R0 is a function of its genes that influence both susceptibility and infectivity. Thus, to estimate the effects of genes on R0, we need to estimate the effects of genes on individual susceptibility and infectivity. To that end, we developed a generalized linear model (GLM) to estimate relative effects of genes for susceptibility and infectivity. A simulation was performed to investigate bias and precision of the estimates, the effect of R0, the size of the effects of genes for susceptibility and infectivity, and relatedness among group mates on bias and precision. We considered two bi-allelic loci that affect, respectively, the individuals’ susceptibility only and individuals’ infectivity only. Results A GLM with complementary log–log link function can be used to estimate the relative effects of genes on the individual’s susceptibility and infectivity. The model was developed from an equation that describes the probability of an individual to become infected as a function of its own susceptibility genotype and infectivity genotypes of all its infected group mates. Results show that bias is smaller when R0 ranges approximately from 1.8 to 3.1 and relatedness among group mates is higher. With larger effects, both absolute and relative standard deviations become clearly smaller, but the relative bias remains the same. Conclusions We developed a GLM to estimate the relative effect of genes that affect individual susceptibility and infectivity. This model can be used in genome-wide association studies that aim at identifying genes that influence the prevalence of infectious diseases.


Background
New and existing infectious diseases represent a major and increasing threat to domestic plants and animals, and to humans. Infectious diseases of animals are a worldwide concern, particularly because of their effects on the productivity and welfare of livestock and also because of their zoonotic threats to human health. In spite of the availability of antibiotic and vaccine treatments, the undesirable environmental impact of antibiotic treatments, the rapid evolution of bacteria to develop resistance to antibiotics and of viruses to escape vaccine protection illustrate the need for additional control strategies that can provide a useful complement to the currently used interventions to control disease [1].
Host susceptibility and tolerance are two of the ways that individuals respond to pathogens. Several studies on the genetics of diseases in animals have shown that the host's susceptibility and tolerance to infectious diseases have a genetic basis, and thus that genotypic differences exist between individuals regarding their susceptibility and tolerance to infectious challenges [2]. A number of genome-wide association studies (GWAS) have reported single nucleotide polymorphisms (SNPs) associated with susceptibility to various infectious diseases [3,4].
Most genetic approaches that aim at reducing the prevalence of an infection assume that an individual's disease status (infected/not-infected) is solely a function of its own genes and of non-genetic factors [2]. Hence, these methods capture only the genetic variation in susceptibility or tolerance (strictly, this latter statement is restricted to the measurement of disease occurrence in groups of unrelated individuals [5]). However, the prevalence and dynamics of an infection depend also on the infectivity of infected individuals in the population. Moreover, accumulating evidence on the existence of "superspreaders" in the outbreaks of epidemics suggests that (phenotypic) variation in infectivity exists among hosts [6]. Thus, the classical quantitative genetic approach of disease analysis based on individual disease status will capture only part of the heritable variation that is present in the host population and affects the dynamics of infectious diseases [7].
Between-host variation in susceptibility and infectivity affects the transmission of an infection in the population. This effect is measured by the value of the basic reproduction ratio R 0 . R 0 is defined as the average number of secondary cases produced by one typical infectious individual during its entire infectious lifetime, in an otherwise naïve population [8]. R 0 has a threshold value of 1, which implies that a major disease outbreak or a stable endemic equilibrium can only occur when R 0 is greater than 1. When R 0 is less than 1, the epidemic will die out. Thus, in order to reduce disease incidence and therewith prevalence, breeding strategies should aim at reducing R 0 , preferably to a value less than 1.
Genetic improvement that aims at reducing R 0 should be based on individual breeding values for R 0 . An individual's breeding value for R 0 is the sum of the average effects of its alleles on R 0 [5], which means that investigating the effects of genes on R 0 is relevant. Anche et al. [5] showed that an individual's breeding value for R 0 is a function of its genotype for susceptibility and infectivity, and of the population's average susceptibility and infectivity. Thus, in order to estimate effects of genes on R 0 , the susceptibility and infectivity effects of the different alleles must be estimated.
Disease data are often available only in binary form (0/1) i.e. the value indicates whether an individual has become infected or not. Hence, methods for genetic analyses of disease traits have to be tailored to such data. Generalized linear models (GLM) are commonly used to analyse binary data, where the expected value of the binary response variable is linked to the explanatory variables (traits) by a linear equation after applying a link function [9]. Velthuis et al. [9] showed that the effect of susceptibility and infectivity of hosts on the transmission rate parameter β can be estimated by fitting a GLM with a complementary log-log link function to binary disease data. Lipschutz-Powell et al. [10] showed that a GLM with a complementary log-log link function can be used to link the probability of an individual to be infected to the susceptibility genotype of the individual itself and the infectivity genotypes of its infectious contacts. However, they observed that the infectivity component of the model was non-linear, and did not provide an explicit GLM or investigate the quality of estimates resulting from such a GLM.
In this study, we developed a GLM to estimate the relative effects of genes on individual susceptibility and infectivity, and investigated the quality of the resulting estimates in terms of bias and precision. We also investigated the effect of R 0 , different sizes of the effects of susceptibility and infectivity genes and population structure with respect to relatedness on bias and precision of the estimates. The GLM was fitted to binary disease data (0/1) recorded at the end of the epidemic. Thus, the data analysed were counts of infected individuals of different genotypes. These data were obtained from a simulated genetically heterogeneous population in which individuals differed in susceptibility and infectivity.

Population structure
We assumed a diploid population with between-host genetic heterogeneity in susceptibility and infectivity. We modelled genetic heterogeneity in this population using two bi-allelic loci, one locus for the susceptibility effect (γ ) with alleles G and g and susceptibility values γ G and γ g , and one locus for the infectivity effect (ϕ) with alleles F and f and infectivity values ϕ F and ϕ f , respectively. Both loci were assumed to have multiplicative allelic effects and the reason for this assumption is explained in the section "Generalized linear models".

Epidemiological model of disease dynamics
Disease dynamics that are caused by a microparasitic infection can be modelled with a basic compartmental stochastic susceptible, infected and recovered (SIR) model. In this model, two possible events can occur: infection of a susceptible individual, and recovery of an infectious individual [11]. With stochasticity, these events occur randomly at a certain rate (probability per unit of time) specified by the model parameters and the state variables. In the SIR-model, these parameters are the transmission rate parameter (β) for S → I with rate β SI N , and the recovery rate parameter (α) for I → R with rate αI, where N denotes population size, S the number of susceptible individuals and I the number of infectious individuals (in this study, we assumed that an individual will be infectious once it is infected, thus the terms infectious and infected will be used interchangeably; hence, the symbols S, I and R are used to denote both the disease status and the number of individuals with that disease status). The transmission rate parameter β describes the probability per unit of time for one infected individual to infect any other individual in a totally susceptible population [8,12] (this can be seen from the transmission rate dS/dt = −βSI/N, for I = 1 and S = N).
In the following, we will consider binary data at the end of an epidemic, which indicates for each individual whether it has become infected or not. Thus, binomial count data were available to quantify the occurrence of infected individuals according to genotype. As a step towards the GLM, first we derive the probability of an individual to become infected.
In a genetically heterogeneous population, the transmission rate parameter β varies between pairs of individuals, and in addition to the contact rate (c), it will depend on the infectivity genotype of the infectious individual, and on the susceptibility genotype of the recipient susceptible individual. The assumption that the transmission rate depends only on the infectivity of the infectious individual and the susceptibility of the recipient individual, and not on the combination of these two traits, is known as separable mixing [8]. In other words, the two individuals that are in contact influence the transmission rate independently. Thus, the transmission rate of a specific susceptible individual with susceptibility genotype i from being susceptible to being infected when exposed to a single infectious individual with infectivity genotype j can be defined as: where γ i denotes the susceptibility of the susceptible individual, and ϕ j denotes the infectivity of the infectious individual. Note that the transmission rate in Eq. (1) refers to a single specific susceptible individual, whereas the transmission rate parameter β defined above, refers to any susceptible individual among the N candidates. Hence, they differ by a factor of N. In Eq. (1), c represents the average contact rate between any pair of individuals and thus c/N is the average contact rate of a susceptible with a single infectious individual in a group of size N (this assumes faecal-oral transmission or similar routes, where 1/N of the infectious material ends up with the sender itself ). Any variation in contact rate among different types of susceptible and infectious individuals is included in γ i and ϕ j because of the assumption of separable mixing. When one susceptible individual with susceptibility genotype i is exposed to one infectious individual with infectivity genotype j, the expected number of transmissions is the product of the transmission rate and the average length of the infectious period, and is equal to γ i ϕ j c 1 N 1 α , where 1/α is the average length of the infectious period. The probability P ij that the individual escapes infection follows from the zero term of the Poisson distribution, and is equal to: Here, it is assumed that the transmission rate parameter β (and thus also γ, ϕ, and c/α) is constant over time so that there is no over-dispersion and the Poisson distribution can be used.
At the end of the epidemic, the individual with susceptibility genotype i has been exposed not to only one but to all infectious group mates (strictly speaking this is true for the individuals escaping infection only). These group mates can be categorized by their infectivity genotype, j. Let I j denote the number of infected individuals with infectivity genotype j that have become infected during the epidemic and have infectivity ϕ j . Then the probability P i that the individual escapes all infection exposures by individuals of infectivity genotype j and still be susceptible by the end of the epidemic is equal to: Thus, the probability P i that the individual with susceptibility genotype i escapes all infection exposures from all genotypes and still be susceptible by the end of an epidemic is equal to the product of all the probabilities that it escapes infection exposures from its infectious group mates of each genotype: where the summation is over the n infectivity genotypes; n = 3 for a single bi-allelic locus in a diploid population.
In Eq. (3), we can replace I j by I × f j , where I is the total number of individuals that have been infected at the end of the epidemic and f j is the fraction of infected individuals of genotype j. This yields: From Eq. (4), the probability that a susceptible individual with susceptibility genotype i has been infected by the end of the epidemic is equal to: Thus, the probability that a susceptible individual has been infected depends on its own susceptibility, γ i , and on the arithmetic mean infectiousness n j=1 f j ϕ j of its I infectious group mates with different infectivity values ϕ j , with j = 1, … n.
In [13], equation 10, which is equivalent to our equation (5), was presented as the final size equation for a population that is heterogeneous for susceptibility and infectivity (in epidemiology, the so-called final size equation gives the fraction of infected individuals of each type by the end of an epidemic). Our equations 5 and 14 in [10] follow a similar derivation but, in our case, the equation is applied to the end of the epidemic.

Generalized linear model (GLM)
A GLM, in its simplest form, specifies a linear relationship between a function of the mean of the observed variable y, and a set of observed predictor variables, x: where φ is the so-called link function, c 0 is the intercept and the c i are the regression coefficients for the explanatory variables x i , for i = 1, … n. The aim is to estimate c i coefficients.
For binomial data where the probability of failure (to escape an infection) P is equal to the zero term of a Poisson distribution, as in the above Eq. (4), the complementary log-log link function is the default link function to connect explanatory variables x i with the observed variable y of the linear model [14]. Applying the complementary log-log link function to 1 − P i based on Eq. (4), yields: Thus, the dependent variables have now become the fraction of each i type of individual that did become infected (see below).
The model in Eq. (6) is linear in log of susceptibility (γ i ) but not for infectivity (ϕ j ), since the logarithm of a sum does not equal the sum of the logarithms, as also observed by [11]. In Eq. (6), the term n j=1 f j ϕ j can be recognized as the arithmetic mean, since n j=1 f j = 1. In order to further linearize Eq. (6), the arithmetic mean was approximated by a geometric mean, using the substitution n j=1 f j ϕ j ≈ n j=1 ϕ f j j . This yields: The approximation of the arithmetic mean regression by a geometric mean regression was investigated separately as explained in the "Appendix". In Eq. (7), the expectation of the response variable, cloglog(1 − P i ) is a linear expression of log(γ i ) and log(ϕ j ).
Equation (7) is linear in the log of susceptibility (γ i ) and the log of infectivity (ϕ j ). To be able to formulate the model in terms of allele counts within individuals, rather than in terms of individual genotypes, it was assumed that the two alleles that make up the genotype within an individual act multiplicatively, so that their effects are additive on the log-scale.
Therefore, the genotypic values will be Furthermore, the effects of the g and f alleles were set to a value of 1, γ g = 1 = ϕ f = 1, so that log(γ g ) = log(ϕ f ) = 0. This is done without loss of generality, because the interest lies in the relative effect of one allele to the other, that is the effect of γ G relative to γ g and the effect of ϕ F relative to ϕ f [note that this does not affect the estimates of relative allele effects since the absolute scale of the model is accounted for by the log(c/ α)-term]. Using Eq. (7), the GLM for the diploid genetic model becomes: where individuals are aggregated by their genotype, i. The cloglog is applied to the expectation of y i n i , which is the fraction of infected individuals of genotype i, by the end of the epidemic and y i follows a binomial distribution, c 0 is the intercept measuring log(c/α), and c 1 is the regression coefficient for the index G , where index G,i = 0, 1 or 2 is the number of G alleles at the susceptibility locus of individuals of genotype i. The c 2 is the regression coefficient for Num F , which is the average of the number of F-alleles per individual at the infectivity locus in the infected group mates of the individuals of genotype i. It is calculated as 2 × Frac FF + 1 × Frac fF /Ff where Frac FF is the fraction of infected individuals with genotype "FF " and Frac fF /Ff is the fraction of infected individuals with genotype "fF" or "Ff". The "2" arises because individuals with the "FF " genotype carry two F alleles, while those with the "fF" or "Ff" genotype carry only one F allele. The log I N corresponds to the total fraction of infected individuals in the group, which is used as an offset in the GLM. Hence, estimates of c 1 and c 2 refer to the effect of a single allele, and represent the so-called average effect of an allele substitution on the log-scale [15]. When fitting the model to binomial count data of those individuals of each genotype that are infected and estimating c 0 , c 1 and c 2 , the effects of alleles G and F relative to γ g = ϕ f = 1 can be calculated as γ G = e c 1 and ϕ F = e c 2 , respectively.

Simulation
To investigate the bias and precision of the γ G and ϕ F , one generation of a diploid population was simulated based on the above assumptions with respect to the effect of alleles at both loci. These two loci were the only genetic effects simulated. Furthermore, it was assumed that allele frequencies at both loci were equal to 0.5, that is, p g = p f = 0.5. The population was sub-divided into 100 groups of 100 individuals each. Each group was set up in such a way that group mates showed a certain genetic relatedness, r, at both loci. Here, relatedness is defined as the correlation of allele counts between group mates, irrespective of what causes the correlation. To limit the number of scenarios to be tested, relatedness at the susceptibility locus, r γ , and at the infectivity locus, r ϕ , were assumed to be the same (note that relatedness at both loci is expected to be the same when the loci are not under selection). In order to have a certain degree of relatedness among group mates, a fraction of fully related individuals was added to each group, supplemented by randomly selected individuals. Since each individual carries both the susceptibility and the infectivity locus, these additions were done jointly (see Appendix 4 in [5] for a detailed description of the strategy to make these additions jointly).
A basic stochastic SIR-model as described above was used to simulate the disease dynamics [12]. In each group, the epidemic began by one randomly infected individual. Then, the next event which could be either infection of a susceptible individual or recovery of infected individual was determined using Gillespie's direct algorithm [16]. The type of event, i.e. either infection or recovery, was decided by drawing a random number v 1 , from a uniform distribution, v 1 ~ U(0,1). The next event was an infection of a susceptible individual if the random , otherwise it was recovery of an infected individual. The numerator of this ratio represents the total infection rate, and the denominator the total rate, i.e., the sum of the infection and recovery rates. The sampling of the specific individual that became infected depended on individual susceptibility. The probability that a susceptible individual of genotype i became infected was proportional to . Hence, the transmission rates were updated based on the numbers of susceptible and infected individuals of each genotype, while the transmission rate parameter β ij remained constant. The epidemic ended when there was no more infectious individual in the population or when there was no susceptible individual left to be infected. By the end of the epidemic, the number of individuals that got infected together with their genotypes for susceptibility and infectivity were recorded. The fraction of individuals of each genotype that got infected was the dependent variable in the analysis.
We hypothesised that different epidemiological and genetic factors will affect the quality of the estimates, as measured by the bias and precision of γ G and ϕ F . For that purpose, we simulated different scenarios that are described below. The biases of the estimates were calculated by taking the difference between the 'true' and estimated values and the precision of the estimates were calculated using the standard deviation (SD) of the estimates.
First, we simulated a basic scenario (scenario 1; Table 1), in which groups were created randomly with respect to relatedness among group mates. We calculated R 0 using [5]: where and Population parameters are in Table 1. In the basic scenario, R 0 was set to 1.2.
Second, to investigate the effect of R 0 on the quality of γ G and ϕ F , we simulated scenarios with different values of We varied the contact rate c, so that R 0 for a population consisting of groups with unrelated individuals varied from 0.6 (for which no major outbreaks can occur) to 6.1 (for which major outbreaks can occur; Table 1, scenario 2). Third, to investigate the impact of the size of effects of the genes for susceptibility and infectivity on the quality of γ G and ϕ F , we simulated scenarios with different effect sizes for a constant value of R 0 = 1.2. We simulated all combinations of low, moderate and high values for γ G and ϕ F ( Table 1, scenario 3).
Furthermore, in all of the above-mentioned scenarios, relatedness between group mates was varied between 0 and 1 to investigate the effect of population structure with respect to relatedness on the quality of γ G and ϕ F . Relatedness was assumed to be the same at both loci (see [5] for details). We used R software to fit the model with a glm function and a binomial distribution.

Results
All estimates presented in this section are averages from 2000 replicates, except for Fig. 1 which shows the results of all replicates. The black straight line in all figures represents the true difference between γ g and γ G and between ϕ f and ϕ F , and the bars indicate the standard deviation of these estimates among replicates.
In the basic scenario, in which groups were created randomly with respect to relatedness, r = 0, we found that the susceptibility effect was slightly underestimated (1 − γ G in Fig. 2) but the infectivity effect was considerably overestimated (1 − ϕ F in Fig. 2). When the degree of relatedness among group mates increased, the bias of both estimates decreased, however, the effect of relatedness was more pronounced for infectivity (Fig. 2). The error in ϕ F , that is caused by the geometric mean approximation was quantified and found to be small (Table 3, "Appendix"). Moreover, the standard deviation of the estimated susceptibility effect increased only slightly, whereas the standard deviation of the estimated infectivity effect increased considerably as the degree of relatedness increased.
A scatter plot for (1 − γ G ) and (1 − ϕ F ) of the 2000 replicates for the basic scenario where r = 0 shows that the estimated differences are uniformly distributed over their range without any pattern (Fig. 1). This plot also shows that (1 − ϕ F ) is more often underestimated than overestimated, which agrees with the underestimation in Fig. 2 for r = 0.
In the second set of scenarios, where R 0 was varied from 0.6 to 6.1, susceptibility and infectivity effects were also underestimated. Bias in γ G and ϕ F was smallest for values of R 0 that ranged approximately from 1.8 to 3.1. Higher values of R 0 increased bias in γ G but had little effect on bias in ϕ F when group mates were unrelated (Fig. 3, panel a). Bias in ϕ F and γ G decreased with increasing relatedness among group mates, except for ϕ F at high values of R 0 (Fig. 3, panels b-d). In contrast to the result for the unrelated groups, bias in ϕ F was larger at high values of R 0 when related groups were used (Fig. 3, panel a  vs. b-d). For fully-related groups, i.e. r = 1, estimates for ϕ F and γ G and their standard deviation were nearly identical (Fig. 3, panel d). For this scenario, the error in ϕ F as a result of the geometric mean approximation was also quantified and only a small error was found ("Appendix", Table 4).
For all values of R 0 , standard deviations of estimates were greater for infectivity effect than for susceptibility effect, except for r = 1 for which they were nearly identical. Standard deviations decreased considerably as relatedness among group mates increased, particularly for infectivity effect. For both susceptibility and infectivity effects, standard deviations were smaller for values of R 0 for which the bias in γ G and ϕ F was smallest, i.e. when R 0 ranged approximately from 1.8 to 3.1.
In the third set of scenarios, different sizes of the effects of γ G and ϕ F were simulated. For both estimates, the relative bias did not change regardless of the size of the effect considered (Figs. 4, 5). In these scenarios also, both susceptibility and infectivity effects were underestimated regardless of the size of the effects considered, except when there was a large difference in infectivity effect and r = 1, there was a small overestimation (1 − ϕ F in Fig. 5). Moreover, smaller relative standard deviations were found for both susceptibility and infectivity effects when effect sizes were larger, which indicates that the effects are better estimated when they are larger. For this scenario, the error in ϕ F as a result of the geometric mean approximation was also quantified and only a small error was found (Table 5, "Appendix").

Discussion
In this work, a generalized linear model with a complementary log-log link function was developed to estimate the relative effects of genes on individual susceptibility and infectivity. The model was developed from an equation that describes the probability of an individual to become infected as a function of its own susceptibility genotype and of the infectivity genotypes of its infected group mates. This GLM was developed following Velthuis et al. [9] who developed a GLM for binary data on a transmission trial to estimate the effect of susceptibility and infectivity of hosts on the transmission rate parameter β. A simulation study was performed to investigate the quality of the GLM. From the statistical analysis of the simulated data, we obtained fairly precise estimates, except for some scenarios for which estimates were more biased, particularly for infectivity. The best estimates were found for schemes with intermediate R 0 and related group members. For all the scenarios investigated, the sizes of the effects at both loci were underestimated.
The main objective of this study was to develop a methodology to estimate gene effects and also to investigate its quality in terms of bias and precision of the estimates. To test the methodology without introducing additional assumptions that may contribute to estimation error, we assumed additive allele effects on the log-scale for both susceptibility and infectivity. Thus, allelic effects were simulated multiplicatively on the original scale. This was  done for two reasons. First, we wanted to formulate the model in terms of allele counts within individuals, rather than in terms of individual genotypes. In other words, we did not intend to estimate dominance effects. Whether allele effects are more likely to be additive on the logscale than on the original scale is unknown at present. Second, since the objective of this study was to investigate the quality of the model rather than the assumptions on the genetic architecture, the data were simulated under a model that agreed with the assumptions of the statistical model.
Bias and standard deviation of the estimates were smallest for R 0 that ranged approximately from 1.8 to 3.1. The basic reproduction ratio R 0 is an important factor that affects the size of an epidemic in a population, i.e. the fraction of individuals that are found to be infected by the end of an epidemic. When R 0 is greater than 1 but near 1 in a group, there will be virtually no individuals infected and thus, there is hardly any variation in disease status, which results in inaccurate estimates of gene effects. Conversely, when R 0 is much greater than 1, nearly all individuals will be infected, which again results in very little variation in disease status. ( Table 2 indicates the fraction of infected individual for different values of R 0 and relatedness among group mates). Thus, the relationship between R 0 and the fraction of individuals infected affects the estimation of the effect on susceptibility and infectivity , since data on the final size of an epidemic were used for our estimation. This mechanism may explain why the estimated effect on susceptibility is best for intermediate R 0 . The effect of infectivity is more difficult to estimate and the bias is larger.
For each scenario, more relatedness between individuals resulted in better estimates for both traits. This is because more relatedness creates more variation between groups, which results in groups with below or above average susceptibility and/or infectivity. This occurs because an individual with a lower susceptibility will also have related group mates with below average susceptibility,  and vice versa. The same applies for infectivity. However, since we assumed absence of linkage disequilibrium (LD) between the susceptibility and infectivity loci, groups with below average susceptibility will not always have below average infectivity as well. Thus, only those groups with above average susceptibility and above average infectivity will have epidemics with a greater final size, i.e. the fraction of individuals that gets infected by the end of the epidemic, while those with below average susceptibility and infectivity will have a lower final size. This variation improved estimates of the effects of susceptibility and infectivity. We have made a number of assumptions in building our methodology. In the derivation of Eq. (5), we assumed that all individuals that escaped the infection had been exposed to all infected individuals. Of course, this assumption is true for the simulations done here.
To what extent, this will be true for real data remains to be seen. It seems reasonable to assume that individuals in relatively small and well-defined groups get mixed up over space and time as is often the case in animal husbandry: for example, in fattening pigs with group sizes of 10 to 30 individuals. The assumption is less reasonable for groups with a spatial structure, for example in tie stalls or when epidemics occur within barns subdivided into multiple groups. In such cases, data should be collected separately for different groups. We also assumed that epidemics could be completely recorded, so that the final disease status of all individuals is known, and all individuals that have escaped the infection have been exposed to all infected individuals. However, for reasons of, e.g., animal welfare and productivity, interventions are often carried out to limit the size of an epidemic. Hence, individuals may not have had the full potential to express their susceptibility and infectivity. For incomplete epidemics, the probability that an individual becomes infected follows from Eq. (5) when only the infected individuals to which the focal individual has been exposed are considered (see also [11]). Thus, extension to incompletely observed epidemics is straightforward (see also application in [_ENREF_189] and subsequent papers citing [9]).
Bias and precision of estimates may be improved when data are recorded within shorter time intervals. This may be particularly helpful for cases with high R 0 . In such cases, each interval forms an incompletely observed epidemic, which can be analysed with the same GLM statistical approach [9]. When data are collected in sufficiently short time intervals, only a fraction of individuals will become infected in a single interval, even when R 0 is high. This will contribute to accuracy of the estimates. Moreover, collecting data in short time intervals also provide information on the order of infections, i.e., which animal has infected which animal. This will increase the accuracy of estimated gene effects, particularly for infectivity [17]. Thus, using data from short time intervals can be complementary to using groups composed of related individuals and data from multiple epidemics. The derivation and resulting model for such cases is very similar to the one presented here, since the probability that an individual escapes infection follows from the zero-term of the Poisson distribution (see also [9,11]). The key step is to identify the infectious individuals to which the focal individual has been exposed in a time period.
Lipschutz-Powell et al. [11] showed that, when there is genetic variation in susceptibility only, a complementary log-log link function can be used to link an equation that describes the probability of an individual to become infected to a linear model that includes the individual's genotype for susceptibility. They also suggested that, when there is genetic variation in infectivity, a Taylorseries expansion of the model term for infectivity can be used to further linearize the model in infectivity. In our study, we obtained a linear model for infectivity by approximating the arithmetic mean by a geometric mean. We quantified the error due to this approximation and found only negligible errors in the estimates ("Appendix"). Thus, this approximation can be ruled out as the cause of the observed bias. This suggests that, for cases for which there is variation in infectivity, the geometric mean approximation is suitable to obtain a linear combination of the parameters of interest. A full investigation of the causes of the bias is beyond the scope of this study. However, the fact that a population of finite size, i.e., 100 individuals in each group, was used to estimate gene effects can be one of the reasons for the observed underestimation.
Anche et al. [5] defined breeding value and heritable variation in R 0 . They showed that an individual's breeding value for R 0 is a function of the population's average susceptibility and infectivity, of the gene frequencies within the individual and of average effects of the alleles at both loci (Equation 7c in [5]). However, Anche et al. [5] assumed that effects of alleles at both loci were additive, whereas here we assumed that effects are multiplicative (so that they are additive on the log scale). Multiplicative effects introduce dominance. Hence, before applying the expressions for breeding value and heritable variation of [5] to estimates obtained from the methods proposed here, they need to be translated into average effects of alleles [15]. Using the common notation for the one-locus model [15], the additive effect is half the difference in genotypic value between both homogyzotes, a γ = (γ 2 g − γ 2 G )/2 and a φ = (φ 2 f − φ 2 F )/2, the dominance deviation is the difference between the heterozygote and the average of both homozygotes, d γ = γ g γ G − (γ 2 g + γ 2 G )/2 and d ϕ = ϕ f ϕ F − (ϕ 2 f + ϕ 2 F )/2, and the average effects of alleles are given by α γ = a γ + (p G − p g )d γ and α ϕ = a ϕ + (p F − p f )d ϕ , where p denotes allele frequency [15]. Hence, in Eqs. 7 and 11 of [5], γ g − γ G should be replaced by α γ , and ϕ f − ϕ F should be replaced by α ϕ . For example, for γ g = 1 and γ G = 0.6, genotypic values are γ gg = 1, γ gG = 0.6 and γ GG = 0.36, the additive effect is a γ = (1 − 0.36)/2 = 0.32, the dominance deviation is d γ = 0.6 − (1 + 0.36)/2 = −0.08, and the average effect is α γ = 0.32 − 0.08 (p G − p g ).
In this study, we assumed a model with two bi-allelic loci, i.e. one locus that affects individual susceptibility and one locus that affects individual infectivity. Furthermore, we assumed that which locus affects infectivity and which locus affects susceptibility, are known. This may be the case with candidate gene approaches which include only the genes for which the function is related to the trait of interest. The effect of the putative causative gene is then examined by association study. In such studies, the GLM developed here can be applied to estimate and confirm the effect of the candidate gene on the trait of interest. However, applying a candidate gene approach is limited because it relies on knowing the functional relation between the genes and the trait of interest. The recent advances in molecular genomics allow us to genotype individuals for thousands of SNPs, and to perform GWAS in which all SNPs are examined for their association with the trait of interest. The GLM developed here can also be used in GWAS that aim at identifying genes associated with susceptibility and/or infectivity. In such studies, it is not known whether a SNP affects infectivity and/or susceptibility. Hence, this has to be inferred from the significance of the estimated effects. To avoid the need to test all combinations of two SNPs, one could first screen SNPs for susceptibility effects, and then fit only the significant loci for susceptibility effects, together with all other loci for infectivity effects. Moreover, when modified so that gene effects are estimated as random effects, our model can probably be used for polygenic traits, for example in genomic prediction, for which effects of all genes are estimated simultaneously and the interest lies in predicting the breeding value of entire genotypes [18].

Conclusions
We have developed a generalized linear model to estimate the relative effects of genes on individual susceptibility and infectivity. This model may be used in genome-wide association studies that aim at identifying genes that are involved in the prevalence of infectious diseases.
In this Appendix, we address one issue regarding the quality of the two estimators, which we use to recover the genetic parameters. In general, one would like to have estimators that give consistent estimates of the parameters. This implies that both the variance and the bias of the estimators for a sufficiently large dataset (size n) can be brought arbitrarily close to zero. Expressed in formulas for the relative infectivity and the relative susceptibility, which are the two parameters that we want to estimate, these requirements look like this: In addition, one would like to know how fast the estimators approach these limits. That analysis is presented in the main text and is done by comparing simulations to the true values. There is, however, an issue with the asymptotic unbiasedness of the effect on infectivity (the first equation): the estimator for the effect of the relative infectivity is not unbiased, but instead we will show below that: and we will derive the expression for the function m(.) and will show that it is close to 1 and always smaller or equal to 1. Note that m(.) = 1 means no bias and m(.) < 1 means underestimation of the effect. As explained in the main text, the transmission rate parameter (β) is the product of the contact rate (c), susceptibility (γ) and infectivity (ϕ). Applying the complementary log-log link function results in Log(β) being in the expression for the expected value of the dependent variable. Thus, to see whether a linear relation is obtained between the explanatory variables to explain the expected value of the dependent variable, we can write that: The heterogeneity in Log(γ) is straightforwardly incorporated in the model since each recipient counted in the dependent variable is only one type of susceptible individual. Thus, take γ g = 1 and the other type has γ G , then: where Index G is equal to 1 if the recipient is G and 0 when the recipient is of type g, with additional modification for the three genotypes as explained in the main text. Thus, the estimated parameter is asymptotically unbiased using the GLM method.
For heterogeneity in ϕ, it is not straightforward because we are dealing with the arithmetic mean (ϕ AM ) across all types of infectious individuals in the populations as was derived in the main text. Let us again look at the case with only two types of infectious individuals: where p F is the explanatory variable (p f = 1 − p F ).
In order to obtain linearity in the explanatory variable for infectivity, ϕ AM is replaced by geometric mean ϕ GM with ϕ GM = n j=1 ϕ p j j . The equation with two types of infectious individuals becomes: This is a linear equation in p F , the explanatory variable, because p f = 1 − p F . Now, we calculate the systematic error (bias) made by the approximation of the arithmetic mean (ϕ AM ) by a geometric mean (ϕ GM )). For a bi-allelic genetic model, where there are two alleles, i.e. ϕ F and ϕ f , with a frequency p F and (1 − p F ), respectively, the Log(ϕ AM ) expression for the two alleles can be written as: Thus, the effect of ϕ F compared to ϕ f is measured by the coefficient of p F , i.e., the slope of the linear expression within the logarithm, but this is not a linear model. Note that if the number of data points (n) becomes larger and larger, the expected (average) observed values, i.e. the number of cases (y) among the number of susceptible (S), will after applying the cloglog link function become arbitrary close to the expression A1, or: To obtain a linear model, we take the Log(ϕ GM ) expression for the two alleles which can be written as: Now the effect of the allele ϕ F compared to ϕ f is measured by the ratio of the two values instead of the difference as in expression A1. This ratio can thus be calculated as the antilog of the regression coefficient of p F which is the explanatory variable. In other words, from the GLM in Eq. (8) in this paper, the estimated Log of the ratio of ϕ F over ϕ f is obtained from the regression coefficient c 2 .
Now the next issue that we address in this Appendix is to fit a linear equation for the Log(ϕ) as a function of allele frequency (p F ) which is: If, in fact, the transmission depended on the ϕ GM rather than on ϕ AM , we would have: However, since the data come from a process where the observed ϕ is in fact the ϕ AM , the resulting linear relationship will not (necessarily) have A = log ϕ F ϕ f . As we are interested in the allele effects, we need to estimate the slope of the line, i.e. the regression coefficient (A) of p F , and compare it to log ϕ F ϕ f . To determine A, we need to find the best fitting linear relationship for Log(ϕ LIN ) from Log(ϕ AM ) data (Fig. 6). This was done and we showed that this estimated A is very close to and we were able give an explicit expression for the bias with respect to this true value.

Derivation of the expression for fitted line through Log(ϕ AM )
The following shows how the A and B for the linear model in Eq. (A3) can be obtained when this linear model is fitted to data generated by the non-linear relation between the explanatory variable (p F ) and the observed effect Log(ϕ AM ).
For each value of p F , we observe a corresponding value for Log(ϕ AM ), which gives a nonlinear relationship [Eq. (A1); Fig. 6]. Thus, in order to obtain a linear relationship between the parameter of interest (p F ) and the dependent variable, we fit a line through this nonlinear relationship from which we estimate the effect (in this case, the Log of the effect of ϕ F compared to ϕ f ). To fit a line through the true relationship Log(ϕ AM ), we sample random values for p F from a uniform distribution from 0 to 1 and calculate corresponding values of Log(ϕ AM ) from Eq. (A1). If we draw a least squares regression line through the random numbers drawn from these [p F , Log(ϕ AM )] pairs, the line passes through the average values sampled: p F and Log(ϕ AM ). This allows us to find B, since we know that p F = 1/2, and Log(ϕ AM ) is: Hence, since Log(ϕ AM ) = A · p F + B, and p F = 1/2, B = Log(ϕ AM ) − 1 2 A, and thus Now we have an equation with only one unknown (A) and the solution for A, denoted A min , can be found by taking the least squares optimization. This means that we can find the minimum solution for the squared difference between the Log(ϕ AM ) and Log(ϕ LIN ) derived above.
This integral was evaluated with symbolic computer algebra using Mathematica. This is a straightforward evaluation but, at first, the expressions appear to be very big. Thus, we undertook some simplifications to find the A for which the minimum of the expression is attained. As the part between brackets is a linear expression in A, the result of the above equation is a quadratic equation in A, and thus can be written as: The equation between brackets is for an upward open parabola (if K 2 > 0) and the minimum of this parabola is attained for: And thus Then, both the numerator and denominator of the above equation were divided by ϕ 2 f and this resulted in: Thus, A min is a function of ϕ F ϕ f only. Note the similarity with Eq. (A2) where A = log ϕ F ϕ f . It should be noted that the A min is the estimate (C 2 ) that will be obtained asymptotically (i.e. when n → ∞) from the GLM in Eq. (8). Thus, we investigated the relation of this estimated value to the true expected value Log ϕ F ϕ f , to quantify the bias due to our approach.
Let us assume that ϕ F ϕ f = x, thus the above equation can be simplified as: , hence there is a non-zero bias. However, we can now define m(x) by A min = m(x) · log(x). The value of m quantifies the amount of relative bias that is obtained as a result of the geometric approximation; a value m = 1 indicates a zero bias. Thus: Equation (A4) quantifies the amount of bias, the magnitude of which is numerically investigated below. However, first it is necessary to check Eq. (A4) using some relationships that are known to hold for the underlying problem, for example: m(x) = m 1 x , since it should not matter which allele is coded F or f. lim x→1 m(x) = 1, since the arithmetic and geometric mean are identical when ϕ F = ϕ f . lim x→0 m(x) = 0, and lim x→∞ m(x) = 0, since we always underestimate the effect because 0 ≤ m(x) ≤ 1 and thus it seems that m(x) will have to approach zero when the real effect becomes infinitely large (i.e., either x = 0 or x → ∞). As a result, we will estimate a finite value for the effect even when the effect is infinite and, thus, we make an infinitely large error, i.e. m(x) = 0. All conditions hold as it can be checked using Eq. (A4).
Going back to our paper, we now look at Eq. (8), which is, where c 2 , is the regression coefficient that we estimate. In other words, when applying the geometric mean approximation, we assume Est(log(x)) = c 2 , whereas in fact, Log(x) = c 2 m(x) , when we correct for the geometric mean approximation.
Since we assumed that ϕ f = 1, then Log(x) = Log ϕ F ϕ f = Log(ϕ F ). Thus: where The result from Eq. (A6) quantifies the amount of error that was obtained as a result of the geometric approximation. An m(ϕ F ) = 1 indicates no error, an m(ϕ F ) < 1 indicates underestimation, while an m(ϕ F ) > 1 indicates overestimation of ϕ F . As 0 < m(ϕ F ) < 1, the estimated value is always too small. Hence, the geometric mean approximation is conservative. Furthermore, m(ϕ F ) = m 1 ϕ F for all ϕ F and m(1) = 1, the further ϕ F is away from 1 (the larger effect), the higher the error. Roughly speaking for values of ϕ F between 0.333 and 3, the error is smaller than 5 %; i.e., 0.95 < m < 1. Now that we have quantified the amount of bias [Eqs. (A5) and (A6)], we can obtain the correct value. Note that in Eq. (A5), the (true) value of ϕ F appears on both sides of the equation. Thus, we need an iterative procedure to obtain the real value. First, ϕ F is calculated by taking the cloglogE y i n i = c 0 + c 1 index G,i + c 2 Num F + log I N , exponential of c 2 from the GLM analysis. Then, the error m( ϕ F ) [Eq. (A6)] followed by the new value for in log(ϕ F ) in Eq. (A5) are estimated. ϕ F is then again calculated by taking the exponential of log(ϕ F ). This iteration process is then allowed to continue until there is no change in ϕ F . In the tables below, the biases obtained as a result of the geometric mean approximation are presented for the different scenarios investigated in the main text. This bias is calculated as the difference between ϕ F that is obtained after accounting for the error as a result of geometric mean approximation and ϕ F that is obtained when the error is not accounted for. Note that there is additional bias with respect to the true value which is of course known from the simulations. This bias is also small but larger than asymptotically expected from the GM approximation.