Research Article  Open Access
Gregory Nuel, Alexandra Lefebvre, Olivier Bouaziz, "Computing Individual Risks Based on Family History in Genetic Disease in the Presence of Competing Risks", Computational and Mathematical Methods in Medicine, vol. 2017, Article ID 9193630, 14 pages, 2017. https://doi.org/10.1155/2017/9193630
Computing Individual Risks Based on Family History in Genetic Disease in the Presence of Competing Risks
Abstract
When considering a genetic disease with variable age at onset (e.g., familial amyloid neuropathy, cancers), computing the individual risk of the disease based on family history (FH) is of critical interest for both clinicians and patients. Such a risk is very challenging to compute because the genotype $X$ of the individual of interest is in general unknown, the posterior distribution $P(X \mid \mathrm{FH}, T > t)$ changes with $t$ ($T$ is the age at disease onset for the targeted individual), and the competing risk of death is not negligible. In this work, we model this problem with a Bayesian network combined with (right-censored) survival outcomes where hazard rates only depend on the genotype of each individual. We explain how belief propagation can be used to obtain the posterior distribution of genotypes given the FH and how to obtain a time-dependent posterior hazard rate for any individual in the pedigree. Finally, we use this posterior hazard rate to compute individual risk, with or without the competing risk of death. Our method is illustrated using the Claus-Easton model for breast cancer. The competing risk of death is derived from the national French registry.
1. Introduction
Complex diseases with variable age at onset typically have many interacting factors such as age, lifestyle, environmental factors, treatments, and inherited genetic components. The genetic component is generally composed of one or several genes, including major genes for which a deleterious mutation significantly raises the risk of the disease and/or minor genes whose contribution to the disease is moderate by itself.
The mode of inheritance can be monogenic if a mutation in a single gene is transmitted or polygenic if mutations in several genes are transmitted. As an example of a major gene in a complex disease, the BRCA1 gene has been known since the 90s to be strongly correlated with ovarian and breast cancer [1, 2]. Carriers of a deleterious mutation in the BRCA1 gene have a much higher risk of being affected, with relative risks ranging from 20 to 80, but deleterious mutations in the BRCA1 gene only explain 5 to 10% of the disease [3], as many other implicated known or unknown genes exist along with sporadic cases (cases with no inherited component).
In other rare genetic diseases such as the Transthyretin-related Hereditary Amyloidosis (THA), no sporadic cases are found and therefore the incidence is equal to zero among noncarriers and all affected individuals are necessarily carriers of a deleterious mutation [4, 5].
The family history (FH) of such diseases is often the first tool for clinicians to detect a family of carriers of a deleterious mutation, as any unusual accumulation of cases in relatives leads one to suspect a deleterious allele in the family. With the appropriate model and computation, the FH can be used to better target the most appropriate individuals for genetic testing and/or to identify high-risk individuals who require special attention (monitoring and/or treatments).
The first challenge in computing with such a model comes from the fact that genotypes are mostly (if not totally) unobserved and that posterior carrier probability computations must sum over a large number of configurations of the founders' genotypes. Once such computations are carried out, deriving the posterior individual disease risk is also a challenging task since the posterior carrier distribution changes over time and this must be accounted for. Finally, for diseases with possibly late age at onset (e.g., cancer), the competing risk of death is not negligible and must be accounted for.
A competing risk situation occurs when an event (called a competing event) precludes the occurrence of the event of interest. This is typically the case for late-onset diseases as the risk of death is not negligible at advanced ages. Ignoring the risk of death would amount to assuming that death cannot happen and would therefore lead to overestimating the cumulative incidence (the probability of having the disease before any time point). Famous examples of such situations include dementia, where the patients are of a particularly advanced age and have a high risk of dying, as in Jacqmin-Gadda et al. [6] or Wanneveich et al. [7], or studies on geriatric patients (see, e.g., [8]).
Classical familial risk models such as Claus-Easton [9, 10], BOADICEA [11], or the BayesMendel models (BRCAPRO, MMRpro, PancPRO, and MelaPRO, see [12]) do not take into account the competing event of death. As a result, it is likely that individual predictions from these models will tend to be overestimated [13]. The main result of the present work is that we show how to derive individual risk predictions from the family history while taking into account the competing risk of death, which is a new contribution to the best of our knowledge.
Another interesting point is that, unlike most similar publications, we here provide all the necessary details to integrate the likelihood over the unobserved genotypes and to compute posterior genotype distributions using Bayesian networks and sum-product algorithms. One should note that these models and algorithms are often used in the context of genetics (see [14–18], for a few examples) but rarely fully detailed (see, e.g., [12]).
It should also be noted that the genetics community usually prefers to rely on simple peeling algorithms rather than Bayesian networks for pedigree computations, but the two concepts are in fact totally equivalent, and the sum-product algorithm presented in this paper can indeed be seen as a simple Bayesian network based reformulation of the most general peeling-based algorithm developed so far [19].
The paper is organized as follows: firstly, in Section 2.1 we introduce a formal generic Bayesian network model adaptable to any genetic disease with variable age at onset. Secondly, in Section 2.2, we provide in this context all the necessary details to carry out belief propagation on this model and express the marginal posterior carrier distribution using the Bayesian network's potentials. Thirdly, in Section 2.3, we give closed-form formulas for the posterior individual disease risk and introduce a simple numerical algorithm that takes into account the competing risk of death. Finally, in Section 3, all the methods are illustrated with the Claus-Easton model for breast cancer using the disease model and the parameters of Claus et al. [9] and Easton et al. [10]. In particular, individual predictions derived by taking into account the competing risk of death or ignoring it are compared, which emphasizes the importance of properly taking into account the competing risk of death in such models.
2. Materials and Methods
In this section, we first introduce our model (Section 2.1) as a Bayesian network. We next explain how to perform belief propagation in order to obtain posterior carrier distributions (Section 2.2). Finally, we provide all the details needed to derive disease risk predictions from these posterior distributions, including taking into account the competing risk of death (Section 2.3).
2.1. The Bayesian Network
We consider a total of $n$ (related) individuals. With $\mathcal{I} = \{1, \ldots, n\}$, we denote by $\mathcal{F} \subset \mathcal{I}$ the subset of the founders (i.e., individuals without ancestors in the pedigree) and by $\mathcal{I} \setminus \mathcal{F}$ the set of nonfounders (i.e., with ancestors in the pedigree). Let $X = (X_1, \ldots, X_n) \in \{00, 01, 10, 11\}^n$ be the genotypes (for the sake of simplicity, we consider here a simple biallelic gene but multiallelic genes can obviously be easily considered) of the whole family, where $X_i$ denotes the genotype of Individual $i$. Let $T = (T_1, \ldots, T_n)$ be the time vector representing the age at diagnosis of all individuals. The joint distribution of $(X, T)$ is given by

$$P(X, T) = \prod_{i \in \mathcal{F}} P(X_i) \prod_{i \in \mathcal{I} \setminus \mathcal{F}} P\left(X_i \mid X_{\mathrm{pat}_i}, X_{\mathrm{mat}_i}\right) \prod_{i \in \mathcal{I}} P(T_i \mid X_i), \tag{1}$$

where $\mathrm{pat}_i$ and $\mathrm{mat}_i$ denote the father and the mother of Individual $i$. This corresponds to the definition of a Bayesian network (BN). See Koller and Friedman [20] for more details. The genetic part of (1) only relies on the "classical" Mendelian assumption that the distribution of a nonfounder genotype only depends on the parental genotypes. The survival part makes the strong assumption that all $T_i$ are conditionally independent given $X_i$. This assumption is clearly not true when considering any other familial effect on the disease (e.g., polygenic effect and environmental exposure), which is often taken into account using a familial random effect (often called frailty in the survival context). Such a familial random effect is, for example, assumed to account for a polygenic effect in the BOADICEA model [11, 21]. Note that, for the sake of simplicity, the symbol "$P$" corresponds throughout the whole paper either to a probability measure or to a density.
The extension of the present model to frailty models such as BOADICEA is clearly possible and, in many ways, quite straightforward. For the sake of simplicity, we nevertheless focus here on a simpler model and briefly discuss the extension in the conclusion section. Note that, even with the strong assumption that $T_i$ only depends on $X_i$, since (the basically unobserved) $X$ has a strong correlation structure within the pedigree, so does $T$.
We can see in Figure 1 an example of a moderate size (hypothetical) family with a severe history of breast and ovarian cancer. This family has a total of individuals with and . There is no inbreeding (mating between individuals with a common ancestor) in this family but a mating loop (two families joined more than once by mating) due to the two brothers of the first nuclear family having children with two sisters of the second nuclear family. Such looped pedigree can be tricky to represent and this explains why Individual 7 appears twice (with an identity link) in Figure 1.
One should note that loops in a pedigree are not the same as cycles in the Bayesian network framework, in the sense that the underlying conditional dependence structure of the model remains a proper directed acyclic graph even for pedigrees with loops.
Genetic Part. For the genetic part, we assume that founders' genotypes are distributed according to the Hardy-Weinberg distribution with disease allele frequency $q$. It means that for any founder $i \in \mathcal{F}$ we have $P(X_i = 00) = (1 - q)^2$, $P(X_i = 01) = P(X_i = 10) = q(1 - q)$, and $P(X_i = 11) = q^2$. This assumption is extremely frequent in family genetics and usually reasonable since it corresponds to the stationary distribution we observe in a population under mild assumptions. However, one should note that other distributions can easily be considered if necessary (e.g., genotype forbidden because it is lethal). For the nonfounders we simply assume a Mendelian transmission of the alleles, but unbalanced transmission patterns can also be considered.
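The two ingredients of the genetic part can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the genotype coding (first character for the paternal allele, second for the maternal allele, with 1 for the disease allele), the symbol `q` for the disease allele frequency, and the function names are hypothetical choices made here, not taken from the authors' R code.

```python
GENOTYPES = ("00", "01", "10", "11")  # assumed coding: paternal allele, then maternal allele


def hardy_weinberg(q):
    """Founder genotype distribution for a disease allele frequency q."""
    return {
        g: (q if g[0] == "1" else 1.0 - q) * (q if g[1] == "1" else 1.0 - q)
        for g in GENOTYPES
    }


def transmit(parent_genotype):
    """Probability that a parent passes on the disease allele (balanced transmission)."""
    return (int(parent_genotype[0]) + int(parent_genotype[1])) / 2.0


def mendel(child, father, mother):
    """P(child genotype | father genotype, mother genotype) under Mendelian transmission."""
    p_father, p_mother = transmit(father), transmit(mother)
    from_father = p_father if child[0] == "1" else 1.0 - p_father
    from_mother = p_mother if child[1] == "1" else 1.0 - p_mother
    return from_father * from_mother
```

An unbalanced transmission pattern would only require changing `transmit`, and a forbidden genotype could be handled by zeroing its probability and renormalizing.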
The genetic part of the model can also be easily extended to account for various constraints. For example, the presence of monozygous twins, say individuals $i$ and $j$, only requires one to add an identity constraint between the two genotypes, such as $X_i = X_j$. Genetic tests (with or without error) can also be incorporated as additional variables whose conditional distributions given the genotype correspond to the test specificity and sensitivity. Finally, assuming lethal genotypes (e.g., genotype 11) is done straightforwardly by setting to zero the probability of carrying such a genotype. This is equivalent to conditioning on the absence of the lethal genotype, which obviously alters all genotype distributions, including Hardy-Weinberg for founders.
Survival Part. We place ourselves in the classical survival framework, denoting by $\lambda(t)$ the (time-dependent) hazard function and by $S(t)$ the survival function defined as $S(t) = \exp(-\Lambda(t))$, where $\Lambda(t) = \int_0^t \lambda(u)\,du$ is the cumulative hazard.
We assume an autosomal dominant model where noncarriers have a disease incidence $\lambda_0$ and carriers have a disease incidence $\lambda_1$. This simple assumption results in the following expression of the survival part of the model:

$$P(T_i > t \mid X_i) = \begin{cases} S_0(t) & \text{if } X_i = 00, \\ S_1(t) & \text{otherwise,} \end{cases} \qquad P(T_i = t \mid X_i) = \begin{cases} \lambda_0(t) S_0(t) & \text{if } X_i = 00, \\ \lambda_1(t) S_1(t) & \text{otherwise,} \end{cases} \tag{2}$$

where $S_0$ and $S_1$ denote the survival functions associated with $\lambda_0$ and $\lambda_1$. As explained above, the symbol "$P$" corresponds to a (conditional) probability measure for the event $\{T_i > t\}$ and to a density for the punctual event $\{T_i = t\}$.
For example, in the context of the THA, noncarriers cannot be affected ($\lambda_0 = 0$) and only carriers have an age-dependent incidence. In the context of breast cancer, $\lambda_0$ might be the incidence for non-BRCA carriers and $\lambda_1$ the incidence for BRCA carriers (BRCA1 or BRCA2).
Of course, the simple model suggested in (2) can easily be extended to account for other genetic models (e.g., recessive, additive, gonosomal (i.e., nonautosomal), and with parent-of-origin effect) as well as for any known covariates (e.g., BMI, smoking, and other diseases) using a classical proportional hazard model.
Hazard rates $\lambda_0$ and $\lambda_1$ are typically described in the literature as piecewise constant hazards (PCHs), but our model allows for any parametric or nonparametric shape as long as hazard rates are provided (e.g., hazard rates of Weibull distributions and Gaussian survival).
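For a piecewise constant hazard, the cumulative hazard is a sum of rectangle areas and the survival function follows from the standard relation between survival and cumulative hazard. The sketch below is a minimal illustration; the argument names `cuts` and `rates` and the convention that the last rate extends to infinity are assumptions made here.

```python
import math


def cumulative_hazard(t, cuts, rates):
    """Integral from 0 to t of a piecewise constant hazard.

    cuts  : increasing interval start points, with cuts[0] == 0
    rates : rates[k] applies on [cuts[k], cuts[k + 1]); the last rate is open-ended
    """
    total = 0.0
    for k, rate in enumerate(rates):
        start = cuts[k]
        end = cuts[k + 1] if k + 1 < len(cuts) else math.inf
        if t <= start:
            break
        total += rate * (min(t, end) - start)
    return total


def survival(t, cuts, rates):
    """Survival function S(t) = exp(-cumulative hazard)."""
    return math.exp(-cumulative_hazard(t, cuts, rates))
```

Any other parametric shape (e.g., Weibull) could be plugged in by replacing `cumulative_hazard` with its closed form.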
2.2. Carrier Risk
For all Individuals $i$, let us denote by $\mathrm{PH}_i$ their personal history of the disease. In the case where Individual $i$ was diagnosed with the disease at age $t_i$ we have $\mathrm{PH}_i = \{T_i = t_i\}$. If Individual $i$ was unaffected at age $t_i$ (age at the last follow-up), the variable $T_i$ is right-censored and we have $\mathrm{PH}_i = \{T_i > t_i\}$. From now on, we denote by FH the family history of the disease. This includes the personal history of all individuals and all possible additional constraints or information (e.g., monozygous twins, genetic tests, and lethal alleles). Formally, we can define $\mathrm{FH} = \bigcap_i \mathrm{PH}_i \cap \{X \in \mathcal{X}\}$, where $\mathcal{X}$ is the subset of allowed values for $X$ (e.g., $X_i \in \{00, 01, 10\}$ if we know that genotype 11 is lethal and $X_i = 00$ if we know that a particular individual is a noncarrier). Even with genetic testing, it is essential to understand that $X$ is, at best, partially observed. Indeed, even with a (hypothetical and unrealistic) 100% specificity/sensitivity test, a positive heterozygous carrier status cannot distinguish between genotypes 01 and 10. Moreover, genetic tests are in general only available for a few individuals in the whole pedigree. Accounting for the unobserved genotypes is therefore of utmost importance.
Following the classical BN notations, we write the socalled evidence as the simple following sumproduct of potentials:where the potentials are defined bywhere is either or and can be obtained through (2). Note that denote the parental set of Individual (empty for founders) and that for any . As explained above, any additional information or constraint might and should be added directly into the potentials.
Since $X$ has $4^n$ possible configurations in the worst case, it is clearly impossible to simply enumerate these configurations even for moderate-size pedigrees. We therefore need a more efficient algorithm to compute (3). An efficient solution is provided by the Elston-Stewart algorithm [22] in the particular (and frequent) case where the pedigree has no loop. The basic idea is to eliminate variables from the sum-product (peeling in the Elston-Stewart literature) from the last generations up to the oldest common ancestor. The resulting complexity clearly allows one to deal with arbitrary pedigree size as long as there is no loop.
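To make the cost of direct enumeration concrete, the following sketch computes the evidence and a posterior carrier probability for a single trio (father, mother, child) by summing over all genotype configurations. It is a toy illustration only: constant hazards `lam0` and `lam1`, the genotype coding, the history encoding, and all function names are assumptions made here, and the loop over `product(GENOTYPES, repeat=3)` is exactly what becomes intractable as the pedigree grows.

```python
import math
from itertools import product

GENOTYPES = ("00", "01", "10", "11")  # assumed coding: paternal allele, then maternal allele


def founder_prior(g, q):
    """Hardy-Weinberg probability of genotype g."""
    return (q if g[0] == "1" else 1.0 - q) * (q if g[1] == "1" else 1.0 - q)


def mendel(child, father, mother):
    """Balanced Mendelian transmission P(child | father, mother)."""
    pf = (int(father[0]) + int(father[1])) / 2.0
    pm = (int(mother[0]) + int(mother[1])) / 2.0
    return (pf if child[0] == "1" else 1.0 - pf) * (pm if child[1] == "1" else 1.0 - pm)


def history_likelihood(history, g, lam0, lam1):
    """Likelihood of a personal history given genotype g (constant hazards assumed)."""
    if history is None:  # no information on this individual
        return 1.0
    status, age = history
    lam = lam0 if g == "00" else lam1  # dominant model
    surv = math.exp(-lam * age)
    return lam * surv if status == "affected" else surv


def evidence_and_child_carrier(histories, q, lam0, lam1):
    """Brute-force sum over the 4**3 configurations of (father, mother, child)."""
    evidence, carrier = 0.0, 0.0
    for xf, xm, xc in product(GENOTYPES, repeat=3):
        w = founder_prior(xf, q) * founder_prior(xm, q) * mendel(xc, xf, xm)
        for h, g in zip(histories, (xf, xm, xc)):
            w *= history_likelihood(h, g, lam0, lam1)
        evidence += w
        if xc != "00":
            carrier += w
    return evidence, carrier / evidence
```

With no personal history at all, the child's posterior carrier probability reduces to the population value, which is a convenient sanity check before moving to larger pedigrees.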
Unfortunately, loops (inbreeding or mating) are not totally uncommon in pedigrees and therefore have to be accounted for. A simple extension of the ElstonStewart algorithm consists in using loop breakers: working conditionally to a few number of key genotypes that can be considered as duplicated individuals with known genotypes in a pedigree with no loop. For example, in Figure 1, Individual is a possible loop breaker. By performing a classical ElstonStewart algorithm for each genotypic configuration of the loop breakers, can be computed with complexity , where is the number of loop breakers.
In the context of Bayesian networks, computing (and, in fact, the whole distribution) is typically done through belief propagation (BP) (also called sumproduct algorithm) with a complexity, where is the treewidth of the graphical model (see [20], for more details). For a pedigree with no loop, and the BP complexity is strictly the same as ElstonStewart, but for more complex pedigrees, usually increases much slower than and, as a result, BP is often dramatically faster than ElstonStewart with loop breakers.
In order to achieve this, BP basically eliminates variables from the sum-product of (3) in a suitable order. In that sense, it is very similar to the notion of cutset long used to compute likelihoods in complex pedigrees (see [23], for a recent reference on the MENDEL package). But BP has the noticeable advantage of yielding the full posterior distribution for the same algorithmic complexity, while likelihood-based approaches need to repeat many cutset eliminations to achieve the same results. As a consequence, it should not be surprising to see that, in parallel with the classical genetic literature [22–24], many authors have been using BP and BN to deal with genetic models [14–18].
Let us finally point out that the genetics community has put considerable effort into developing Elston-Stewart-type counterparts of Bayesian network algorithms, claiming that peeling-based algorithms are more natural for geneticists than junction-tree based ones. Note however that the most general version of these peeling algorithms [19] is in fact exactly equivalent to the classical junction-tree based forward/backward algorithm presented below.
For completeness, we will now briefly recall all the minimal necessary results to implement BP in the context of our model. We nevertheless encourage the interested reader to refer to more classical references like Lauritzen and Sheehan [17] or Koller and Friedman [20] for more details.
Variable Elimination and Junction Tree. As an example, we consider the pedigree of Figure 1 and want to compute by successive variable elimination. We use the following elimination order: , , , , , , , and . Here follow the quantities obtained in the process:We therefore can obtain by considering only configurations over the total number of configurations. Note that a memory bounded version of the variable elimination exists; see Darwiche [25] for more details.
Figure 2 is a graphical representation of this particular sequence of elimination and is also a junction tree defined as a set of cliques with with the following properties:(i)Tree: each clique is connected to a subsequent clique ( by convention). We also define () and (with the convention that ).(ii)Covering: for all there exists such as . We then define and .(iii)Running intersection: for all the subgraph formed by (and the from/to relationships) is a tree.
In graph theory, junction trees are used as an auxiliary structure for many applications (e.g., graph coloring). The proof that any elimination sequence gives a junction tree can be found in Koller and Friedman [20]. The treewidth of an elimination sequence/junction tree is defined as the size of its largest clique. Finding the elimination sequence with the smallest treewidth is NP-hard in general, but many heuristics are available [20]. The elimination order of Figure 2 has been obtained using the well-known minimum fill-in heuristic.
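The minimum fill-in heuristic can be sketched as a greedy loop: at each step, eliminate the node whose removal requires adding the fewest edges between its remaining neighbours. The code below is a generic illustration on an undirected graph given as an adjacency dictionary (an assumed input format); ties are broken by node label, and the returned width follows the convention above (size of the largest clique created).

```python
from itertools import combinations


def min_fill_order(adjacency):
    """Greedy minimum fill-in elimination order.

    adjacency : dict mapping each node to the set of its neighbours
    returns   : (elimination order, size of the largest clique created)
    """
    graph = {v: set(nb) for v, nb in adjacency.items()}  # work on a copy
    order, width = [], 0
    while graph:
        def fill(v):
            # number of edges that eliminating v would add between its neighbours
            return sum(1 for a, b in combinations(graph[v], 2) if b not in graph[a])

        v = min(sorted(graph), key=fill)  # ties broken by node label
        neighbours = graph.pop(v)
        width = max(width, len(neighbours) + 1)
        for a, b in combinations(neighbours, 2):  # connect the neighbours of v
            graph[a].add(b)
            graph[b].add(a)
        for u in neighbours:
            graph[u].discard(v)
        order.append(v)
    return order, width
```

For a pedigree, the input graph would be the moral graph of the Bayesian network (parents of a common child connected); a greedy order is not guaranteed to be optimal, which is consistent with the NP-hardness mentioned above.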
Belief Propagation. We assume that a suitable elimination order/junction tree has been obtained. For all we hence define the potential of clique as and we have the following result.
Theorem 1 (posterior distribution). For all , let and we have where the forward quantities are defined for by and the backward quantities are defined by (convention) and for , for all :
Proof. See Appendix A.
Using Theorem 1, it is therefore possible to obtain and all by just recursively computing once all forward and backward quantities.
2.3. Disease Risk
While the previous section covered the computation of the posterior probability for all individuals in the pedigree, we now focus in this section on computing individual posterior disease risks, with or without the competing risk of death.
Risk without Competing Events. We consider an Individual with a posterior carrier probability at age ; that is, . Conditionally to the family history, we denote the survival and hazard functions, respectively, by and such that, for , and . We have the following result.
Theorem 2. For any , we have
Proof. See Appendix B.
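As an illustration of the kind of computation involved, the sketch below evaluates the survival and hazard of a two-component mixture (carrier versus noncarrier) for an individual known to be unaffected at a reference age. It assumes the standard mixture form, in which each genotype-specific survival is renormalized at the reference age and the posterior hazard is the survival-weighted average of the two hazards; the names `pi`, `tau`, `lam0`, `lam1`, `cum0`, and `cum1` are notation chosen here for the example.

```python
import math


def posterior_survival_and_hazard(t, tau, pi, lam0, lam1, cum0, cum1):
    """Mixture survival and hazard at age t >= tau.

    pi         : carrier probability given the family history, unaffected at age tau
    lam0, lam1 : hazard functions of noncarriers and carriers
    cum0, cum1 : corresponding cumulative hazard functions
    """
    w1 = pi * math.exp(-(cum1(t) - cum1(tau)))          # carrier and still unaffected at t
    w0 = (1.0 - pi) * math.exp(-(cum0(t) - cum0(tau)))  # noncarrier and still unaffected at t
    surv = w0 + w1
    hazard = (w1 * lam1(t) + w0 * lam0(t)) / surv
    return surv, hazard
```

With constant hazards, the mixture hazard starts at the weighted average of the two rates and then drifts toward the noncarrier rate as carriers are progressively depleted from the unaffected group.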
Risk with Death as a Competing Event. As explained in the introduction, death precludes the occurrence of the disease. This needs to be taken into account by defining the hazard rate of the disease conditionally to the fact that both disease and death have not occurred yet. From a statistical point of view, such a situation can be seen as a competing risk situation or as an illnessdeath model; see Andersen et al. [26] or Andersen and Keiding [27] for a presentation of such models. We define as the minimum between age at disease onset and age at death and we keep the notation to denote the age at disease onset. Given an individual with a family history , its hazard rate for the disease is defined asWe denote by and the hazard and survival functions of (conditionally to the family history) and we assume that and are piecewise constants with common cuts (i.e., and for ).
Lemma 3. For , , we have
Proof. See Appendix B.
Practical Computations. We assume that one individual has a carrier probability at age (his age without the disease in the FH). We denote by his/her hazard of death. Then the posterior disease risk with the competing risk of death can be computed through the following steps:(1)Choose a fine enough discretization (e.g., all year).(2)Compute using (10).(3)Compute .(4)Then the marginal posterior probability of being diagnosed with the disease before age , in the presence of death as a competing risk, is given for by
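The steps above can be sketched as a simple loop over a fine age grid. This is a hedged illustration rather than the authors' implementation: it assumes that the hazard of death `mu` does not depend on the genotype, that all hazards can be treated as constant within each step, and that the cumulative incidence accumulates on each step the probability of leaving the disease-free state through the disease. The argument names and the midpoint evaluation are choices made here.

```python
import math


def cumulative_incidence(tau, t_max, pi, lam0, lam1, mu, step=0.1):
    """Probability of being diagnosed between ages tau and t_max, with death competing.

    pi         : carrier probability for an individual unaffected at age tau
    lam0, lam1 : disease hazard functions of noncarriers and carriers
    mu         : hazard function of death (assumed independent of the genotype)
    """
    w0, w1 = 1.0 - pi, pi  # unnormalized noncarrier / carrier weights among the unaffected
    alive = 1.0            # probability of being alive and disease-free
    risk = 0.0
    n_steps = int(round((t_max - tau) / step))
    for k in range(n_steps):
        t = tau + (k + 0.5) * step  # hazards evaluated at the step midpoint
        lam = (w0 * lam0(t) + w1 * lam1(t)) / (w0 + w1)  # posterior disease hazard
        total = lam + mu(t)
        if total > 0.0:
            leave = 1.0 - math.exp(-total * step)  # leaves the disease-free state
            risk += alive * leave * lam / total    # share of exits due to the disease
            alive *= 1.0 - leave
        w0 *= math.exp(-lam0(t) * step)
        w1 *= math.exp(-lam1(t) * step)
    return risk
```

Passing `mu = lambda t: 0.0` recovers the risk without competing events, which gives a direct way to compare the two predictions for the same individual.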
3. Results and Discussion
3.1. The Claus-Easton Model
In order to illustrate our method, we use the disease model and the parameters of the Claus-Easton model developed from the Cancer and Steroid Hormone Study in the 90s [9, 10].
The Claus-Easton model is a classical genetic model composed of a genotypic part and a phenotypic part with only the family history (FH) as covariate. It assumes an autosomal dominant mode of inheritance and a piecewise constant hazard rate by steps of 10 years. The penetrance ($F$) and the density ($f$) are given in Table 2 from Easton et al. [10] for both carriers and noncarriers at several ages. The hazard rates can therefore be derived from these data using the formula $\lambda(t) = f(t) / (1 - F(t))$. The results of these computations are given in Table 1. The frequency of the mutated allele has been estimated in [9]. The death incidences needed in the competing risk section are given in Table 2.
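Deriving hazard rates from a penetrance and a density is a one-line computation, sketched below under the standard relation that the hazard is the density divided by the survival (one minus the penetrance). A second helper shows an alternative, also an assumption of this sketch, that recovers a constant rate on each interval from the penetrance at the interval ends. The function names are hypothetical and no value from Easton et al. [10] is reproduced here.

```python
import math


def hazard_from_density(density, penetrance):
    """Pointwise hazard f(t) / (1 - F(t)) at ages where both quantities are tabulated."""
    return [f / (1.0 - F) for f, F in zip(density, penetrance)]


def piecewise_rates_from_penetrance(ages, penetrance):
    """Constant hazard on each interval [ages[k], ages[k + 1]) matching the tabulated penetrance."""
    rates = []
    for k in range(len(ages) - 1):
        s_start = 1.0 - penetrance[k]      # survival at the start of the interval
        s_end = 1.0 - penetrance[k + 1]    # survival at the end of the interval
        rates.append(-math.log(s_end / s_start) / (ages[k + 1] - ages[k]))
    return rates
```

The second form guarantees that the piecewise constant model reproduces the tabulated penetrance exactly at the interval ends, which the pointwise form does not.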


Figure 3 presents the incidence and survival for BC (carriers and noncarriers) as well as death. We can notice that the breast cancer incidences in carriers are always much higher than in noncarriers at any age and the relative risk between carriers and noncarriers is especially large () before age 40 (see Table 1) but then decreases with aging. We notice that the death incidence stays above the BC incidence for noncarriers at all ages and exceeds even the BC incidence for carriers from age 80. This shows the importance of taking it into consideration especially over a certain age.
3.2. Carrier Risk
In this section we use belief propagation in Bayesian networks to obtain the posterior distribution of individual genotypes given the FH. We get the posterior probabilities of each genotype (noncarrier, heterozygous carrier with a paternal mutated allele, heterozygous carrier with a maternal mutated allele, and homozygous carrier).
Figure 4 represents, for all individuals, the marginal posterior probability of the paternal carrier and maternal carrier genotypes. Note that the posterior probability of the homozygous carrier genotype, being almost zero for each individual, is not shown here. The posterior probability of the noncarrier genotype can be easily deduced.
We can notice that the probabilities of being a noncarrier for 1, 3, 4, 7, 8, and 9 are all by far the highest despite the severe phenotype of relatives (granddaughter, niece, or daughter). This result is consistent with the personal history of Individual 2 (ovarian cancer at age 51) which points her out as the most likely origin of the mutation in the family. Let us note that since we have no additional information on the ancestors of Individual 2, it is impossible to determine whether her mutation was transmitted by her father or her mother. As a consequence, the posterior carrier probability is equally shared between the paternal and maternal carrier genotypes.
Considering the severe personal history of cancer of Individuals 10 and 11, the most likely situation would be that they both received the mutation of their grandmother through their respective fathers (Individuals 6 and 5, resp.). The posterior probabilities are clearly consistent with this scenario: Individuals 5 and 6 have a probability of ≃90% to be maternal carriers, and Individuals 10 and 11 have similar probabilities to be paternal carriers. Note that Individual 12, being unaffected at age 37 (which is not very informative), basically has chance to have received the mutation from her father.
Figure 5 shows some examples of the variations of the posterior marginal distribution of the genotypes in the same family structure according to different FH. We first notice that with no information (FH1) the posterior probabilities are exactly those of the general population: .
Note that Individual 2 has a severe personal history of cancer (ovarian cancer at age 51) in all other examples. As a consequence, Individual 1, as a male with no personal history of cancer, is almost totally uninformative and therefore not included in the forthcoming analyses.
Individual 4, having no children, is independent of the rest of the family conditionally on her phenotype and her parents' genotypes. With no information about her phenotype in any FH, her probability of being a carrier is therefore almost half her mother's in each FH (because her father is almost uninformative). If we compare the posterior distributions of the genotype of Individual 3 in FH2, FH3, and FH4, we can notice that the ovarian cancer of her mother, which increases her mother's probability of being a carrier, raises her own probability of being a carrier (FH2). A piece of protective information about her phenotype, such as no cancer until age 61, lowers her posterior probability of being a carrier (FH3). On the contrary, the cancer at a young age of her daughter, which increases her daughter's probability of being a carrier, raises her own probability of being a carrier (FH4–6).
We also notice the causal relationships in a whole branch of the family with the transmission between Individuals 2, 3, and 6 of the deleterious allele being highly probable which raises the probability of being a carrier for Individual 3 even in the presence of a protective phenotype (unaffected at age 61) in FH4.
We finally observe the influence of the spouse’s genotype when having children (FH5). The higher risk of being a carrier for Individual 5 (because of his cancer at age 72) strongly decreases the carrier probability of his spouse (in comparison with FH4) since the paternal origin of the disease mutation naturally becomes the most likely event. On the other side, the increase of risk for Individual 3 when suppressing her protective phenotype (FH6) also has a consequence on the marginal posterior distribution of her spouse in lowering his probability of being a carrier as his participation in the risk for their daughter is lowered.
To summarize, one’s probability of being a carrier mainly depends on one’s probability of having at least one carrier parent, which is correlated to the history of cancer of one’s ancestors, and on one’s probability of having transmitted the mutation to one’s offspring, which is correlated to the history of cancer of one’s descendants and to one’s spouse’s probability of being a carrier.
Remark 4. As introduced in Section 2.3 (Disease Risk), we know that posterior carrier probabilities should decrease with time for unaffected individuals. For example, if we assume that Individual 4 is unaffected at age 40 in FH6, her probability of being a carrier is 24%. If she stays unaffected up to age 60 and age 80, her probability of being a carrier decreases to 15% and 8.5%, respectively.
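The mechanism behind this decrease is a plain Bayes update: staying unaffected is more likely for a noncarrier than for a carrier, so the carrier weight shrinks with age. The sketch below illustrates it with user-supplied cumulative hazards; the names `pi`, `tau`, `cum0`, and `cum1` are assumptions of this example, and with arbitrary hazards it is not expected to reproduce the percentages quoted in the remark, which come from the Claus-Easton rates.

```python
import math


def updated_carrier_probability(pi, tau, t, cum0, cum1):
    """Carrier probability at age t >= tau for an individual still unaffected at t.

    pi         : carrier probability given the family history, unaffected at age tau
    cum0, cum1 : cumulative hazard functions of noncarriers and carriers
    """
    w1 = pi * math.exp(-(cum1(t) - cum1(tau)))          # carrier and unaffected up to t
    w0 = (1.0 - pi) * math.exp(-(cum0(t) - cum0(tau)))  # noncarrier and unaffected up to t
    return w1 / (w0 + w1)
```

For instance, calling the function at increasing ages with `pi = 0.24` and any pair of hazards where carriers are at higher risk yields a decreasing sequence starting at 0.24.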
Table 3 gives a practical illustration of the dependence and conditional independence in a trio (grandparentparentchild). We compare the posterior joint distribution and the product of the posterior marginal distributions of genotypes and in FH4 with various information on . We can see that these two quantities are not equal when is not observed while they are exactly the same when is fixed. This example demonstrates how and are not conditionally independent given FH but they are, conditionally to FH and . Note that when , the mutation is necessarily found in both parents (Individuals 1 and 2) as well as in her daughter (Individual 6).

3.3. Cancer Risk
As in Section 2.3 we now consider a female individual who is unaffected at age (i.e., ) and denote by its posterior carrier probability. The purpose of this section is to compute the posterior risk of cancer for this individual (with or without the competing risk of death). As previously explained, these risks only depend on and .
Figure 6 represents the individual risk of breast cancer up to age 100 (note that we obtain qualitatively similar results with a lower age limit (e.g., age 80), but quantitative results are more illustrative with age 100) without the competing risk of death and variant and . We can see that the individual risk of BC rises as increases and decreases. This result is quite intuitive as the younger a patient is, the longer she will be at risk until age 100; the greater her probability of carrying a deleterious allele, the greater her risk to develop a cancer.
As introduced in the previous section the probability of being a carrier for an unaffected individual decreases with time if she stays unaffected. Assuming Individual 4 was 52 in FH4, Figure 7 shows the evolution of the probability of being a carrier for Individual 3 and Individual 4 in FH4. As they stay unaffected we can clearly see the decrease of this probability which has to be taken into account in the computation of the individual risk of breast cancer over time (see Section 2.3).
As explained in Section 2.3, computing risk with the competing risk of death requires a numerical discretization of age by a fixed step . In order to calibrate we used as a reference and observed that is a reasonable balance between accuracy and computational efficiency (data not shown).
Figure 8 represents the individual risk of breast cancer for Individual 7 ( and years) and Individual 12 ( and years) in our hypothetical family from to 100 years with and without taking into account the competing risk of death. We can see that the difference between the two curves for each individual is increasing with the age. The age from which the difference becomes significant varies with the couple (). We also observe that the individual risk of breast cancer eventually reaches a plateau which corresponds to the point where the incidence of breast cancer becomes negligible compared to the incidence of death in the elderly.
Quantitatively, the importance of taking into account the competing risk of death is pointed out in Figure 9 which represents the difference between the individual risks of breast cancer up to the age of 100 years for variant couples (, ). For example, for Individual 3 in FH4 ( and , see Figure 5), the error while calculating her individual risk of breast cancer up to the age of 100 years reaches almost 14%. If it is clear that the competing risk of death can have a limited effect on the global risk of cancer for certain couples , its effect is never totally negligible, and since we provide a rigorous way to take it into account, we strongly advocate its use in all circumstances.
4. Conclusions
We presented here a general model for genetic diseases with variable age at onset. This model, a Bayesian network, combines classical genetic modeling with survival analysis. In order to deal with the (mostly) unobserved genotypes, we first explained in detail how belief propagation can be used to perform likelihood and posterior probability computations. Secondly, we focused on the challenging problem of computing posterior individual disease risks, with or without taking into account the competing risk of death. Finally, we illustrated these results with the Claus-Easton model for breast and ovarian cancer. The R source code is available upon request for interested readers.
For the sake of simplicity, we only considered a biallelic locus with standard distribution (autosomal, Hardy-Weinberg, and Mendelian allele transmission), but extensions (e.g., multiple loci, unbalanced allele transmission, and lethal genotypes) are straightforward. For the survival model, we presented a simple dominant effect without covariates, but again extensions to any proportional hazard model (e.g., recessive, additive, and with covariates) are easy to implement. Incorporating random effects (at the individual and/or familial level) in the model (as in the BOADICEA model, see [11, 21]) is clearly also possible but slightly more challenging.
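The standard genetic assumptions listed above can be written down in a few lines. The sketch below is illustrative only: the function names and the allele frequency `q = 0.01` are hypothetical, and genotypes are coded by their number of disease alleles (0, 1, or 2), which is a coding assumed here rather than taken from the paper.

```python
from itertools import product


def hardy_weinberg(q):
    """Founder genotype probabilities, indexed by the number (0, 1, 2)
    of disease alleles, for a disease-allele frequency `q`."""
    return [(1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2]


def mendelian_transmission(father, mother):
    """P(child genotype | parental genotypes) for an autosomal biallelic
    locus with balanced (1/2, 1/2) allele transmission."""
    # A parent with g disease alleles transmits one with probability g / 2.
    p_father = father / 2.0
    p_mother = mother / 2.0
    return [
        (1.0 - p_father) * (1.0 - p_mother),
        p_father * (1.0 - p_mother) + (1.0 - p_father) * p_mother,
        p_father * p_mother,
    ]


def is_carrier(genotype):
    """Dominant effect: at least one disease allele."""
    return genotype >= 1


q = 0.01  # hypothetical allele frequency, for illustration only
founder = hardy_weinberg(q)
# Probability that a child of two founders is a carrier, by full enumeration.
child_carrier = sum(
    founder[f] * founder[m] * sum(
        p for g, p in enumerate(mendelian_transmission(f, m)) if is_carrier(g)
    )
    for f, m in product(range(3), repeat=2)
)
print(f"P(founder is a carrier) = {founder[1] + founder[2]:.5f}")
print(f"P(child is a carrier)   = {child_carrier:.5f}")
```

An unbalanced transmission or a recessive effect would only change `mendelian_transmission` or `is_carrier`, respectively, which is one way to read the claim that such extensions are straightforward.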
Computation of posterior carrier distributions remains almost unchanged except for the random effect support which must be discretized (five values are claimed to be sufficient in the BOADICEA literature) and for the belief propagation which must be performed once for each of the possible values of the random effect. For posterior risks, calculations get slightly more complex since the posterior individual hazard must now be integrated over the (changing over time) posterior joint distribution of the individual genotype and of the random effect. Basically, all computations are slightly more intensive with random effects, but most results of Section 2.3 remain very similar.
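The "one run per support value" idea can be sketched on a deliberately simplified case. Everything below is an assumption made for illustration: the five support points and their weights are hypothetical (they are not the BOADICEA values), and `toy_likelihood` is a stand-in for a full belief-propagation run, reduced here to a single individual who is still unaffected at a given age.

```python
import math


def toy_likelihood(age_unaffected, baseline_hazard, random_effect):
    """Stand-in for one belief-propagation run: likelihood of a single
    individual still unaffected at `age_unaffected`, with a hazard
    multiplied by exp(random_effect)."""
    return math.exp(-baseline_hazard * math.exp(random_effect) * age_unaffected)


# Hypothetical five-point discretization of the random effect and its weights.
support = [-2.0, -1.0, 0.0, 1.0, 2.0]
weights = [0.05, 0.25, 0.40, 0.25, 0.05]

# One likelihood evaluation per support value, then a weighted mixture.
likelihoods = [toy_likelihood(60, 0.005, z) for z in support]
marginal = sum(w * l for w, l in zip(weights, likelihoods))
posterior = [w * l / marginal for w, l in zip(weights, likelihoods)]

for z, w, p in zip(support, weights, posterior):
    print(f"z = {z:+.1f}: prior = {w:.2f}, posterior = {p:.3f}")
```

The posterior weights of the random effect change with the observed follow-up, which is the reason the posterior hazard has to be integrated over a joint distribution that itself evolves over time.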
One important limitation of the present work is that we assume that all model parameters are known. However, it should be noted that the likelihood and conditional likelihood should be easy to compute through belief propagation, which means that we basically provide all the necessary means to estimate the model parameters from actual data. In that context, it is nevertheless critical to deal efficiently with ascertainment issues, that is, the fact that the families ending up in the database are usually precisely the ones with the most severe family history of the disease. However, standard methods such as the PEL [5], which are basically conditional likelihood computations, are known to deal relatively well with this problem.
In order to take into account the competing risk of death, we used death from all causes, which was obtained from registry data [28]. However, only death without cancer precludes the onset of cancer, so death from all causes is not exactly the quantity of interest. Since registry data usually do not report the causes of death, estimating the risk of death without cancer is a difficult task. This has been studied, for instance, in Wanneveich et al. [7] through an illness-death model, using registry data and differential equations to model the specific causes of death. Nevertheless, it is very likely that the gain in terms of predictions would be minor, as mortality from all causes is likely to be close to mortality without cancer.
Further work includes all the extensions described above (e.g., more complex genetic model, genetic tests, and familial random effects) as well as the development of a clinical web application for the Claus-Easton model in close collaboration with the cancer genetics department of the Institut Curie. From the methodological point of view, we plan to focus on the computation of more complex posterior distributions, such as the number of carriers in any subgroup of individuals and/or the familial posterior risk (time before any family member at risk is diagnosed).
Appendix
A. Proofs for the Carrier Risk Section
For all we recursively define , , and . Then we can compute the so-called forward and backward quantities over any separator : where and .
The key is then to prove that, for all , we have
For proving (A.2), we start by noticing that the JT (junction tree) properties [20] give and (both being disjoint unions). We therefore have the factorization between the first and second equation being possible thanks to the fact that (JT properties again).
The proof is basically the same for (A.3) using ; we get
The factorization is possible as (running intersection) and , , , and .
Finally, the recursive expression of the forward and backward quantities can be easily derived from (A.2) and (A.3):
which gives the forward recursion by simplifying the term.
B. Proofs for the Disease Risk Section
Proof of Theorem 2. For clarity, we recall that , , , and , for , and that . Since are independent conditionally to , the distribution of conditionally on obviously does not depend on (for values of which are not forbidden by ). This is why can be omitted almost everywhere in the following proof as soon as has been computed.
We have , where the notation represents the summation over the different possible values of ; that is, or . Using Bayes’ rule, where we used the fact that . We similarly prove that
The next result is proved using Bayes’ rule: where we also used the fact that .
We then directly have from Bayes’ rule, , and which concludes the proof.
Finally, in order to prove (10), we recall that Then, using Bayes’ rule and the fact that and . Dividing by and taking the limit as tends to give We showed previously that and which concludes the proof.
Proof of Lemma 3. The first part of the equality is a standard result in the competing risk setting: we have, from Bayes’ rule, and consequently is equal to the density of conditionally to . Then, since for , we have Now, for , , The integral on the right side of the equation is straightforward to compute. This gives