Statistical methods and the subjective basis of scientific knowledge

Preamble
When the Editor of Genetics, Selection, Evolution asked me to translate this paper by the late Professor Gustave MALÉCOT from French into English, I felt flattered and intimidated at the same time. The paper was extensive and highly technical, and written in a manner that is unusual by today's standards, as the sentences are long, windy and, sometimes, seemingly never-ending. However, this was an assignment that I could not refuse, for reasons that should become clear subsequently. I have attempted to preserve MALÉCOT's style as much as possible. Hence, I maintained his original punctuation, except for a few instances in which I was forced to introduce a comma here and there, so that the reader could catch some breath! In those instances in which I was unsure of the exact meaning of a phrase, or when I felt that some clarification was needed, I inserted footnotes. The original paper also contains footnotes by MALÉCOT; mine are indicated as "Translator's Note", following the usual practice; hence, there should be little room for confusion. There are a few typographical errors and inconsistencies in the original text, but given the length of the manuscript and that it was written many years before word processors appeared, the paper is remarkably free of errors. This is undoubtedly one of the most brilliant and clear statements in favor of the Bayesian position that I have encountered, especially considering that it

probabilities a priori times their likelihoods as a function of E (all this holding in the interior of system K). The proportionality constant can be arrived at immediately by writing that the sum of posterior probabilities is equal to 1. The preceding rule still holds in the case where one cannot specify all possible hypotheses $\theta_i$ or all the probabilities $P(E \mid \theta_i K)$ of their influence on E, but then the sum of the posterior probabilities $P(\theta_i \mid EK)$ of all the hypotheses whose consequences one has been able to formulate would be less than 1, and not equal to 1.
We will show how BAYES' formula provides logical rules for choosing one $\theta_i$ among all possible $\theta_i$, or among those whose consequences can be formulated; further, it will be shown that the rules adopted in practice cannot be logically justified except in the light of this formula.
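The normalisation step described above (posterior proportional to prior times likelihood, with the constant fixed by requiring the posteriors to sum to 1) can be illustrated by a minimal Python sketch. The function name `posterior` and the three hypotheses with their numerical priors and likelihoods are hypothetical, chosen only for illustration; they do not come from the paper.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Posterior probabilities proportional to prior * likelihood.

    The proportionality constant is obtained by requiring that the
    posterior probabilities sum to 1 (all hypotheses are assumed listed).
    """
    if len(priors) != len(likelihoods):
        raise ValueError("one likelihood is needed per hypothesis")
    products = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(products)
    if total == 0:
        raise ValueError("the observations are impossible under every hypothesis")
    return [x / total for x in products]


# Hypothetical numbers: three hypotheses with unequal prior probabilities.
priors = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
likelihoods = [Fraction(1, 10), Fraction(3, 10), Fraction(6, 10)]

post = posterior(priors, likelihoods)
most_probable = max(range(len(post)), key=lambda i: post[i])
print(post, most_probable)
```

Exact fractions are used so that the sum of the posterior probabilities is exactly 1; with floating-point numbers the same code would work up to rounding.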

THE RULE OF THE MOST PROBABLE HYPOTHESIS
We shall begin a critical discussion of the methods proposed by FISHER's school by posing the rule of the most probable value: choose the hypothesis $\theta_i$ having the largest posterior probability, with the risk of error given by the sum of the probabilities of the hypotheses discarded (when one can formulate all such hypotheses). (The risk will be small only if this sum is small; it may be reasonable to group together several hypotheses having a total probability close to 1, without making a distinction between them; this we shall do in Section VII.) In order to apply this rule, it is necessary to determine the $\theta_i$ giving the maximum of $P(E \mid \theta_i K)\,P(\theta_i \mid K)$. It follows that the choice of $\theta_i$ depends not only on the likelihoods of the $\theta_i$ but also on their prior probabilities, which are often subjective and variable between individuals, and even within an individual depending on the state of his knowledge or of his memory. However, it must be noted that the presence of the prior probability in the formula is in perfect agreement with the rule, admitted by most experimenters, of combining (suitably weighted) all observations that provide information about a certain hypothesis. Suppose that after the experiments E, another set of experiments E' is carried out; collecting all such experiments one has:

$$P(\theta_i \mid E E' K) = \frac{P(E' \mid \theta_i E K)\, P(E \mid \theta_i K)\, P(\theta_i \mid K)}{\sum_j P(E' \mid \theta_j E K)\, P(E \mid \theta_j K)\, P(\theta_j \mid K)}$$

and the rule leads to choosing the $\theta_i$ that maximizes the numerator; however, the first term represents the likelihood of $\theta_i$ as a function of E' within the system EK, and the product of the last two is proportional to the probability of $\theta_i$ within the system EK, that is:

$$P(\theta_i \mid E K) = \frac{P(E \mid \theta_i K)\, P(\theta_i \mid K)}{\sum_j P(E \mid \theta_j K)\, P(\theta_j \mid K)}$$

which is the probability a priori of $\theta_i$ before realization of E'; it follows then that one would obtain the same result by maximizing $P(E' \mid \theta_i E K) \times P(\theta_i \mid E K)$, that is, the product of the likelihood times the new prior probability.
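The equivalence stated above (updating on E and then on E' gives the same posterior as updating once on E and E' together) can be checked numerically. The following Python sketch is only an illustration: the helper `update` and all numerical values are hypothetical, and the second list is meant to play the role of the likelihoods of E' within the system EK.

```python
from fractions import Fraction


def update(priors, likelihoods):
    """One application of Bayes' formula over a finite list of hypotheses."""
    products = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(products)
    return [x / total for x in products]


# Hypothetical values for three hypotheses theta_1, theta_2, theta_3.
prior = [Fraction(1, 3)] * 3                                     # P(theta_i | K)
lik_E = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8)]         # P(E | theta_i K)
lik_E_prime = [Fraction(1, 5), Fraction(3, 5), Fraction(4, 5)]   # P(E' | theta_i E K)

# Route 1: the posterior after E becomes the new prior before E'.
two_steps = update(update(prior, lik_E), lik_E_prime)

# Route 2: a single update using the joint likelihood of E and E'.
joint = [a * b for a, b in zip(lik_E, lik_E_prime)]
one_step = update(prior, joint)

print(two_steps == one_step, one_step)
```

Both routes select the same most probable hypothesis, which is the point of the argument: the prior probability before E' already summarises everything learned from E.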
The rule of the most likely value, as stated, takes into account all our knowledge, at each instant, about all hypotheses examined, and every new observation is used to update their probabilities by replacing the probabilities evaluated before such observation by posterior probabilities. The delicate point is what values should be assigned to the probabilities a priori before any experimentation providing information about the hypotheses takes place. LAPLACE and BAYES proposed to take the prior probabilities of all hypotheses as equal, which makes the posterior probabilities proportional to the likelihood, leading in this case to the rule of maximum likelihood proposed by Mr. FISHER¹, a rule that, unlike him, I do not think possible to adopt as a first principle, because of the risk of applying it to a given group of observations without considering the set of other observations providing information about the hypotheses considered. A striking example of this pitfall is the contradiction, noted by Mr. JEFFREYS², between the principle of maximum likelihood and the underlying principle of "significance criteria". In this context, the objective is to determine if the observed results are in agreement with a hypothesis or with a simple law (the "null hypothesis" of Mr. FISHER), or if the hypothesis must be replaced by a more complicated one, with the alternative law being more global, including the old and the new parameters. To be precise, if the old law depends on parameters $\alpha_1, \dots, \alpha_p$, the new one will depend in addition on $\alpha_{p+1}, \dots, \alpha_{p+q}$ and will reduce to the old one at given values of $\alpha_{p+1}, \dots, \alpha_{p+q}$, which can always be supposed to be equal to 0 (that is why the name "null hypothesis" is given to the assumption that the old law is valid). The maximum of $P(E \mid \alpha_1, \dots, \alpha_{p+q}, K)$ when all the $\alpha_i$ vary will be larger in general than its maximum when $\alpha_{p+1} = \dots = \alpha_{p+q} = 0$; hence, the rule of maximum likelihood will lead, almost always, to adopting the most complicated law. On the other hand, the usual criterion in this case is to investigate whether there is a great risk of error in adopting the simplest law: to do this one can define a "deviation" between the observed results and those that would be expected, on average, from the simplest law, and then find the probability a priori, under this law, of obtaining a deviation at least as large as the one observed. It is convenient not to reject the simplest law unless this probability is very small. This is the principle of criteria based on "significant deviations".
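The contrast drawn above between maximum likelihood and significance criteria can be made concrete with a small Python sketch. It rests on assumptions that are not in the paper: the observations are taken to be Gaussian with known unit variance, the null hypothesis fixes the mean at 0, the alternative leaves the mean free, and the data are invented illustrative numbers.

```python
import math


def log_likelihood(data, mean):
    """Gaussian log-likelihood with known unit variance (an assumption)."""
    n = len(data)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((x - mean) ** 2 for x in data)


# Invented illustrative observations, not data from the paper.
data = [0.3, -0.8, 1.1, 0.2, -0.4, 0.6, -0.1, 0.5]
n = len(data)

mean_hat = sum(data) / n                 # maximum likelihood value of the extra parameter
ll_null = log_likelihood(data, 0.0)      # simple law: mean fixed at 0
ll_alt = log_likelihood(data, mean_hat)  # more complicated law: mean free

# Significance criterion: probability, under the simple law, of a deviation
# at least as large as the one observed (two-sided test on the mean).
z = mean_hat * math.sqrt(n)
p_value = math.erfc(abs(z) / math.sqrt(2))

print(ll_alt >= ll_null, round(p_value, 3))
```

In this sketch the maximised likelihood of the larger law is never below that of the simple law, so maximum likelihood alone would favor the larger law, while the significance criterion keeps the simple law because the observed deviation is not improbable under it.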
Hence, the simplest law benefits from a favorable prejudice, that is, from having a prior probability that is larger than that assigned to more complex laws. Why is it prejudged more favorably? Sometimes this is the result of our belief in the simplicity of the laws of nature, a belief that may stem from convenience (examples: the COPERNICUS system is more convenient than that of PTOLEMY for understanding the observations and making predictions; the fitting of an ellipse to the trajectory of Mars by KEPLER without consideration of the law of gravitation), or from previous experience.
Consider the example of a fundamental type of experiment in agricultural biology: comparing the yields of two varieties of some crop, by planting varieties V and V' adjacent to each other at a number of points $A_1, \dots, A_N$ of an experimental field, so as to take into account variability in light and soil conditions. If $x_1, \dots, x_N$ and $x'_1, \dots, x'_N$ are the yields of V and V' measured at the N points, two main attitudes are possible when facing the data. Those inclined to believe that the difference between V and V' cannot affect yield will ask themselves if all $x_i$ and $x'_i$ can be reasonably viewed as observed values of two random variables X and X' following the same law; for this, they will adopt a significance test based on the difference between the means, and they will maintain their hypothesis if this difference is not too large. On the other hand, those whose experience leads them to believe that the difference in varieties should translate into a difference in yield will admit a priori that the random variables X and X' are different, introducing right away a larger number of parameters (for example, $\bar{X}$, $\sigma$, $\bar{X}'$, $\sigma'$ if it is accepted that X and X' are Laplacian), and they will be concerned immediately with the estimation of these parameters, in particular $\bar{X} - \bar{X}'$, by the method of maximum likelihood for example (which, in the case of laws of LAPLACE with the same standard deviation, gives as estimator of $\bar{X} - \bar{X}'$ the difference between the arithmetic means of the $x_i$ and $x'_i$); this method assumes implicitly that the prior probabilities of the values of $\bar{X} - \bar{X}'$ are all equal and infinitesimally small, which is quite different from the first hypothesis, where a priori we view the value $\bar{X} - \bar{X}' = 0$ (corresponding to identity of the laws) as having a finite probability.
These two different attitudes correspond to different states of information a priori, that is, to different prior probabilities; the statistical criteria are, thus, not objective, for if they were, there could not be a contradiction between the two: it would not be possible for one to lead to the conclusion that $\bar{X} - \bar{X}' = 0$ and the other to the conclusion that $\bar{X} - \bar{X}' \neq 0$. These discrepancies result from the fact that the criteria are subjective and correspond to different states of information or experience.
We shall now take an example from genetics. A problem of current interest is that of linkage between Mendelian factors. When crossing a heterozygote AaBb with a double homozygote recessive, we observe in the children, if these are numerous, the genotypes ABab, abab, Abab, aBab in numbers $\alpha, \beta, \gamma, \delta$ ($\alpha + \beta + \gamma + \delta = N$), leading to admit that, independently, each child can possess one of the 4 genotypes with probabilities $\frac{1-r}{2}, \frac{1-r}{2}, \frac{r}{2}, \frac{r}{2}$ [...] is, take [...] $dr$ as the probability that $r \neq \frac{1}{2}$ lies between $r$ and $r + dr$, then it is easy to form the posterior probabilities of $r = \frac{1}{2}$ and $r \neq \frac{1}{2}$; the likelihood of r (the probability that a given value r produces numbers $\alpha, \beta, \gamma, \delta$ in the four categories) will be: [...] Of these two, we will retain the hypothesis having the largest posterior probability; if this is hypothesis $r \neq \frac{1}{2}$, we would take as estimate of r, among all values $r \neq \frac{1}{2}$, the one maximizing the posterior probability, that is, the maximizer of the likelihood $2^{-N}(1-r)^{\alpha+\beta}\, r^{\gamma+\delta}$, which has as value $r = \frac{\gamma+\delta}{N}$.
I have deliberately presented the problem in a somewhat shocking manner, emphasizing that the prior probabilities are known. Nevertheless, it cannot be denied that the rule at which we arrive is the one in current use, or at least is numerically close to it³: reject the "null hypothesis" if this gives a large discrepancy with the observations; subsequently, estimate the parameters by maximum likelihood. My objective has been to show on what type of assumptions one operates, willingly or unwillingly, when these rules are applied. Using prior probabilities, it is possible to see the logical meaning of the rules more clearly, and a possibly precarious state of the assumptions made a priori can be thought of as a warning against the tendency to attribute an absolute value to the conclusions (as done by Mr. MATHER, who gives a certain number of rules as being objectively best, even if these are contradictory): we take note of the arbitrariness in the choice of the prior probabilities and in the manner of contrasting the hypotheses $r = \frac{1}{2}$ and $r \neq \frac{1}{2}$; and we also see how the conclusion about the value of r is subjective.
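The linkage calculation can be sketched numerically in Python. The prior used here is an assumption made only for illustration, because the prior stated in the text is not legible in this copy: a prior probability of 1/2 is placed on $r = 1/2$, and the remaining 1/2 is spread uniformly over $0 < r < 1/2$. The counts are invented, and the function names are hypothetical.

```python
import math


def log_likelihood(r, parental, recombinant):
    """Log of (1 - r)^(alpha + beta) * r^(gamma + delta); the factor 2^(-N) cancels."""
    return parental * math.log(1.0 - r) + recombinant * math.log(r)


def posterior_prob_no_linkage(parental, recombinant, prior_half=0.5, steps=5000):
    """Posterior probability of r = 1/2 under the ASSUMED prior described above."""
    width = 0.5 / steps
    density = (1.0 - prior_half) / 0.5  # uniform density on (0, 1/2), an assumption
    # Midpoint rule for the integral of density * likelihood over (0, 1/2).
    marginal = density * width * sum(
        math.exp(log_likelihood((k + 0.5) * width, parental, recombinant))
        for k in range(steps)
    )
    at_half = prior_half * math.exp(log_likelihood(0.5, parental, recombinant))
    return at_half / (at_half + marginal)


# Invented counts of the four genotypes (alpha, beta, gamma, delta).
alpha, beta, gamma, delta = 40, 38, 12, 10
n = alpha + beta + gamma + delta

r_hat = (gamma + delta) / n  # maximizer of (1 - r)^(alpha + beta) * r^(gamma + delta)
p_half = posterior_prob_no_linkage(alpha + beta, gamma + delta)
print(r_hat, p_half)
```

With these counts the hypothesis $r = 1/2$ receives a very small posterior probability, so the estimate $(\gamma + \delta)/N$ is retained; with balanced counts the same function favors $r = 1/2$. A different choice of prior would change the numbers, which is precisely the arbitrariness the text points to.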

OPTIMUM ESTIMATION
We shall now examine another aspect of the question of the rule of maximum likelihood, which Mr. FISHER (7) thought could be justified independently of prior probabilities, with his rule of optimum estimation. Suppose the competing hypotheses are the values of a parameter $\theta$, with each value giving to the observed results E a probability $\pi(E \mid \theta)$ before observation, which is a function of $\theta$, its likelihood function; we will call an estimator of $\theta$, extracted from observations E, any function H of the observations only giving information about the value of $\theta$; as with the observations, this estimator is a random variable before the data are observed, its probability law depending on $\theta$. (In the special case where, once the value H is given, the conditional probability law of E no longer depends on $\theta$, it is unnecessary to give a complete description of E once H is known, because this would not give any supplementary information about $\theta$, and we then say that H is an exhaustive⁴ estimator of $\theta$.) It is said that H is a fair estimator⁵ of $\theta$ if its mean value M(H)⁶ is always equal to the true value of $\theta$, irrespective of what this is. It is said that H is asymptotically fair⁷ if $M(H) - \theta$ becomes infinitesimally small as N increases, N being the number of observations constituting E.
It is said that H is correct⁸ if it always converges in probability towards $\theta$ when N tends towards infinity. (For this, it suffices that H be asymptotically fair and that it has a fluctuation⁹ tending towards 0. Conversely, every fair estimator admitting a mean is asymptotically fair.)
³ [...] règle à laquelle nous arrivons ne soit, aux valeurs numériques des probabilités près, celle qui est d'un usage courant: ...".
⁴ Translator's Note: The English term is "sufficient". MALÉCOT's terminology is kept whenever it is felt that it has anecdotal value, or to reflect his style.
⁵ Translator's Note: Unbiased estimator.
⁶ Translator's Note: It is useful to remember hereinafter that M(expression) denotes the expected value of the expression. The M comes from "moyenne" = mean value.
⁷ Translator's Note: Asymptotically unbiased.
⁸ Translator's Note: Consistent.
⁹ Translator's Note: Fluctuation = Variance.
It is said that H is asymptotically Gaussian if the law of H tends towards one of the type LAPLACE-GAUSS when N increases indefinitely. In statistics, it is frequent to encounter estimators that are both correct and asymptotically Gaussian; we shall denote such estimators as C.A.G. (see DUGUÉ, 5). The precision of such an estimator is measured perfectly by $M[(H - \theta)^2] = \zeta^2$, this becoming infinitesimally small with $\frac{1}{N}$; the precision will increase as $\zeta^2$ decreases, hence as $I = \frac{1}{\zeta^2}$, which will be termed the quantity of information extracted by the estimator, becomes larger.
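The difference between a fair (unbiased) estimator and one that is only asymptotically fair can be illustrated by an exact computation. The Python sketch below is an illustration under assumptions that are not in the paper: the observations are independent 0/1 variables with success probability `p`, the quantity estimated is their variance $p(1-p)$, and the function names are hypothetical. The mean value M(H) is computed exactly by enumerating all samples.

```python
from fractions import Fraction
from itertools import product


def expected_value(estimator, p, n):
    """Exact mean value M(H) of an estimator over all 0/1 samples of size n."""
    total = Fraction(0)
    for sample in product((0, 1), repeat=n):
        k = sum(sample)
        prob = p ** k * (1 - p) ** (n - k)
        total += prob * estimator(sample)
    return total


def variance_divisor_n(sample):
    """Mean squared deviation about the sample mean, divisor N."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    return sum((x - mean) ** 2 for x in sample) / n


def variance_divisor_n_minus_1(sample):
    """Same sum of squares, divisor N - 1."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    return sum((x - mean) ** 2 for x in sample) / (n - 1)


p = Fraction(3, 10)          # hypothetical success probability
true_var = p * (1 - p)       # the quantity being estimated

# M(H) - true value for the divisor-N estimator at N = 2, 4, 8.
biases = [expected_value(variance_divisor_n, p, n) - true_var for n in (2, 4, 8)]
print(biases)
```

The divisor-N estimator has mean value differing from the true value by a term of order 1/N, so it is asymptotically fair but not fair, whereas the divisor N − 1 version has mean value exactly equal to the true value for every N.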
In what follows, we will restrict attention to the case where E consists of N independent observations $x_1, \dots, x_N$, with their distribution functions being a priori: [...] The probability of a set E of observations is: [...] (Stieltjes multiple differential) with [...] the integration covering the entire space $\mathcal{R}_N$ described by the $x_1, \dots, x_N$.
It is then easy to show, with Mr. FRÉCHET (8), that the fluctuation $\zeta^2$ of any fair estimator has a fixed lower bound. Let [...] and, therefore, also k is independent of $\theta$ nearly everywhere, that is, if H is an exhaustive estimator; the general form of laws admitting an exhaustive estimator has been given by Mr. DARMOIS (3), and Mr. FRÉCHET has verified (8) that the exhaustive estimator meets the condition [...]. The condition [...]¹⁴ cannot be met for finite N unless an exhaustive estimator exists. However, Mr. FISHER had shown earlier (7) that it would always exist, or at least that the condition would be met asymptotically when $N \to \infty$, when an estimator is obtained by producing as a function of E a value of $\theta$ which maximizes the likelihood function $\pi(E \mid \theta)$, that is, by applying the rule of maximum likelihood; this estimator $H_0$, being C.A.G. under fairly wide conditions, and its fluctuation [...] being asymptotically smaller than or equal to that of any other such estimator, would be in the limit one of the most precise C.A.G. estimators and would merit the name of optimum estimator. Its amount of information will be [...]. For any other C.A.G. estimator obtained from the same observations E and with amount of information $\frac{1}{\zeta^2}$, the ratio [...], which is smaller than or equal to 1, will be called the "efficiency" of the estimator; it gives the loss of precision accruing from using an estimator other than the optimum.
¹⁴ Translator's Note: This is a typographical error since the [...]'s were defined as random variables. The correct expression is $\zeta^2 = \frac{1}{\sigma^2}$.
We shall now give a rigorous and general presentation of Mr. FISHER's theory, extending results of Mr. DOOB and of Mr. DUGUÉ (5). Let $g(x_i, \theta)$ be a function of the random variable $x_i$ and of the unknown parameter $\theta$, and suppose that the N random variables $g(x_i, \theta)$ have true means for each value of $\theta$ that are "equally convergent", that is, that the N probabilities [...] have an upper bound given by a function $p(t)$ independent of i which generates a finite integral $\int_0^{+\infty} t\,dp(t)$. If we suppose that [...] tends towards a limit $\varphi(\theta)$ as $N \to \infty$, for every value of $\theta$ in an interval A...B, the extension of a result of Mr. KOLMOGOROFF (9)¹⁵ shows that the quantity [...] deduced from N observations $x_1, \dots, x_N$ tends almost surely, when $N \to \infty$, towards $\varphi(\theta)$. If one supposes that the $g(x_i, \theta)$ are almost surely functions of $\theta$ with variation bounded by the same fixed number K ("equally bounded variation", the same holding for $\Psi(\theta, N)$), an extension of POLYA-CANTELLI's theorem shows that when $N \to \infty$, $\Psi(\theta, N)$ converges almost surely towards $\varphi(\theta)$ in the interval A...B¹⁶, which means that the probability that [...] tends towards 1 as $N_0 \to \infty$, whatever the value of $\theta$ is and for $N > N_0$ (q being an arbitrary, fixed, number).
¹⁵ Translator's Note: The English spelling is KOLMOGOROV.
¹⁶ This holds even if there are discontinuities (of the first kind) by considering, instead of the value at $\theta$, the limiting values at right and left (supposed to satisfy the same conditions). In what follows, it will be convenient to represent by $\varphi(\theta)$ the set of values comprised between $\varphi(\theta - 0)$ and $\varphi(\theta + 0)$, and by $\Psi(\theta, N)$ the set of values comprised between $\Psi(\theta - 0, N)$ and $\Psi(\theta + 0, N)$.
Consider now a root $\theta_0$ of $\varphi(\theta)$, suppose that it can be found and that it corresponds to a change of sign of $\varphi(\theta)$: more precisely, suppose that in every interval $\theta_1 \dots \theta_2$ surrounding $\theta_0$ there is at least one value between $\theta_1$ and $\theta_0$ for which $\varphi(\theta)$ is negative, and that there is at least one value between $\theta_2$ and $\theta_0$ for which it is positive. If we let [...] be the smallest of the two corresponding $|\varphi(\theta)|$, it follows from the preceding that, for $N > N_0$, the probability that all the $\Psi(\theta, N)$ change from positive to negative inside the interval $\theta_1 \dots \theta_2$ and, therefore, vanish there (in view of the statement in the preceding footnote, for the points in which there is discontinuity), tends towards 1 when $N_0 \to \infty$. Because the interval $\theta_1 \dots \theta_2$ in the neighborhood of $\theta_0$ can be taken to be arbitrarily small, this means that the equation $\Psi(\theta, N) = 0$ admits at least a root converging almost surely to $\theta_0$ when $N \to \infty$.
It is possible to go further if one supposes that the quantities $\frac{\partial g(x_i, \theta)}{\partial \theta}$ and, hence, $\frac{\partial \Psi}{\partial \theta}$ are almost surely uniformly continuous with respect to $\theta$, with equally bounded variation in A...B, and that these have "equally convergent true means". It follows easily that $\frac{\partial \Psi}{\partial \theta}(\theta, N)$ converges almost surely and uniformly towards a continuous function which is surely the derivative of $\varphi(\theta)$, that is, $\varphi'(\theta)$, and then that one can associate to every $\varepsilon$ an interval $\theta_0 - a$ and $\theta_0 + a$ such that the probability that [...] for all $N > N_0$ and for all $\theta$ between $\theta_0 - a$ and $\theta_0 + a$ tends towards 1 when $N_0 \to \infty$. Now, from the formula of finite increments, these inequalities imply, for $N > N_0$ and for all $\theta$ between $\theta_0 - a$ and $\theta_0 + a$: [...] (where D is the fixed number $\varphi'(\theta_0)$); this shows that the equations $\Psi(\theta, N) = 0$ will have, for $N > N_0$ and within the interval $\theta_0 - a$ and $\theta_0 + a$, a single root, and that this root will be each time between [...] provided that these quantities take values between $\theta_0 - a$ and $\theta_0 + a$: this will be attainable with probability tending to 1 when $N_0 \to \infty$ because $\Psi(\theta_0, N)$ tends almost surely to $\varphi(\theta_0) = 0$. Hence, it is seen that the equation $\Psi(\theta, N) = 0$ admits only one root $\theta_N$ tending almost surely to $\theta_0$; the probability that (for each value of $N > N_0$) this root is equal to [...] with $\varepsilon_1 < \varepsilon$, tends towards 1 when $N_0 \to \infty$ irrespective of the value of $\varepsilon$; $\theta_N$ is then a correct estimator of $\theta_0$¹⁷.
Let us make now the following additional assumptions: the N random variables $g(x_i, \theta_0)$ constitute a normal family in the sense of Mr. P. LÉVY (for this, it suffices to suppose, using the notation of Mr. P. LÉVY, that $\int_0^{\infty} t^2\,dp(t)$ is finite, which implies that the fluctuations $\sigma_i^2$ of the random variables $g(x_i, \theta_0)$ are a bounded set) and that the fluctuation $\sigma^2 = \sum_i \sigma_i^2$ of their sum $\sum_{i=1}^{N} g(x_i, \theta_0) = N\,\Psi(\theta_0, N)$ increases indefinitely with N. It is known (P. LÉVY, 11) that then the type of law of this sum tends to a Gaussian one, and one can deduce easily (DUGUÉ, 5) that this law is the same as that of [...]; $\theta_N$ is, thus, not only a correct estimator of $\theta_0$ but C.A.G. as well. Because [...] which shows that $\sigma^2 = -N\varphi'(\theta_0)$, from where [...] hence, for a sufficiently large N, [...], the maximum likelihood estimator is among the estimators having a minimum fluctuation. Henceforth, we will call this an optimal estimator.
Suppose in particular that two sets with $N_1$ and $N_2$ observations, respectively, have been collected, and that the observations within each set follow the same law, that is, there are laws $dF_1$ and $dF_2$. The maximum likelihood equation for the entire collection of observations is: [...] and put [...] This gives the solution: [...] If we let $\theta_{N_1}$ and $\theta_{N_2}$ be the estimators obtained from each of the two sets separately, one has [...] The optimum estimator for the entire data set is, thus, the weighted average of the optimum estimators obtained from each of the individual sets, with the weights being $N_1\sigma_1^2$ and $N_2\sigma_2^2$, that is, the reciprocals of the fluctuations $\zeta_1^2$ and $\zeta_2^2$ ("quantities of information") of the two estimators. One finds the classical rule for combining observations deduced by GAUSS from a principle identical to that of maximum likelihood.
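The combination rule stated above (weight each estimator by the reciprocal of its fluctuation) can be checked in a simple case. The Python sketch below assumes, purely for illustration, two sets of Gaussian observations with a common unknown mean and known standard deviations; the data and the function name are hypothetical. It compares the weighted average of the two separate maximum likelihood estimates with the maximum likelihood estimate computed from the pooled data.

```python
def combine(estimates, variances):
    """Weighted average with weights equal to the reciprocals of the variances.

    Returns the combined estimate and its quantity of information
    (the sum of the individual quantities of information).
    """
    weights = [1.0 / v for v in variances]
    information = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, estimates)) / information
    return estimate, information


# Invented observations; the standard deviations are assumed known.
set1, sigma1 = [10.2, 9.8, 10.5, 10.1], 0.5
set2, sigma2 = [9.0, 11.5, 10.8, 9.9, 10.6, 10.0], 1.0

mean1 = sum(set1) / len(set1)   # maximum likelihood estimate from set 1 alone
mean2 = sum(set2) / len(set2)   # maximum likelihood estimate from set 2 alone
var1 = sigma1 ** 2 / len(set1)  # fluctuation of mean1
var2 = sigma2 ** 2 / len(set2)  # fluctuation of mean2

combined, information = combine([mean1, mean2], [var1, var2])

# Maximum likelihood estimate from the pooled data: root of
# sum((x - theta) / sigma1^2) + sum((x - theta) / sigma2^2) = 0.
pooled = (sum(set1) / sigma1 ** 2 + sum(set2) / sigma2 ** 2) / (
    len(set1) / sigma1 ** 2 + len(set2) / sigma2 ** 2
)

print(combined, pooled, information)
```

In this Gaussian case the agreement is exact for any sample sizes; in the general setting of the text it holds asymptotically.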
This result highlights again that the rule of maximum likelihood is not valid if applied to only a part of the observations, as the only result worth keeping is that pertaining to the entire set of observations. The rule of maximum likelihood is just a particular case of the rule of the most likely value; it is the special case where all information about $\theta$ comes through the observations E, while knowledge K obtained previously does not contribute at all, so a uniform prior probability is assigned to $\theta$. Furthermore, it must be observed, with Mr. JEFFREYS, that if one takes any continuous probability law for $\theta$, $h(\theta)\,d\theta$, having continuous first and second derivatives, the effect of this law on the estimator obtained using the rule of the most likely value with N independent observations is negligible as $N \to \infty$. In fact, if we let E denote the set of such N observations, and let $\pi(E \mid \theta)$ be the corresponding likelihood function, the posterior probability of a value $\theta$ will be proportional to $\pi(E \mid \theta)\,h(\theta)\,d\theta$, so the most likely value will, thus, be the root of the equation [...] from where, putting $l(\theta) = \frac{\partial \log h}{\partial \theta}$ and rearranging the calculations on page 54¹⁹ slightly, the estimator based on the most likely value is [...] If $h(\theta_0) \neq 0$, $l(\theta_0)$ and $l'(\theta_0)$ are bounded, so when $N \to \infty$, $\Theta_N - \theta_0 \sim \theta_N - \theta_0$, with $\theta_N$ being the maximum likelihood estimator; the influence of the prior probability law becomes negligible.
However, it must be emphasized that for large but finite N this influence is negligible only if $l(\theta_0)$ and $l'(\theta_0)$ are sufficiently small relative to N; on the other hand, if $l'(\theta_0)$ is of the order of N, that is, if the curve representing $\log h$ and, hence, that representing $h(\theta)$ (elementary prior probability) has a sharp peak, this is not so; it is patent, furthermore, that in this case, with the observations K made before E having already given precise information about $\theta$, the maximum likelihood estimator $\theta_N$ deduced from only E is not the best; it is necessary to combine E with the previous observations by applying the rule of the probable value²¹, which gives the value $\Theta_N$.
¹⁸ Translator's Note: MALÉCOT refers to the mode of the posterior distribution.
¹⁹ Translator's Note: The reference is to the page of the original paper. MALÉCOT is pointing towards the developments leading to: [...] in connection with maximum likelihood estimation.
²⁰ Translator's Note: The meaning of elementary, an adjective used often by French mathematicians, is unclear here. Presumably, MALÉCOT means density, an infinitesimally small element of a probability (in the continuous case).
Because the mean value of $\Theta_N$ is [...] with $\varepsilon_1$ being almost surely uniformly small with N, its fluctuation will be [...] This can be larger or smaller than $\zeta^2 = \frac{1}{\sigma^2}$ (fluctuation of $\theta_N$) depending on whether $l'(\theta_0)$ is > 0 or < 0, that is, depending on whether the true value $\theta_0$ lies in the neighborhood of a "valley" or of a "peak" of the curve representing the prior probability $h(\theta)$. In the case where the fluctuation of $\Theta_N$ is smaller than $\zeta^2$, there is no contradiction with the result given on page 50²², because this result establishes that $\zeta^2$ is the minimum fluctuation for all estimators H such that $M(H) = \theta$ for any $\theta$; it can be expected that when one does not have any prior knowledge about the true value $\theta_0$ of $\theta$ the precision of the best estimator will be $\zeta^2$. On the other hand, if one knows that a value $\theta_0$ is more probable than others, the condition $M(H) = \theta$ for any $\theta$ can be a nuisance²³ and give less precision than when one would try to estimate in a region near the most probable value.
²¹ Translator's Note: The author probably means "the most probable value".
²² Translator's Note: This is the page of the original paper where the lower bound for the variance of an unbiased estimator is presented.
²³ Translator's Note: MALÉCOT employs the term "parasite". Although descriptive, such a term is not a part of the statistical lexicon in English.
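The behavior discussed in this section (the prior law matters little for large N unless it has a sharp peak) can be illustrated with a case where the most probable value has a closed form. The Python sketch below assumes, for illustration only, Gaussian observations with known standard deviation and a Gaussian prior law for the parameter; the sample mean is held fixed across N to isolate the effect of N, and all names and numbers are hypothetical.

```python
def most_probable_value(prior_mean, prior_sd, sample_mean, sigma, n):
    """Mode of the posterior law for a Gaussian mean with a Gaussian prior.

    It is a precision-weighted average of the prior mean and the sample
    mean (the latter being the maximum likelihood estimate).
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / sigma ** 2
    return (prior_precision * prior_mean + data_precision * sample_mean) / (
        prior_precision + data_precision
    )


sample_mean, sigma, prior_mean = 2.0, 1.0, 0.0   # hypothetical values
sizes = (10, 100, 1000)

# Smooth prior law: its influence fades as N grows.
gaps_smooth = [
    abs(most_probable_value(prior_mean, 1.0, sample_mean, sigma, n) - sample_mean)
    for n in sizes
]

# Sharply peaked prior law: its precision is of the order of N or larger,
# so the maximum likelihood estimate alone is far from the most probable value.
gaps_sharp = [
    abs(most_probable_value(prior_mean, 0.01, sample_mean, sigma, n) - sample_mean)
    for n in sizes
]

print(gaps_smooth, gaps_sharp)
```

With the smooth prior the gap between the two estimates shrinks roughly like 1/N, while with the sharply peaked prior it remains large even at N = 1000, which matches the warning that earlier precise information cannot be ignored.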

THE PROBLEM OF INDUCTION
The decreasing importance of the prior probability as the number of observations increases describes certain aspects of the problem of induction in a remarkably clear manner. This problem consists essentially of extracting from the results observed a law summarizing them (and which also allows future results to be forecast); this law is never dictated by the observed results, rather, it is a construction of the mind chosen for reasons of simplicity or convenience (naturally taking into account all previous experience); one can always suppose many laws; these play the role of the different hypotheses $\theta_i$ of our scheme; each of these, if formulated with sufficient precision, generates the observed results E with a known probability $P(E \mid \theta_i K)$, the likelihood of $\theta_i$. The choice between the $\theta_i$ is dictated by the posterior probabilities $P(\theta_i \mid EK)$, depending both on the likelihoods, which are objective (because these depend only on the observations), and on the prior probabilities $P(\theta_i \mid K)$, which are more or less subjective; the evaluation of likelihoods is deductive (often in its more refined form, the mathematical deduction); however, the subjective part always enters in the evaluation of prior probabilities, illustrating wonderfully that every induction is subjective. It is true that when the number of observations increases, the subjective part decreases, as we saw previously. Further, the prior probabilities can be, right away, in more or less agreement with subsequent experience; when KEPLER viewed as very probable that an ellipse would fit his observations on Mars, he was in immediate agreement with all subsequent astronomical observations; on the other hand, the a priori belief that planets moved in circles around the earth led PTOLEMY and his predecessors to formulate laws which, by integrating all past observations, made it difficult, because of their complexity, to predict subsequent observations.
The scheme a priori was excessively subjective and had to be updated constantly in order to account for new observations. These examples show that as science progresses, that is, as new observations accumulate, its subjective part diminishes, although it would be an illusion to believe that it could be eliminated totally. In fact, experimental progress always allows us to choose, in the long run, between several hypotheses that have been formulated completely (by evaluating their likelihood deduced from all observations made), but we will always be incapable of formulating precisely (that is, making their consequences explicit) all possible hypotheses and, consequently, of calculating the likelihoods of all hypotheses. This is the reason why every law, every possible physical theory, will always become inadequate for explaining new facts: it has been chosen as the most likely of all the laws among those that can be formulated, but more advanced experimentation will make it appear less likely than new laws that one would be led to formulate; in this way, the system of PTOLEMY was replaced by that of KEPLER-NEWTON, and then by relativistic mechanics. Each law is valuable for representing both the old field of observations and the new field motivating it; however, the law cannot pretend to represent the totality of future observations, because it is not more than a choice between a small number of laws that our mind conceives and, because of the weakness of our senses and of our mind, these laws are rough and incomplete blueprints of the rich complexity of natural phenomena. Of course, as experiences develop, the increasing finesse of our theories molds reality better but cannot pretend to grasp it completely. "There are more things in heaven and earth than in all our philosophy".
There is more complexity in the mechanisms of nature than we can think of, and all the laws that we can construct, even if better than the preceding ones, are just an approximation to reality, an approximation that will eventually become insufficient. OHM's law, although translating electrodynamic phenomena remarkably well at our scale, becomes inadequate when an extension of our senses places us at the scale of the electrons, so it becomes just a statistical law. Is it not possible that even the laws of atomic physics behave eventually as statistical laws? A scientific law is never "true", that is, a definitive one; it is only more or less convenient for representing and anticipating phenomena viewed at a certain scale. When it is said that "a physical theory is justified by its consequences", this only has a relative meaning, that is, that among all theories formulated, this is the one having consequences that agree best with the observations. In induction, there are two very distinct parts: a deductive part that formulates the consequences of each hypothesis considered, and a part that is not amenable to deduction and which postulates hypotheses and assigns prior probabilities to these; there is where the genius of invention and the mind are manifested; then, the rest consists in choosing the most probable hypothesis after the consequences. The rule of the "most probable hypothesis" underlies every induction, translating precisely the logic of induction and, at the same time, highlighting its subjectivity. It does not seem possible to take the rule of maximum likelihood as a base of the logic of induction, as Mr. FISHER does, because this rule applied to different series of measurements will lead to contradictory consequences (and must be completed using significance tests, which are in contradiction with this rule!), while a logic must be a set of principles from which one can accept all consequences, this being certainly the case, as we have argued, for a logic based on BAYES' formula.

"SUBJECTIVE" AND "OBJECTIVE" PROBABILITIES
If, with Mr. DE FINETTI (6), we view probability theory as a "logic of subjective judgements", how is it possible to have an agreement between statements derived from this logic and the objective reality? This is the objection made frequently to the formula of BAYES. The arbitrary form in which prior probabilities are evaluated confers a similar arbitrariness to the evaluation of posterior probabilities. Now, aren't there events whose probabilities have an objective meaning, as suggested by an agreement between observed frequencies and probabilities assigned by an a priori reasoning? We believe that the remarks made previously permit responding to this objection. Every evaluation of probabilities is a construct of the mind, relative to a theoretical setting imagined by the mind to limit our ignorance, and based on the principle of indifference. For example, the statement that the value 6 in the toss of a die has a probability of $\frac{1}{6}$ is, at the same time, the result of ignorance about the movement of the die in the dice-box, and of the statement that there is no reason to believe that this movement favors a side over the others, hence all sides observed. At any rate, in the evaluation of probabilities, there will always be hypotheses a priori that, although more or less suggested by previous observations, will never dominate absolutely, will never be certain a priori, this being so because it is never possible to know the totality of circumstances giving rise to a phenomenon.
(In passing, we dismiss the objection that it is not possible to speak about "probabilities of causes" because these would not be "random", one must be "true" and the others "false": if one admits determinism, the same is true of the effects; in fact, it is not the phenomena that are random, rather, it is the knowledge that we have about them; the probabilistic logic attempts to identify the limits of our ignorance). The role of experimentation is to confirm or question some of the assumptions made or, more generally, to update their probabilities; if one of these appears clearly as more probable than the others, it would be retained as the best, but it should be kept in mind that this superiority is temporary, and that the hypothesis could be demolished by subsequent experimentation. For example, consider games of chance, such as playing dice, to illustrate ideas. Experience has led us to abandon the hypothesis, which perhaps may be natural for a primitive mind, that there is an influence of the player on the outcome, and to adopt the assumption that all sides of the die are equally likely, as the best explanation for the observed results. However, WELDON's experiments show, in turn, that this assumption is false, as the theoretical scheme of the perfect die does not hold in practice; there are always some sides that are favored: the probability of 1/6 is then relative to a theoretical scheme deduced from reality by abstraction and simplification, and it will never be the limit of the observed frequencies.
What makes the theoretical scheme appealing is its convenience: with everything kept simple, it summarizes with sufficient precision the main aspects of an experiment, and it can be expressed through formulae that are simple and, at the same time, that allow making forecasts having a good precision.
As stated by Mr. DARMOIS (2): "making a probability calculation in a specific case requires seeing clearly all that it is necessary to know, such that the study follows closely the essential circumstances of the phenomenon considered". Thus, the evaluation of a probability always results from a theoretical scheme permitting to assess, with more or less precision, the equal or unequal probability; it is completely legitimate, as stated by Mr. BOREL, to evaluate the probability of an isolated event provided that a scheme can be conceived where this probability is related to other known ones (for example in a lottery scheme) 24. However, the probabilities thus calculated will not be in reasonable agreement with the observed frequencies unless the theoretical scheme is in sufficient agreement with the real mechanism, for example, the equi-probable cases corresponding with the equally frequent cases, and this will happen when the scheme has been established after considering a sufficiently large number of experiments. It is in this situation that an "agreement between individual opinions" (DE FINETTI) or an "agreement between equally well informed minds" will be obtained, a condition that Mr. BOREL attaches to an "objective probability" (which, furthermore, is not a sufficient condition because errors of judgment or of expertise can be committed unanimously).

24 Translator's Note: It is unclear what MALECOT means here. In the original paper, he stated: "Ainsi l'évaluation d'une probabilité résulte toujours d'un schéma théorique permettant d'évaluer, avec plus ou moins de précision, l'égale ou l'inégale probabilité ; il est tout à fait légitime, comme le remarque M. BOREL, d'évaluer la probabilité d'un événement isolé dès qu'on peut concevoir un schéma ramenant cette probabilité à d'autres connues (par exemple, un schéma du tirage au sort)".
On the other hand, if the scheme is established from a weak knowledge about facts, the probabilities that can be deduced run the risk of not bearing any relationship with reality. This is what leads Mr. DE FINETTI to write: "if one does not want to take subjective factors into account explicitly, the question should be abandoned, by stating that it is not sensible". This is scarcely a reason (rather the opposite) for rejecting the formula of BAYES, since there is a need for adopting a position (DE FINETTI, (6), p. 26) 25. The question brings into perspective the subjectivity of this view, as it was done in the linkage example. Also, the criticism of the formula made by Mr. NEYMAN (15) is somewhat surprising. Mr. NEYMAN takes as an example a set of individuals I, all dominant for a Mendelian factor 26; it is wished to use those having the homozygote genotype AA, and to discard the hybrid types (Aa); to do this, each I is crossed with an aa, and the k descendants from this cross are observed; if aa types are observed within these, then I is discarded, naturally; on the other hand, I is kept if the k descendants are of the dominant type. However, in so doing, some of the individuals I kept will be of the undesirable type Aa; the problem is the evaluation of the risk of such an error. Because an I of the AA type produces only dominant descendants, and an I of the type Aa gives k descendants that are all dominant with probability $(\frac{1}{2})^k$, the posterior probability of keeping an I which will be Aa, using BAYES formula, and letting $p_0$ be the prior probability that I is Aa, will be:

$$p_1 = \frac{p_0 \left(\frac{1}{2}\right)^k}{(1 - p_0) + p_0 \left(\frac{1}{2}\right)^k}$$

It is clear that if $p_0$ is "objective", that is, if it reflects an observable frequency, then $p_1$ provides a forecast of the frequency of errors. If, for example, it is known that the I individuals come from crossing heterozygotes, one would take $p_0 = \frac{2}{3}$, representing the frequency of heterozygotes in a large number of individuals I examined. Then:

$$p_1 = \frac{\frac{2}{3}\left(\frac{1}{2}\right)^k}{\frac{1}{3} + \frac{2}{3}\left(\frac{1}{2}\right)^k} = \frac{1}{1 + 2^{k-1}}$$

would sensibly represent the proportion of individuals that, although kept, possess the Aa type, that is, the proportion of errors. However, if the origin of I and, hence, $p_0$ is unknown, the equation evidently loses part of its specific meaning. Should one, then, with Mr. NEYMAN, declare it useless? 27 It is clear at the outset that no other formula, in the absence of additional experiments, can give us the proportion of errors, because from the equation, this is linked to $p_0$, and this is unknown. Any estimation of error needs a judgement, explicit or not, about the value of $p_0$, and in the formula of BAYES this judgement must be made explicit. The formula shows, for example, that if $k = 6$, the statement that there is at least 1 error in 65 is equivalent to stating that $p_0 \geq \frac{1}{2}$, which may or may not be viewed as reasonable depending on the information available about how the individuals I were obtained. Neither of the two statements has a stronger foundation than the other, and any reasoning attempting to give more credibility to the first one would be erroneous. BAYES formula, establishing an exact correspondence between the "prior" and the "posterior" probabilities, shows clearly that a judgement based on the latter ones is equivalent to a judgement on the former ones, and that this is unavoidable, except in some special cases to be discussed in Section 7. Further, this formula has value for the interpretation of subsequent experiments: if these involve a genetic analysis of the individuals I kept, from which it follows that the frequency of errors can be evaluated, this leads to an "objective" value of $p_1$, that is, of the composition of the initial population, information which may be precious for other experiments.

25 Translator's Note: I have translated "adopter une ligne de conduite" as "for adopting a position".
26 Translator's Note: Although perhaps obvious from the context, MALECOT means that the set I includes individuals with at least a copy of the allele A.
27 Translator's Note: The author refers to BAYES formula here.
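The correspondence between prior and posterior probabilities in Mr. NEYMAN's breeding example can be checked numerically. The following is a minimal sketch, not part of the original paper: the function name `posterior_heterozygote` is an illustrative assumption, and the prior 2/3 is the standard Mendelian proportion of heterozygotes among the dominant offspring of heterozygote crosses.

```python
from fractions import Fraction


def posterior_heterozygote(p0, k):
    """Posterior probability that a kept individual I is Aa.

    p0: prior probability that I is Aa (1 - p0 is the prior for AA).
    k:  number of offspring of the cross I x aa, all observed dominant.
    """
    half_k = Fraction(1, 2) ** k  # P(k dominant offspring | I is Aa)
    # P(k dominant offspring | I is AA) = 1, hence the bare (1 - p0) term.
    return p0 * half_k / ((1 - p0) + p0 * half_k)


# Prior 2/3 (I comes from crossing heterozygotes), k = 6 offspring.
print(posterior_heterozygote(Fraction(2, 3), 6))  # 1/33
# Prior 1/2, k = 6: one error in 65 among the individuals kept.
print(posterior_heterozygote(Fraction(1, 2), 6))  # 1/65
```

Exact fractions are used so that the equivalence between "1 error in 65" and a prior of 1/2 appears without rounding.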

6. NEYMAN'S POINT OF VIEW
After having shown that the statistical ideas advanced by Mr. FISHER's school of thought cannot be justified logically without introducing the "rule of the most probable value" deduced from BAYES formula, we will consider now the methods with which Mr. NEYMAN has thought it possible to bypass this formula while providing "objective" criteria, expressible in terms of frequencies. The problem, as posed by Mr. NEYMAN, is to decide if a hypothesis $H_0$ is to be "rejected" or "accepted" according to whether the point $E$ having as coordinates the $N$ observed values $x_1, \ldots, x_N$ is found inside of a certain "critical region" $w$ or inside of a complementary region $\bar{w}$ of the $N$-dimensional space $R_N$ ("observations space") (classical examples: significance of the difference between a theoretical mean and an observed mean, by comparing their difference with their standard error; assessment of goodness of fit with the $\chi^2$ method). This decision can produce an error in two different manners: if $H_0$ is rejected when it holds true, one makes a type-1 error (the only one that is classically taken into account in the two preceding examples). If one accepts $H_0$ when it is false, a type-2 error results. The idea of Mr. NEYMAN is to evaluate the probabilities of these two errors separately and "objectively", that is, to predict their frequencies (by deduction and not by induction, as emphasized by Mr. NEYMAN).
Consider the case where the hypothesis to be examined concerns the value of a parameter $\theta$ intervening in the probability law $f(x, \theta)$ taken for each observation $x$. Because the function $f$ is supposed to be known, one can calculate, as a function of $\theta$, the probability that the point $x_1, \ldots, x_N$ falls in the critical region $w$. This probability, $P(E \subset w \mid \theta) = \beta(\theta, w)$, is called the "power function" of the criterion based on $w$. If the hypothesis $H_0$ to be examined attributes a value $\theta_0$ to the parameter, the probability of a type-1 error calculated under hypothesis $H_0$ will be $\beta(\theta_0, w)$, and that of a type-2 error, calculated supposing that the true value is $\theta_1$, will be $\beta(\theta_1, \bar{w}) = 1 - \beta(\theta_1, w)$.
Mr. NEYMAN proposes first to reduce the probability of errors of the first type to a fixed, sufficiently small value, $\alpha$, defining a family of "equivalent critical regions" $w$ in terms of the formula $\beta(\theta_0, w) = \alpha$; then, to attempt to choose one of these regions such that the type-2 error is as small as possible, and this for any $\theta_1$ in a certain domain; hence, this defines a criterion that is "uniformly most powerful" in this domain (but this criterion exists only for very specific laws $f$ and provided that the domain is restricted sufficiently. This is the reason why the domain is often restricted to the neighborhood of $\theta_0$). Our first criticism is as follows: why would one want first to minimize the type-1 error? Mr. NEYMAN points to a case where the consequences of a type-1 error would be much more important than those of a type-2 error: for a pharmacological product which, by accident, can contain a toxic substance, and which has been assayed previously on some animals, it is essential not to discard the hypothesis $H_0$: "the product is dangerous", when it is accurate; however, the consequences are not serious if this hypothesis is kept, even if it is false; the problem is, then, essentially, one of reducing the type-1 error. However, this is a very particular situation. In general, the cases where one will be concerned about the type-1 error are those where a priori there are strong reasons to believe that $H_0$ is accurate: in fact, reducing the type-1 error leads, most of the time, to an increase of the type-2 error in the neighborhood.
If one can vary $\theta$ in a continuous manner and if $\beta(\theta, w)$ is a continuous function of $\theta$, the two errors become evident in the curve representing the function, because the corresponding probabilities are, respectively, the ordinate at abscissa $\theta_0$ (where $\theta_0$ is the value under scrutiny) and the complement to 1 of the ordinate with abscissa $\theta_1$ ($\theta_1$ = true value); even if the region $w$ is chosen such that one has a uniformly most powerful criterion, in those rare cases where it exists, it is still true that a reduction of $\alpha$ will cause in general a reduction of the neighboring ordinates, that is, an increase of the type-2 error, provided the true value $\theta_1$ is not too far from the value $\theta_0$ under scrutiny. For example, in the estimation of linkage, it is frequent to reject the hypothesis $r = \frac{1}{2}$ if the estimate of $r$ obtained from the experiments is away from $\frac{1}{2}$ by more than $\lambda$ times its standard error. The larger $\lambda$ is, the smaller the risk of rejecting the hypothesis $r = \frac{1}{2}$ if this holds; however, there will be some risk of discarding the hypothesis that $r$ has a value other than $\frac{1}{2}$ but near $\frac{1}{2}$ when this hypothesis is true. In general, the weight to be assigned to the two types of error, that is, the choice of $\alpha$, depends inevitably on assumptions made a priori about the probabilities of $H_0$ and of the other hypotheses. The method of Mr. NEYMAN cannot claim to give an "objective" judgement about $H_0$; its appeal resides in making the distinction between the two distinct classes of error, but it is incapable, in the absence of any consideration a priori, of assigning appropriate weights to the two; now, the clearest manner of incorporating a priori considerations is to introduce prior probabilities; if these are subjective, so be it.
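The trade-off between the two errors can be made concrete with a power function computed in closed form. The sketch below is only an illustration under stated assumptions that are not in the original text: a one-sided test on the mean of Gaussian observations with known standard deviation `sigma`, and hypothetical values for `theta0`, `theta1` and `n`.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Gaussian law


def power(theta, theta0, alpha, n, sigma=1.0):
    """Probability of rejecting H0: theta = theta0 when the true mean is theta.

    One-sided criterion: reject when the mean of n observations exceeds
    theta0 + z * sigma / sqrt(n), with z the (1 - alpha) Gaussian quantile.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    shift = (theta - theta0) * n ** 0.5 / sigma
    return 1 - STD_NORMAL.cdf(z_crit - shift)


theta0, theta1, n = 0.0, 0.2, 25  # hypothetical values, theta1 near theta0
for alpha in (0.10, 0.05, 0.01):
    type1 = power(theta0, theta0, alpha, n)      # ordinate at theta0, equals alpha
    type2 = 1 - power(theta1, theta0, alpha, n)  # complement of ordinate at theta1
    print(f"alpha={alpha:.2f}  type-1={type1:.3f}  type-2={type2:.3f}")
```

Running the loop shows the type-2 error at the neighboring value `theta1` growing as `alpha` is reduced, which is the behavior described in the paragraph above.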
Let us go further: this method not only does not permit to evaluate the global frequency of errors in the absence of knowledge of prior probabilities, as acknowledged by Mr. NEYMAN, but it does not allow evaluation of the frequency of errors of each type and, contrary to what seems to be stated by Mr. NEYMAN, it does not furnish any observable frequency. In fact, $\beta(\theta_0, w)$ just measures the frequency of errors of the first type that would take place if $H_0$ were always true; $1 - \beta(\theta_1, w)$ measures the frequency that the errors of the second type would have provided the hypothesis $\theta = \theta_1$ were always true; now, in practice, we do not have any certainty about these hypotheses, this being precisely the reason why we wish to arrive at a probabilistic judgement about these; hence, we are incapable of predicting to what extent the real frequencies of these errors correspond to the preceding probabilities unless, naturally, one knows for the different values of $\theta$ the "objective" prior probabilities, that is, expressible in terms of frequencies.
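Let $K$ be the prior probability that the hypothesis $\theta = \theta_0$ holds and $(1 - K)\,dg(\theta_1)$ (STIELTJES' differential) be the prior probability that $\theta = \theta_1 \neq \theta_0$ ($\int_L dg(\theta_1) = 1$, with $L$ denoting the domain of variation of $\theta_1$, excluding $\theta_0$); the posterior probabilities, when it is known that the observations have given a result falling in $w$, are respectively proportional to 28:

$$K\,\beta(\theta_0 \mid w) \quad \text{and} \quad (1 - K)\,\beta(\theta_1 \mid w)\,dg(\theta_1)$$

giving as posterior probabilities of the errors of the first and second types:

$$\frac{K\,\beta(\theta_0 \mid w)}{K\,\beta(\theta_0 \mid w) + (1 - K)\int_L \beta(\theta_1 \mid w)\,dg(\theta_1)}$$

(probability that $H_0$ is true given that the observations fall in $w$, leading to rejection of $H_0$).

$$\frac{(1 - K)\int_L \beta(\theta_1 \mid \bar{w})\,dg(\theta_1)}{K\,\beta(\theta_0 \mid \bar{w}) + (1 - K)\int_L \beta(\theta_1 \mid \bar{w})\,dg(\theta_1)}$$

(probability that $H_0$ is false given that the observations fall in $\bar{w}$, leading to acceptance of $H_0$).

28 Translator's Note: Without warning, MALECOT changes the notation $\beta(\theta, w)$ to $\beta(\theta \mid w)$ hereinafter.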

The posterior probability of any error is:

$$P = K\,\beta(\theta_0 \mid w) + (1 - K)\int_L \beta(\theta_1 \mid \bar{w})\,dg(\theta_1) = K\alpha + (1 - K)\int_L \beta(\theta_1 \mid \bar{w})\,dg(\theta_1)$$

It is seen that the prior probabilities ($K$ and $g(\theta)$) intervene in an essential manner in the expected frequencies of the two errors and in the weights to be assigned to these. The coefficients by which $\beta(\theta_1 \mid w)$ 29 and $\beta(\theta_1 \mid \bar{w})$ must be weighted are the prior probabilities $K$ and $(1 - K)\,dg(\theta_1)$; the choice of the size of $\alpha$, for which Mr. NEYMAN does not offer any guidance, is implicitly equivalent to an assumption about the prior probability $K$ of $\theta_0$; by considering only the type-1 error and minimizing $\alpha$ (as in the usual case of evaluating the significance of deviations, or in the $\chi^2$ test) this is equivalent to supposing that $K$ is close to 1 so that $(1 - K)\int_L \beta(\theta_1 \mid \bar{w})\,dg(\theta_1)$ in $P$ is negligible relative to $K\alpha$ (although the value of the integral, ranging between $1 - \alpha$ and 0 in the usual case where $\beta(\theta \mid w)$ is minimum for $\theta_0$, can be of the order of $1 - \alpha$ for certain laws of the prior probability $dg(\theta_1)$).
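To see how strongly the prior weight of the hypothesis under scrutiny governs the overall frequency of errors, one can evaluate the weighted sum of the two error probabilities for a discrete prior over the alternatives. This is a minimal sketch under assumptions not found in the original: the alternatives are reduced to two points, and their prior weights and type-2 error probabilities are hypothetical numbers chosen only for illustration.

```python
def total_error(K, alpha, alternatives):
    """Overall probability of error: K * alpha + (1 - K) * sum_j g_j * b_j.

    K:            prior probability that the hypothesis under scrutiny holds.
    alpha:        type-1 error probability of the criterion.
    alternatives: list of (g_j, b_j) pairs, g_j being the prior weight of
                  alternative j (weights sum to 1) and b_j the probability of
                  accepting the scrutinized hypothesis when alternative j holds.
    """
    if abs(sum(g for g, _ in alternatives) - 1.0) > 1e-9:
        raise ValueError("prior weights of the alternatives must sum to 1")
    type2 = sum(g * b for g, b in alternatives)
    return K * alpha + (1 - K) * type2


# Hypothetical alternatives: one close to theta_0 (hard to detect), one far.
alternatives = [(0.5, 0.90), (0.5, 0.40)]
for K in (0.99, 0.5, 0.1):
    print(f"K={K:.2f}  overall error={total_error(K, 0.05, alternatives):.3f}")
```

Only when `K` is close to 1 does the overall error stay near `alpha`; for smaller `K` the type-2 term dominates, which is the point made about the implicit assumption behind minimizing the type-1 error alone.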

7. THE "CONFIDENCE INTERVALS"
The problem has been addressed in a different form by several authors, and by Mr. NEYMAN in another report (13). We shall modify the presentation of his theory by introducing prior probabilities. Let $dg(\theta)$ be the prior probability of an unknown parameter $\theta$ intervening in the probability law of the random variable under study (this parameter can vary within an interval $a \ldots b$ which we shall denote as $L$), and let $E_i$ ($i = 1, 2, \ldots, n$) be the different possible outcomes (these being mutually exclusive) of the set of possible experiments involving this random variable. For each possible $E_i$ we introduce a corresponding "estimating set" (supposed to be measurable) $\Theta_i$ contained in $L$, and we shall agree that if $E_i$ is observed, the true value of $\theta$ will be regarded as belonging to the corresponding $\Theta_i$. If $\Theta_i$ is an interval, we shall refer to it as a "confidence interval" associated to $E_i$.
(The situation in Section 6 was one where the $E_i$ were distributed only into two categories, $w$ and $\bar{w}$, and where the corresponding estimating sets are $\theta \neq \theta_0$ and $\theta = \theta_0$, thus non-overlapping; what is different now is that the estimating sets $\Theta_i$ corresponding to the different values of $i$ can overlap). Let again $\pi(E_i \mid \theta)$ denote the probability of observing $E_i$ when the parameter has value $\theta$; the total probability of observing $E_i$ is

$$\int_L \pi(E_i \mid \theta)\,dg(\theta)$$

BAYES formula gives as posterior probability that $\theta$ is not in $\Theta_i$ (i.e., that it belongs to the complementary set $L - \Theta_i$), given that $E_i$ has been observed:

$$Q_i = \frac{\int_{L - \Theta_i} \pi(E_i \mid \theta)\,dg(\theta)}{\int_L \pi(E_i \mid \theta)\,dg(\theta)}$$

consequently, the total prior probability that the rule "$\theta$ is in $\Theta_i$ when $E_i$ has been observed" leads to a false statement is:

$$\gamma = \sum_i \int_{L - \Theta_i} \pi(E_i \mid \theta)\,dg(\theta)$$

The interesting aspect of this formula is that, by choosing the $\Theta_i$ conveniently, it is possible to arrange it such that $\gamma$ is always smaller than a fixed limit, irrespective of the prior probability law $g(\theta)$ of the parameter; suppose that when $\theta$ varies in the interior of $L - \Theta_i$, $\pi(E_i \mid \theta) \leq \delta$, with $\delta$ being a limit independent of $i$, which can be reduced arbitrarily by reducing the $L - \Theta_i$; the formula of the mean then gives that

$$\gamma \leq \delta \left[ \sum_i \int_{L - \Theta_i} dg(\theta) \right]$$

and the sum inside the brackets cannot increase when the sets $L - \Theta_i$ are reduced and, hence, in particular, when $\delta$ is reduced; hence, this can be made arbitrarily small, which proves the statement. Therefore, one can always choose the $\Theta_i$ such that, without knowing anything about $g(\theta)$, it is assured that the probability that the rule adopted leads to an error is smaller than a fixed number $\varepsilon$; hence, on average, one will make mistakes in a proportion of experiments that is smaller than $\varepsilon$. Thus, one can speak of an "objective" probability of error and "independent of the prior probabilities"; however, it should be pointed out that limiting "objectively" the probability of error has a penalty in terms of reduced precision of a statement concerning $\theta$; first, by use of the rule stated, we arrive only at the statement "$\theta$ is in a given set" and not: "$\theta$ has a specific value"; then, if the objective of the experiment is to judge a specific value of $\theta$ deduced from a theory, or to obtain a numerical value permitting subsequent evaluations, this value can be examined only in the light of certain prior probabilities, as we established in Section 6.

29 Translator's Note: MALECOT probably means $\beta(\theta_0 \mid w)$.
Besides, even if one is satisfied with giving an indeterminate answer within a certain set, it must be noted that the sets $\Theta_i$ corresponding to the different results $E_i$ could have considerable overlap, and in some cases there could be a part common to all $\Theta_i$; hence, the method will often be unable to choose, after the experiment, one set from a collection of overlapping sets, but will just allow to keep after the experiment a certain number of sets from this group without being able to choose among these (perhaps even some of these sets will never be rejected, irrespective of the results!). Nevertheless, these remarks should not make us lose sight of the merit of the method, which is to provide an upper limit for the probability of error that is completely independent of the prior probabilities, a limit which will be usable only in the case where we do not know absolutely anything about the latter.
The result is extended easily, by modifying the notation slightly, to the case where all the possible results form a measurable continuum $\mathcal{E}$ in a space $R$. If one lets $\pi(E \mid \theta)\,dE$ be the probability that when the parameter has value $\theta$ a result belonging to an element with volume $dE$ is observed around a point $E$, and $\Theta(E)$ be the estimating set (supposed to be measurable) associated with $E$, and if one adopts the rule "state that when one observes $E$, then $\theta$ is in $\Theta(E)$", the prior probability that this statement will be false is:

$$\gamma = \int_{\mathcal{E}} dE \int_{L - \Theta(E)} \pi(E \mid \theta)\,dg(\theta)$$

To be more specific, let us adopt the presentation of Mr. NEYMAN, and put in brackets a generalization of his statements. Let $E$ be the experimental point (set of $N$ observations $x_1, x_2, \ldots, x_N$) describing a continuum $\mathcal{E}$ in a space $R_N$; to each value $\theta_0$ of the parameter we associate an "acceptance set" $A(\theta_0)$ "of size equal to [or larger than] $\alpha$", which by definition is a measurable set (function of $\theta_0$) of points in $R_N$ chosen such that the probability that $E$ belongs to this set, calculated under the hypothesis $\theta = \theta_0$, is equal to [or larger than] $\alpha$. Further, associate to each experimental point $E$ the set $\Theta(E)$ of values of $\theta_0$ for which $A(\theta_0)$ contains $E$; this set $\Theta(E)$ will be called "estimating set of $\theta$, with a confidence coefficient equal to [or larger than] $\alpha$". If, for each $E$ observed, we agree to state that the true value of $\theta$ is in the interior of the corresponding $\Theta(E)$, it is easy to show that the total prior probability that this rule leads to an error is independent of the prior probability of $\theta$ and is equal to [or smaller than] $1 - \alpha$.
In fact, this probability $\gamma$ is given by the above formula, that is, by a multiple integral over the domain:

$$E \in \mathcal{E}, \quad \theta \in L, \quad \theta \notin \Theta(E)$$

(because there is a logical equivalence between the two propositions: "$E$ is not a part of $A(\theta)$" and "$\theta$ is not a part of $\Theta(E)$") enabling us to write also:

$$\gamma = \int_L dg(\theta) \int_{\mathcal{E} - A(\theta)} \pi(E \mid \theta)\,dE$$

However, the integral to the right is, for any $\theta$, by definition of $A(\theta)$, smaller or equal to $1 - \alpha$, the same holding for $\gamma$, thus completing the proof. This proof puts in evidence, better than that of Mr. NEYMAN, the class of trials on which the probabilities are defined: it is the set of all possible trials from all possible values of $\theta$ distributed according to an unknown law $dg(\theta)$.
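The independence of the overall error frequency from the prior law can be observed by simulation. The sketch below is an illustration only and makes simplifying assumptions that are not in the original: Gaussian observations with a known standard deviation, the interval "mean plus or minus 1.96 standard errors" (confidence coefficient close to 0.95), and three arbitrary prior laws for the parameter, all supplied as hypothetical functions.

```python
import random


def error_rate(draw_theta, n_trials=20000, n_obs=10, sigma=1.0, z=1.96, seed=0):
    """Fraction of trials where theta falls outside mean +/- z * sigma / sqrt(n_obs).

    draw_theta: function taking a random generator and returning a value of
                theta; it plays the role of the (unknown) prior law dg(theta).
    """
    rng = random.Random(seed)
    half_width = z * sigma / n_obs ** 0.5
    errors = 0
    for _ in range(n_trials):
        theta = draw_theta(rng)  # a new parameter value for every trial
        mean = sum(rng.gauss(theta, sigma) for _ in range(n_obs)) / n_obs
        if abs(mean - theta) > half_width:
            errors += 1
    return errors / n_trials


priors = {
    "point mass": lambda rng: 3.0,
    "uniform": lambda rng: rng.uniform(-10.0, 10.0),
    "two-point": lambda rng: rng.choice([0.0, 100.0]),
}
for name, prior in priors.items():
    print(f"{name:10s} error rate = {error_rate(prior):.4f}")
```

Each prior yields an error rate close to 0.05, because the frequency is counted over the class of all trials; nothing is implied about the error rate among trials selected for having produced a particular result.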
Mr. NEYMAN makes good use of the logical equivalence between the two propositions noted above, but he does not emphasize that this does not imply the equality of their probabilities unless these are defined over the same class of trials. For example, this would not give the probability of error in the set of cases where we observe a given event $E_i$ (selection of results), because, from the formula on p. 68 30 giving $Q_i$, it would be necessary to know the prior probability of this event. If there is any conceptual confusion concerning the probability $1 - \alpha$ attached to a confidence interval, it is because there is an incomplete definition of the corresponding category of trials. It seems to us that one must see there a posterior probability of error calculated over the set of all possible trials, and independently of the prior probability of $\theta$, thus "objective". What Mr. FISHER cautiously calls "fiducial probability" is a true probability, as rightly observed by Mr. NEYMAN.

30 Translator's Note: Page 68 of the original paper.
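There is a well known application of this theory, this being the rule of "STUDENT". If the $x_i$ are $N$ observed, independent values with mean $\bar{x}$ of the same random variable following the law of LAPLACE-GAUSS with unknown expectation $\theta$, we can take as estimating set with a confidence coefficient $\alpha$ the "confidence interval"

$$\bar{x} - t\sqrt{\frac{\sum_i (x_i - \bar{x})^2}{N(N-1)}} \;\ldots\; \bar{x} + t\sqrt{\frac{\sum_i (x_i - \bar{x})^2}{N(N-1)}}$$

with $t$ linked to $\alpha$ through the formula:

$$\alpha = \frac{\Gamma\left(\frac{N}{2}\right)}{\sqrt{\pi (N-1)}\;\Gamma\left(\frac{N-1}{2}\right)} \int_{-t}^{t} \left(1 + \frac{u^2}{N-1}\right)^{-N/2} du$$

The statement that $\theta$ belongs to such an interval would give a frequency of errors equal to $1 - \alpha$, over a long series of experiments of the same type, and where there is no selection of results.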
The theory of confidence intervals can be combined with that of estimation. Often, for a parameter $\theta$ with unknown true value $\theta_0$, one possesses an estimator $E$ deduced from a large number $N$ of observations, that is correct 31, asymptotically Gaussian, and with a known standard error, which is a function of $\theta_0$, that is, $\sigma(\theta_0)$. The interval $\theta_0 - \lambda\sigma(\theta_0) \ldots \theta_0 + \lambda\sigma(\theta_0)$ is for $E$ an acceptance set of size $\alpha$ connected to the "critical coefficient" $\lambda$ by the formula

$$\alpha = \frac{2}{\sqrt{2\pi}} \int_0^{\lambda} e^{-t^2/2}\,dt$$

If, within the interval where $\theta_0$ can vary, $\sigma(\theta_0)$ admits an upper limit $\sigma$, it is seen that the interval $E - \lambda\sigma \ldots E + \lambda\sigma$, entirely determined by the observations, will be a confidence interval for $\theta$, with a confidence coefficient larger than or equal to $\alpha$.
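The link between the critical coefficient and the size of the acceptance set is the usual two-sided Gaussian probability, which can be evaluated directly. The following is a small sketch; the function names and the numerical inputs of the usage example are illustrative assumptions, not taken from the original.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def size_from_coefficient(lam):
    """alpha = P(|Z| <= lam) for a standard Gaussian variable Z."""
    return 2 * STD_NORMAL.cdf(lam) - 1


def coefficient_from_size(alpha):
    """Critical coefficient lam such that P(|Z| <= lam) = alpha."""
    return STD_NORMAL.inv_cdf((1 + alpha) / 2)


def conservative_interval(estimate, sigma_upper, alpha):
    """Interval estimate -/+ lam * sigma_upper, where sigma_upper bounds the
    standard error over the admissible parameter values, so the confidence
    coefficient is at least alpha."""
    lam = coefficient_from_size(alpha)
    return estimate - lam * sigma_upper, estimate + lam * sigma_upper


print(round(size_from_coefficient(2.0), 4))   # two standard errors
print(round(coefficient_from_size(0.95), 3))  # coefficient for alpha = 0.95
# Hypothetical estimate 0.30 with standard error bounded by 0.05:
print(conservative_interval(0.30, 0.05, 0.95))
```

Using the upper bound of the standard error widens the interval, which is the price paid for a confidence coefficient that holds whatever the true value of the parameter.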
In particular, if $E$ is the maximum likelihood estimator, hence one of those minimizing $\sigma(\theta_0)$, and if the margin of uncertainty about $\theta_0$ is small enough such that $\sigma(E)$ is not too different from $\sigma(\theta_0)$, the interval $E - \lambda\sigma(E) \ldots E + \lambda\sigma(E)$ will give a confidence interval of size $\alpha$ for $\theta$, and it will be, among all confidence intervals of size $\alpha$ derived from different C.A.G. (correct, asymptotically Gaussian) estimators, the smallest one. This is why the rule indicated has practical value, by giving a maximum reduction of the uncertainty about $\theta$ while maintaining an "objective" probability of error (besides, as suggested already in Section 2, this rule has the effect of grouping the value with maximum likelihood, very unlikely by itself, with the neighboring values; however, we have now replaced the consideration of posterior probabilities of different values, which depend on the prior probabilities, by that of the total probability of error, which does not depend on these).

31 Translator's Note: This means consistent, as seen earlier.
Nevertheless, it must be pointed out that possessing certain information about the prior probability of $\theta$ is necessary and sufficient to reduce even more the interval without increasing the probability of error $1 - \alpha$. In particular, one could not logically take a specific value of the interval without making assumptions, explicitly or not, about the prior probabilities. If, for example, an interval containing an integer value of $\theta$ has been obtained, adopting this value of $\theta$ rather than the estimate $E$ will often depend on theoretical considerations a priori (for example, if $\theta$ is the linkage coefficient $r$ defined already on page 47 32, or if it is an atomic weight).
To finish, let us give an example of a confidence interval based on a small number of observations. Suppose, with Mr. FRECHET, that from an urn with a completely unknown composition, a single ball (suppose it is white) is drawn.
What can we say then about the probability of drawing balls of the same color?
If $p$ is the (unknown) value of this probability and $f = 0$ or $1$ is the frequency of white balls that can be observed in a single draw, an acceptance set of size $\geq \alpha$ would be defined by:

$$f = 1 \ \text{if}\ p > \alpha; \qquad f = 0 \ \text{or}\ 1 \ \text{if}\ 1 - \alpha \leq p \leq \alpha; \qquad f = 0 \ \text{if}\ p < 1 - \alpha$$

The confidence intervals for $p$ with coefficient $\geq \alpha$ can be deduced to be:

$$1 - \alpha \leq p \leq 1 \ \text{if}\ f = 1; \qquad 0 \leq p \leq \alpha \ \text{if}\ f = 0$$

which implies, to clarify the ideas with $\alpha = \frac{99}{100}$, that if one repeats the experiment in a large number of urns having an arbitrary composition, and if one states each time that the prior probability of the observed result, no matter what this is, would be $\geq \frac{1}{100}$, one would be wrong in at most 1 of every 100 such trials.

On the other hand, it is impossible to bound the probability that, in the case that one observes a white ball (selection of results), one makes a mistake by stating that the probability of whites is $\geq \frac{1}{100}$ (it is evident that all urns could contain less than $\frac{1}{100}$ of whites). The criterion does not allow us to choose one among several hypotheses, these being stated before the experiment and mutually exclusive, for example, between the hypotheses $p > \alpha$, $\alpha \geq p \geq 1 - \alpha$, $p < 1 - \alpha$; it only enables us, after each experiment, to reject a single one among these 3; it does not permit us, ever, to reject the second one because this is the common part of the 2 confidence intervals. This illustrates the remarks made on page 69 33.

32 Translator's Note: The page number of the original paper.
33 Translator's Note: The page number of the original manuscript.
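The bound on the frequency of errors in the single-draw urn example can be verified exactly for any composition of the urn. This is a minimal sketch, assuming a confidence coefficient of 99/100 and the rule "state that the observed colour has probability at least 1/100"; the function name and the grid of compositions are illustrative assumptions.

```python
from fractions import Fraction


def error_probability(p, alpha):
    """Probability that the statement 'the observed colour has probability
    >= 1 - alpha' is false, for an urn with proportion p of white balls and
    a single draw."""
    threshold = 1 - alpha
    err = Fraction(0)
    if p < threshold:      # white is drawn with probability p, yet p < threshold
        err += p
    if 1 - p < threshold:  # non-white is drawn with probability 1 - p < threshold
        err += 1 - p
    return err


alpha = Fraction(99, 100)
grid = [Fraction(i, 1000) for i in range(1001)]  # urn compositions 0, 0.001, ..., 1
worst = max(error_probability(p, alpha) for p in grid)
print(worst, worst <= 1 - alpha)
```

Whatever the composition, the error probability stays below 1/100 over all draws; conditional on having drawn a white ball from urns that all contain fewer than 1/100 of whites, however, the statement would be false every time, which is the selection-of-results caveat made in the text.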
In summary, we see that the theory of confidence intervals allows us to make "objective" judgements carrying a frequency of error that is known or bounded, but only in the following form: after the experiment, discard certain intervals where the bounds depend on the results of the experiment; however, this does not permit us to choose a given value, or often to choose between one or several values fixed a priori, so it becomes indispensable (unless one refuses to make this choice) to invoke a scheme of prior probabilities formulated in a more or less clear manner. This is necessary if one wishes to take into account previous experiments, unless their benefits are dispensed with willingly, as pointed out by STUDENT in the title of one of his tables (JEFFREYS, (10), p. 310).

8. INDETERMINACY OF A SET OF HYPOTHESES
In the preceding development, it was supposed that the probability law is known perfectly once $\theta$ is fixed and, hence, that all the consequences of all such possible hypotheses can be stated. In practice, as we observed in Section 4, this is not so: the hypotheses that one can state, and their consequences, do not cover in an exhaustive manner the field of all possible hypotheses, so the sum of their probabilities, a priori or a posteriori, gives a number $< 1$; the rules that we have given lead one to make a choice between the hypotheses stated, but do not prejudge at all the probabilities of those that have not been formulated yet, and these may be appreciable, because the history of scientific theories is the history of the abandonment of old hypotheses and of the keeping of the newly formulated ones. For example, when a law $f(x, \theta)$ derived from theoretical considerations is fitted to data, it would be better to avoid suggesting, in agreement with Mr. MATHER, that all that can be extracted from the observations can be summarized in a confidence interval about $\theta$, and it should be always kept in mind that $f(x, \theta)$ may be inexact! Certainly, in general, we will be incapable of formulating precisely all alternatives to the validity of $f(x, \theta)$, but it would be prudent to reserve a non-null prior probability for these alternatives, which will avoid a situation where $f(x, \theta)$ receives a brutal refutation in the case that, subsequently, the alternatives become more plausible and their posterior probabilities increase, at the expense of that of the former! As it has been said by CLAUDE BERNARD, we should not forget that the scientist must sacrifice as many theories as needed, "like the general that has had many horses killed but that still advances".