1 Introduction

Conditional probability is, on the one hand, an intuitive concept: it captures how the original probability assignment changes when new information becomes known. On the other hand, the axiomatic definition of conditional probability is given by a formula that determines it from the original probability. The two notions are often identified, and it is postulated that the incorporation of new information alters the original probability assignment according to this formula.

As always when an axiomatic definition is applied, it is worth discussing its applicability in each case. Under the frequentist interpretation of probability there are indeed plausible reasons for such applicability. In the case of the subjective interpretation of probability as a degree of belief, typical of Bayesian statistical inference, arguments have been constructed to justify that the change in the assignment of probabilities when new information is incorporated must follow the conditional probability formula. These arguments start from a qualitative relation of the form \(A| B\succsim C| D\), meaning “A given B is qualitatively at least as probable as C given D”, satisfying certain elaborate assumptions (see [8]). It is then proved that there is one and only one probability P such that

$$\begin{aligned} A| B\succsim C| D\text { iff }\frac{P(A\cap B)}{P(B)}\ge \frac{P(C\cap D)}{P(D)}. \end{aligned}$$

This result is to be understood within measurement theory, where the representation of qualitative probability orderings of events by numerical probabilities is discussed; usually finitely additive probabilities have been considered, although completely additive probabilities have also been studied (see [11]).

We consider in this paper a different starting point to justify the applicability of the axiomatic definition of conditional probability (i.e. the conditional probability formula). The original probability measure is taken as given, and an assumption on the relation between this original probability and a possible updated conditional probability is imposed (Aristotelian Assumption, (A.A) for short). Provided that the original probability is non-atomic, it is proved that there is one and only one transformed probability measure satisfying the assumption (Theorem 7).

This result applies to Bayesian statistics. We recall that Bayesian inference relies on the use of the conditional probability formula to update probability assignments when new information is incorporated. For simplicity, we take momentarily all probability distributions to be representable in terms of densities. Suppose that \(Y=(Y_{1},...,Y_{n})\) is a random vector of n observations taking values in a sample space S. The parameter \(\theta =(\theta _{1},...,\theta _{k})\) with values in a parameter space \(\Theta \subseteq \mathbb {R}^{k}\) indexes the various possible density functions \(p(y| \theta )\) for Y; so \(p(y| \theta )\) denotes the distribution of Y when \(\theta \) is known. Bayesian statistics postulates that \(p(y| \theta )\) represents a conditional distribution following the conditional probability formula. Thus \((Y,\theta ) \) has a probability distribution (say with joint density \(p(y,\theta )\); p(y) and \(p(\theta )\) stand for the marginal densities) and

$$\begin{aligned} p(y| \theta )p(\theta )=p(y,\theta ). \end{aligned}$$
(1)

On the other hand, given the observed data \(y=(y_{1},...,y_{n})\), let \( p(\theta | y)\) denote the distribution of the parameter \(\theta \) when y is known. Bayesian statistics now postulates that \( p(\theta | y)\) represents a conditional distribution following the conditional probability formula. Thus

$$\begin{aligned} p(\theta | y)p(y)=p(y,\theta ). \end{aligned}$$
(2)

Equating (1) and (2), Bayes’ formula for the posterior distribution follows:

$$\begin{aligned} p(\theta | y)=\frac{p(y| \theta )p(\theta )}{p(y)}. \end{aligned}$$
(3)
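As a small numerical illustration of (3), the following Python sketch (the Bernoulli model, the uniform grid prior and all variable names are our own illustrative choices, not part of the argument) computes the posterior of a coin bias over a finite grid of parameter values.

```python
import numpy as np

# Illustrative only: a coin with unknown bias theta, data y = (1, 0, 1, 1)
# (1 = heads), and a uniform prior over a finite grid of theta values.
theta = np.linspace(0.01, 0.99, 99)            # grid over the parameter space
prior = np.full_like(theta, 1.0 / len(theta))  # p(theta), uniform on the grid

y = np.array([1, 0, 1, 1])
# Likelihood p(y | theta) for i.i.d. Bernoulli observations.
likelihood = theta ** y.sum() * (1.0 - theta) ** (len(y) - y.sum())

# Marginal p(y) and posterior p(theta | y), exactly as in (3).
evidence = np.sum(likelihood * prior)
posterior = likelihood * prior / evidence

assert np.isclose(posterior.sum(), 1.0)
print(theta[np.argmax(posterior)])             # posterior mode, approx. 0.75
```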

In general, Bayes’ formula for the posterior distribution is the basis of Bayesian statistics. Two hypotheses underlie this formula:

(H1) There is a joint probability measure P on \(S\times \Theta \).Footnote 1

(H2) If P(A|C) is given the interpretation “probability of event A when event C is known”, then the conditional probability formula applies:

$$\begin{aligned} P(A| C)=\frac{P(A\cap C)}{P(C)} \end{aligned}$$

for P-measurable A and C, with \(P(C)>0\).

In the Bayesian parametric model, the joint probability P is shown to be non-atomic (Proposition 9). Taking (A.A) for granted, it follows from Theorem 7 that, at least in the parametric case, condition (H2) is redundant, and only (H1) is needed for Bayes’ formula for the posterior distribution.

2 The formula of conditional probability

In this section \((\Omega ,\mathcal {A},P)\) is a probability space, where \( \Omega \) is a set, \(\mathcal {A}\) is a \(\sigma \)-algebra in \(\Omega \) and P is a (\(\sigma \)-additive) probability measure. Let \(C\in \mathcal {A}\), with \( P(C)>0\).

Definition 1

Let \((\Omega ,\mathcal {A},P)\) be a probability space and let \(C\in \mathcal {A}\) with \(P(C)>0\). The probability space \((\Omega ,\mathcal {A},P^{\prime })\) is called a pre-conditional probability space given C iff \(P^{\prime }(C)=1 \) and the following assumption holds:

(A.A) If \(A,B\in \mathcal {A}\) and \(A,B\subseteq C\), then

$$\begin{aligned} P(A)=P(B)\text { implies }P^{\prime }(A)=P^{\prime }(B). \end{aligned}$$

This definition arguably captures obvious requirements for any re-assignment of probabilities when we have the added information that the outcome is one of the elements of the event C. The requirement \(P^{\prime }(C)=1\) simply says that “the outcome is one of the elements of the event C”. Besides, the original assignment of probabilities has to have an influence on the new assignment, and not merely be thrown away; it has to be re-worked in an even-handed way, and (A.A) is in this sense a minimum requirement, expressing some sort of Aristotelian “treat like cases alike” principle.Footnote 2
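On a finite space the two requirements of Definition 1 can be checked mechanically. The following Python sketch (the dictionary representation of point masses and the function names are our own illustrative choices) tests whether a candidate \(P^{\prime }\) satisfies \(P^{\prime }(C)=1\) and (A.A).

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of the finite set s, as tuples."""
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def prob(p, event):
    """Probability of an event under a point-mass dictionary p."""
    return sum(p[w] for w in event)

def is_pre_conditional(p, p_prime, c, tol=1e-12):
    """Definition 1 on a finite space: P'(C) = 1 and (A.A) for A, B inside C."""
    if abs(prob(p_prime, c) - 1.0) > tol:
        return False
    subs = list(subsets(c))
    return all(abs(prob(p_prime, a) - prob(p_prime, b)) <= tol
               for a in subs for b in subs
               if abs(prob(p, a) - prob(p, b)) <= tol)
```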

Assumption (A.A) may even be unconstraining.

Example 2

Let \(\Omega :=\left\{ 1,2,3,4\right\} \), let \(\mathcal {A}\) be the set of all subsets of \(\Omega \), and define \(P(1){:=}\frac{1}{10}\), \(P(2){:=}\frac{3 }{10}\), \(P(3){:=}\frac{5}{10}\), \(P(4){:=}\frac{1}{10}\) and \(C{:=}\left\{ 1,2,3\right\} \). Then any probability space \((\Omega ,\mathcal {A},P^{\prime })\) is a pre-conditional probability space given C, provided that C is a support of \(P^{\prime }\) (i.e. \(P^{\prime }(C)=1\)).
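The reason (A.A) imposes nothing here is that the eight subsets of C receive pairwise distinct P-values, so the hypothesis \(P(A)=P(B)\) in (A.A) is met only when \(A=B\). A self-contained Python check of this (probabilities are written in tenths to keep the arithmetic exact):

```python
from itertools import chain, combinations

# Probabilities expressed in tenths to avoid floating-point issues.
P = {1: 1, 2: 3, 3: 5, 4: 1}   # P(1)=1/10, P(2)=3/10, P(3)=5/10, P(4)=1/10
C = (1, 2, 3)

subsets_of_C = list(chain.from_iterable(combinations(C, r)
                                        for r in range(len(C) + 1)))
values = [sum(P[w] for w in a) for a in subsets_of_C]

# The eight subset probabilities 0, 1, 3, 4, 5, 6, 8, 9 (tenths) are
# pairwise distinct, so the premise P(A) = P(B) of (A.A) never holds for
# distinct A, B contained in C: (A.A) imposes no constraint at all.
print(sorted(values))                      # [0, 1, 3, 4, 5, 6, 8, 9]
print(len(set(values)) == len(values))     # True
```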

The set function \(P(\cdot | C)\) on \(\mathcal {A}\) defined by

$$\begin{aligned} P(A| C){:=}\frac{P(A\cap C)}{P(C)} \end{aligned}$$
(4)

makes \((\Omega ,\mathcal {A},P(\cdot | C))\) into a pre-conditional probability space given C. We are interested in whether it is the unique pre-conditional probability space given C.
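Before addressing uniqueness, the following Python sketch (the numbers are our own illustrative choice, and exact fractions are used) shows, in a finite case where distinct subsets of C tie in probability, that the measure defined by (4) respects such ties, as (A.A) requires.

```python
from fractions import Fraction as F

# A finite space with a tie inside C: P(1) = P(2), so (A.A) has real content.
P = {1: F(1, 5), 2: F(1, 5), 3: F(2, 5), 4: F(1, 5)}
C = {1, 2, 3}
P_C = sum(P[w] for w in C)                       # P(C) = 4/5

# P(. | C) from formula (4), represented by its point masses.
P_cond = {w: (P[w] / P_C if w in C else F(0)) for w in P}

print(sum(P_cond[w] for w in C))                 # 1      : P'(C) = 1
print(P_cond[1], P_cond[2])                      # 1/4 1/4: equal P-masses remain equal
print(P_cond[1] + P_cond[3], P_cond[2] + P_cond[3])  # 3/4 3/4: the tie {1,3}, {2,3} is respected
```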

Remark 3

It is immediate that, if a probability space \((\Omega ,\mathcal {A},P^{\prime })\) satisfies \(P^{\prime }(C)=1\), then the following three conditions are equivalent:

  1. (i)

    \(P^{\prime }=P(\cdot | C)\), as defined in (4).

  2. (ii)

    If \(A\in \mathcal {A}\) such that \(A\subseteq C\), then

    $$\begin{aligned} P^{\prime }(A)=\frac{P(A)}{P(C)}. \end{aligned}$$
    (5)
  3. (iii)

    If \(A,B\in \mathcal {A}\) such that \(A,B\subseteq C\) and \(P(B)>0\), then \( P^{\prime }(B)>0\) and

    $$\begin{aligned} \frac{P^{\prime }(A)}{P^{\prime }(B)}=\frac{P(A)}{P(B)}. \end{aligned}$$

Recall the following definition.

Definition 4

\(A\in \mathcal {A}\) is an atom for P iff: (a) \(P(A)>0\) and (b) for every \(B\in \mathcal {A}\) with \(B\subseteq A\), either \(P(B)=0\) or \(P(B)=P(A)\). A probability measure P which has no atoms is called non-atomic.

A probability measure P is called atomic iff every \(E\in \mathcal { A}\) such that \(P(E)>0\) contains an atom. If P is a probability measure, then there exist unique probability measures \(P_{1}\) and \(P_{2}\) and \(\alpha \in \left[ 0,1\right] \) such that \(P=\alpha P_{1}+(1-\alpha )P_{2}\) and such that \(P_{1}\) is atomic and \(P_{2}\) is non-atomic (see [7] for further discussion in the general context of measures).
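For instance, if \(\delta _{0}\) denotes the unit point mass at 0 and \(\lambda \) denotes Lebesgue measure on \(\left[ 0,1\right] \), then the mixture \(P{:=}\frac{1}{3}\delta _{0}+\frac{2}{3}\lambda \) has \(\left\{ 0\right\} \) as an atom and decomposes with \(\alpha =\frac{1}{3}\), atomic part \(P_{1}=\delta _{0}\) and non-atomic part \(P_{2}=\lambda \).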

The following result is a particular case of a theorem of Sierpinski [10].

Theorem 5

Let \((\Omega ,\mathcal {A},P)\) be a probability space with P non-atomic. If \(E\in \mathcal {A}\) and \(P(E)>0\), then for every \(\alpha \in \left[ 0,P(E)\right] \) there is an element \(F\in \mathcal {A}\) with \( F\subseteq E\) and \(P(F)=\alpha \).

Induction on k directly gives the next corollary of Theorem 5 (see [9]).

Corollary 6

Let P be non-atomic, and suppose \(E\in \mathcal {A}\) such that \( P(E)>0\). Let \(\alpha _{i}\) for \(i=1,...,k\) be real numbers with \(\alpha _{i}>0\) and \(\sum \nolimits _{i=1}^{k}\alpha _{i}=P(E)\). Then E can be decomposed as a union of disjoint sets \(E_{i}\in \mathcal {A}\) with \( P(E_{i})=\alpha _{i}\) for \(i=1,...,k\).
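For a non-atomic distribution on the real line the decomposition of Corollary 6 can be realized explicitly through quantiles. The following Python sketch (the standard normal law, the event E and the masses \(\alpha _{i}\) are our own illustrative choices, and SciPy is assumed to be available) carves E into consecutive intervals of prescribed probabilities.

```python
from scipy.stats import norm

# Event E = (0, +inf) under the standard normal law; P(E) = 1/2.
p_E = 1.0 - norm.cdf(0.0)

# Prescribed masses alpha_i summing to P(E), as required in Corollary 6.
alphas = [0.1, 0.15, 0.25]
assert abs(sum(alphas) - p_E) < 1e-12

# Successive quantile cuts carve E into consecutive intervals E_i with
# P(E_i) = alpha_i; Theorem 5 provides one such piece, induction all of them.
cuts, acc = [0.0], norm.cdf(0.0)
for a in alphas:
    acc += a
    cuts.append(norm.ppf(acc))       # the last cut is +inf, since acc reaches 1

for (lo, hi), a in zip(zip(cuts[:-1], cuts[1:]), alphas):
    print((lo, hi), norm.cdf(hi) - norm.cdf(lo), a)
```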

Provided that a probability measure is non-atomic, we are going to see that any pre-conditional probability is determined by the conditional probability formula.

Theorem 7

Let \((\Omega ,\mathcal {A},P)\) be a probability space and let \(C\in \mathcal {A}\) with \(P(C)>0\). Suppose that \((\Omega ,\mathcal {A},P^{\prime })\) is a pre-conditional probability space given C. If P is non-atomic, then \(P^{\prime }=P(\cdot | C)\) as defined in (4).

Proof

Let \(A\in \mathcal {A}\) be such that \(A\subseteq C\). In order to prove (5), it can be assumed, without loss of generality, that \(P(A)>0\): if \(P(A)=0\), then \(P(A)=P(\emptyset )\) and (A.A) yields \(P^{\prime }(A)=P^{\prime }(\emptyset )=0\). The proof will be divided into three steps.

(a) Consider the case \(\frac{P(A)}{P(C)}=\frac{1}{q}\), where \(q\in \mathbb {N}\), \(q>0\).

Applying Corollary 6 to C, with \(\alpha _{i}=\frac{1}{q}P(C)\) for \( i=1,...,q\), there exist disjoint sets \(C_{1},...,C_{q}\in \mathcal {A}\) such that \(\bigcup \nolimits _{i=1}^{q}C_{i}=C\) and \(P(C_{i})=\frac{1}{q}P(C)=P(A)\) for \(i=1,...,q\). By (A.A), \(P^{\prime }(C_{i})=P^{\prime }(A)\) for \( i=1,...,q \), and thus \(P^{\prime }(A)=\frac{1}{q}P^{\prime }(\bigcup \nolimits _{i=1}^{q}C_{i})=\frac{1}{q}\). Therefore \(P^{\prime }(A)= \frac{P(A)}{P(C)}\), which is our claim.

(b) Consider the case \(\frac{P(A)}{P(C)}=\frac{p}{q}\in \mathbb {Q}\), where \( p,q\in \mathbb {N}\), \(p,q>0\), \(p\le q\).

Applying Corollary 6 to A, with \(\alpha _{i}=\frac{1}{p}P(A)\) for \( i=1,...,p\), there exist disjoint sets \(A_{1},...,A_{p}\in \mathcal {A}\) such that \(\bigcup \nolimits _{i=1}^{p}A_{i}=A\) and \(P(A_{i})=\frac{1}{p}P(A)=\frac{1 }{q}P(C)\) for \(i=1,...,p\). Since \(\frac{P(A_{i})}{P(C)}=\frac{1}{q}\) for \( i=1,...,p\), it follows from case (a) that \(P^{\prime }(A_{i})=\frac{P(A_{i}) }{P(C)}\) for \(i=1,...,p\). Therefore \(P^{\prime }(A)=P^{\prime }(\bigcup \nolimits _{i=1}^{p}A_{i})=\frac{p}{q}=\frac{P(A)}{P(C)}\).

(c) Consider the general case \(\frac{P(A)}{P(C)}=\beta \in ] 0,1]\).

There is a strictly increasing sequence \((\beta _{n})\) in \(] 0,\beta [ \cap \mathbb {Q}\) such that \(\lim \limits _{n\rightarrow \infty }\beta _{n}=\beta \). Write \(\gamma _{n}{:=}\frac{P(C)}{P(A)}\beta _{n}\) for \( n=1,2,... \); obviously \(\gamma _{n}\in ] 0,1[ \). We proceed to define inductively an increasing sequence \((A_{n})\) of sets in \(\mathcal {A}\), with \( A_{n}\subseteq A\) and \(P(A_{n})=\beta _{n}P(C)\). For \(n=1\), by Theorem 5, there is \(A_{1}\in \mathcal {A}\), \(A_{1}\subseteq A\), such that \( P(A_{1})=\gamma _{1}P(A)=\beta _{1}P(C)\). For \(n=2\), by Theorem 5, there is \(\widetilde{A}_{2}\in \mathcal {A}\), \(\widetilde{A}_{2}\subseteq (A\setminus A_{1})\), such that \(P(\widetilde{A}_{2})=\frac{\gamma _{2}-\gamma _{1}}{1-\gamma _{1}}P(A{\setminus } A_{1})\); let \(A_{2}{:=}A_{1}\cup \widetilde{A}_{2}\). We have

$$\begin{aligned} P(A_{2})=\gamma _{1}P(A)+\frac{\gamma _{2}-\gamma _{1}}{1-\gamma _{1}} (P(A)-\gamma _{1}P(A))=\gamma _{2}P(A)=\beta _{2}P(C). \end{aligned}$$

Suppose now that \(A_{1},...,A_{n}\in \mathcal {A}\) are defined, such that \( A_{i-1}\subseteq A_{i}\subseteq A\) for \(i=2,...,n\) and \(P(A_{n})=\beta _{n}P(C)\). By Theorem 5, there is \(\widetilde{A}_{n+1}\in \mathcal {A}\), \(\widetilde{A}_{n+1}\subseteq (A{\setminus } A_{n})\), such that \(P(\widetilde{ A}_{n+1})=\frac{\gamma _{n+1}-\gamma _{n}}{1-\gamma _{n}}P(A{\setminus } A_{n})\); let \(A_{n+1}{:=}A_{n}\cup \widetilde{A}_{n+1}\). We have

$$\begin{aligned} P(A_{n+1})=\gamma _{n}P(A)+\frac{\gamma _{n+1}-\gamma _{n}}{1-\gamma _{n}} (P(A)-\gamma _{n}P(A))=\gamma _{n+1}P(A)=\beta _{n+1}P(C)\text {,} \end{aligned}$$

which shows that the increasing sequence \((A_{n})\) is defined as intended. Since \(\frac{P(A_{n})}{P(C)}=\beta _{n}\in \mathbb {Q}\), it follows from case (b) that \(P^{\prime }(A_{n})=\beta _{n}\) for \(n=1,2,...\). Therefore, by continuity from below, \( P^{\prime }(\bigcup \nolimits _{n=1}^{\infty }A_{n})=\lim \nolimits _{n\rightarrow \infty }\beta _{n}=\beta \). On the other hand,

$$\begin{aligned} P(\bigcup \limits _{n=1}^{\infty }A_{n})=(\lim \limits _{n\rightarrow \infty }\beta _{n})P(C)=\beta P(C)=P(A). \end{aligned}$$

Hence, from (A.A), we have \(P^{\prime }(A)=P^{\prime }(\bigcup \nolimits _{n=1}^{\infty }A_{n})\), and so \(P^{\prime }(A)=\beta =\frac{ P(A)}{P(C)}\).
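The bookkeeping in step (c) can be traced numerically. In the following Python sketch (the choice of Lebesgue measure on \(\left[ 0,1\right] \), the sets C and A, and the rational sequence \((\beta _{n})\) are our own illustrative choices, with intervals playing the role of the sets supplied by Theorem 5), exact fractions confirm that the recursion produces \(P(A_{n})=\beta _{n}P(C)\) at every stage.

```python
from fractions import Fraction as F

# Lebesgue measure on [0, 1]; C = [0, 4/5], A = [0, 3/10], so beta = 3/8.
P_C, P_A = F(4, 5), F(3, 10)
beta = P_A / P_C

# A strictly increasing rational sequence beta_n -> beta (our choice).
betas = [beta - F(1, 10 ** n) for n in range(1, 8)]
gammas = [(P_C / P_A) * b for b in betas]        # gamma_n as in step (c)

# Follow the recursion of step (c) and check P(A_{n+1}) = beta_{n+1} P(C).
masses = [gammas[0] * P_A]                       # P(A_1) = beta_1 P(C)
for g_prev, g_next in zip(gammas, gammas[1:]):
    nxt = g_prev * P_A + (g_next - g_prev) / (1 - g_prev) * (P_A - g_prev * P_A)
    assert nxt == g_next * P_A == betas[len(masses)] * P_C
    masses.append(nxt)

print([float(m / P_C) for m in masses])          # the beta_n, increasing to 3/8
```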

Obviously (see Example 2), the condition that P be non-atomic cannot be dropped in Theorem 7.

3 Bayesian parametric inference

In standard Bayesian parametric inference we consider a probability space \( (S\times \Theta ,\mathcal {B}_{n+k},P)\), where S is a Borel set in \(\mathbb {R} ^{n}\), \(\Theta \) is a (generalized) interval in \(\mathbb {R}^{k}\), \(\mathcal {B} _{n+k}\) is the Borel \(\sigma \)-algebra on \(S\times \Theta \) and P is a (\( \sigma \)-additive) probability measure. Here S is interpreted as the sample space where the response vector Y takes values and \(\Theta \) as the parameter space, each parameter \(\theta \) determining a probability distribution for Y. Recall that the marginal distributions \( P_{Y}\) and \(P_{\theta }\) are defined by \(P_{Y}(A){:=}P(A\times \Theta )\), \( P_{\theta }(B){:=}P(S\times B)\) for the corresponding Borel sets A in S and B in \(\Theta \). In accordance with practice (see [4] and [3]; note that improper prior distributions are not being considered) we assume that in the parametric case \(P_{\theta }\) is non-atomic. We shall refer to \((S\times \Theta ,\mathcal {B}_{n+k},P)\) as the Bayesian parametric model.
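A minimal instance of the Bayesian parametric model (our own illustrative choice: \(n=k=1\), a Bernoulli response and a Beta prior, with SciPy assumed to be available) exhibits the joint density \(p(y,\theta )=p(y| \theta )p(\theta )\) on \(S\times \Theta \) with a non-atomic prior marginal \(P_{\theta }\), and recovers the marginal distribution of Y by integrating \(\theta \) out.

```python
from scipy.stats import beta
from scipy.integrate import quad

# Illustrative model: Y | theta ~ Bernoulli(theta) on S = {0, 1},
# theta ~ Beta(2, 3) on Theta = (0, 1), a non-atomic prior marginal P_theta.
prior = beta(2, 3)

def likelihood(y, theta):
    """p(y | theta) for a single Bernoulli observation."""
    return theta if y == 1 else 1.0 - theta

def joint(y, theta):
    """Joint density p(y, theta) = p(y | theta) p(theta)."""
    return likelihood(y, theta) * prior.pdf(theta)

# Marginal P_Y({1}): integrate theta out of the joint density; for a
# Beta(2, 3) prior this equals the prior mean 2 / (2 + 3) = 0.4.
p_y1, _ = quad(lambda t: joint(1, t), 0.0, 1.0)
print(p_y1)                                      # approx 0.4
```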

For a proof of the following proposition, see [1] or [5].

Proposition 8

Any atom of a Borel measure on a second countable Hausdorff space includes a singleton of positive measure.

Our last result is now immediate.

Proposition 9

Let \((S\times \Theta ,\mathcal {B}_{n+k},P)\) be the Bayesian parametric model. Then P is non-atomic.

Proof

By Proposition 8, if P had an atom, it would include a singleton \(\{(y_{0},\theta _{0})\}\) of positive measure; but then \(P_{\theta }(\{\theta _{0}\})=P(S\times \{\theta _{0}\})\ge P(\{(y_{0},\theta _{0})\})>0\), so \(\{\theta _{0}\}\) would be an atom of \(P_{\theta }\), contradicting the assumption that \(P_{\theta }\) is non-atomic.

If the Bayesian parametric model is considered as a valid formulation of a statistical problem (essentially, if \(S\times \Theta \) can be given a joint probability distribution), we conclude (taking (A.A) for granted) from Theorem 7 and Proposition 9 that (H2) follows, and thus Bayes’ formula for the posterior distribution can be applied (provided that the measure-theoretic hypotheses for the suitable representation of the probability distributions hold; see for instance [6]).