Since probability is a main tool in making all kinds of decisions, it is worth tracing how the idea and the concept of probability developed.
This post consists mainly of selected quotes from the book “Ten Great Ideas about Chance”, written by two Stanford professors, Diaconis and Skyrms. Some additional sentences and explanations are added, partly drawn from the other cited sources. I strongly recommend the book, since it is written in a very intriguing way by amazing authors!
Jacob Bernoulli proved the weak law of large numbers (WLLN) in his “Ars Conjectandi” (written around 1689 and published posthumously in 1713). Before frequentism took hold, the top mathematicians did not identify probability with frequency. For them (led by Laplace and De Morgan), probability was a rational degree of belief. What was frequency then, and what was the connection between frequency and probability? The WLLN establishes such a relationship.
What Bernoulli proved was his golden theorem, from which the WLLN follows. Suppose you are given the chance of a random event E in an experiment, and an interval I around that chance for the relative frequency of E when the experiment is repeated n times. Bernoulli derived an upper bound on the number of trials required for the observed frequency of E to lie in I with a specified probability.
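As a rough illustration of this kind of bound, here is a minimal sketch. Note that it uses Chebyshev's inequality, not Bernoulli's original (and much more laborious) estimate, and that the function names and the example numbers are my own assumptions, not anything taken from the book.

```python
import math

def chebyshev_trials(p, eps, delta):
    """Number of trials n that guarantees P(|freq - p| > eps) <= delta,
    using Chebyshev's inequality (NOT Bernoulli's original bound)."""
    return math.ceil(p * (1 - p) / (eps ** 2 * delta))

def prob_freq_in_interval(n, p, eps):
    """Exact binomial probability that the frequency k/n lies within eps of p."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - p) <= eps:
            total += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    return total

# Hypothetical numbers: chance 0.5, interval [0.4, 0.6], at least 95% certainty.
n = chebyshev_trials(0.5, 0.1, 0.05)
coverage = prob_freq_in_interval(n, 0.5, 0.1)
print(n, coverage)
```

The bound is conservative: the exact probability computed in the second step comes out well above the requested 95%. Either way, the direction of the statement is from a known chance to the frequency.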
Bernoulli's theorem is an inference from chances to frequencies, not the inverse inference from frequencies to chances. Yet Bernoulli somehow convinced himself that he had solved the inverse problem as well. How did he do so? He basically argued that with a large enough number of trials the observed frequency is (approximately) equal to the chance, and since the probability that the two quantities differ noticeably is very small, we can treat them as the same thing. But if frequency equals chance, then chance equals frequency, so the inverse problem is also solved. This is Bernoulli’s swindle, and it is a big fallacy: if one tries to formalize it, one sees that the conditional probabilities go in different directions. Bernoulli bounds the probability of the frequency given the chance, whereas the inverse problem asks for the probability of the chance given the frequency. The inverse problem was actually solved by Thomas Bayes. The mantra that we should identify relative frequencies and probabilities was repeated even in the 20th century by distinguished probability theorists like Borel, Markov and Kolmogorov.
The story with hypothesis testing is very similar. There, for example, to test whether a drug is effective, one would naturally want to know the probability that the drug is effective, given the data. Instead, when we accept this hypothesis, what we are really saying is that the probability of observing the given data, if the drug were not effective, is pretty small! Almost like in Bernoulli’s swindle…
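To see that the two directions really give different numbers, here is a small sketch with an invented drug trial. Everything in it is an assumption for illustration: the trial size, the two candidate recovery rates and especially the prior belief that the drug works.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical set-up: 20 patients, 15 of them recover.
n, k = 20, 15
p_null = 0.5           # assumed recovery rate if the drug is NOT effective
p_alt = 0.75           # assumed recovery rate if the drug IS effective
prior_effective = 0.1  # assumed prior belief that the drug works

# Direction 1: probability of data at least this extreme, given "not effective".
p_value = sum(binom_pmf(j, n, p_null) for j in range(k, n + 1))

# Direction 2 (Bayes): probability of "effective", given the observed data.
like_null = binom_pmf(k, n, p_null)
like_alt = binom_pmf(k, n, p_alt)
posterior_effective = (like_alt * prior_effective) / (
    like_alt * prior_effective + like_null * (1 - prior_effective)
)

print(p_value, posterior_effective)
```

With these made-up numbers the first quantity is small, while the probability that the drug is not effective given the data (one minus the second quantity) is far from small. The two numbers answer different questions, and the second one cannot be computed without a prior.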
John Venn was the first to give a full-length exposition of the frequentist view, in “The Logic of Chance” (1866), but he also wrestled with the exact formulation. For him, probability is the limit of the relative frequency as the number of trials goes to infinity. This idea raised some new mathematical issues. One is that there are sequences for which the limiting relative frequency does not exist: the relative frequency fluctuates and never approaches a limit. Another is that limiting frequencies may not add properly (for example, when you add an infinite number of limiting frequencies). Venn’s theory appears to be full of holes, but it is to his credit that he saw most of them himself.
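A sequence without a limiting relative frequency is easy to build. The sketch below uses one standard construction (my choice, not one taken from Venn or the book): blocks of lengths 1, 2, 4, 8, … that alternate between all zeros and all ones.

```python
def block_sequence(num_blocks):
    """0/1 sequence built from blocks of lengths 1, 2, 4, 8, ...
    that alternate between all zeros and all ones."""
    seq = []
    for j in range(num_blocks):
        seq.extend([j % 2] * 2 ** j)
    return seq

def frequencies_at_block_ends(num_blocks):
    """Relative frequency of ones measured at the end of each block."""
    seq = block_sequence(num_blocks)
    freqs, ones, end = [], 0, 0
    for j in range(num_blocks):
        ones += sum(seq[end:end + 2 ** j])
        end += 2 ** j
        freqs.append(ones / end)
    return freqs

freqs = frequencies_at_block_ends(16)
print(freqs[-2:])
```

At the end of a block of zeros the frequency of ones is close to 1/3, and at the end of a block of ones it is close to 2/3, so the running frequency keeps swinging between the two and never settles.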
Richard Von Mises
Then came Richard von Mises, who set out to put the theory of probability on a sound mathematical basis. The challenge was, of course, raised by David Hilbert (his problem number 6) at the famous congress in Paris in 1900, where special emphasis was given to the role of probability in statistical physics. Von Mises also interpreted probability as relative frequency, but only in a specific type of infinite sequence with additional properties. The first extra property was the existence of the limiting frequency (not very surprising)! But this was not enough… We want more: whenever the term probability is used, it should relate to a (limit of a) frequency. Thus, we also want to know which sequences (with frequency limits in them) can be associated with probabilities! And the mere existence of a limiting frequency does not characterise the ‘good’ sequences. Here is one sequence that has a limiting frequency, but which R. von Mises didn’t consider ‘good’: imagine a sequence of coin outcomes where heads always follows tails and tails always follows heads. The limiting frequency of heads is 50%. Can we associate a probability of 1/2 with just one such sequence, though?
One problem is that we can reorder the tosses in our sequence so that the relative frequency converges to any value in [0, 1] that we like. (If this is not obvious, consider how the relative frequency of even numbers among the positive integers, which intuitively ‘should’ converge to 1/2, can instead be made to converge to 1/4 by reordering the integers with the even numbers in every fourth place, as follows: 1, 3, 5, 2, 7, 9, 11, 4, 13, 15, 17, 6, …) So why should one ordering be privileged over the others? A way to avoid sequences suffering from this problem is to impose a requirement of randomness on the considered sequences, i.e. the relative frequencies should be invariant under the selection of subsequences in some specified manner. In other words, in our special type of ‘good’ infinite sequences, the relative frequency of an event should be the same for any infinite subsequence selected in an admissible way. Von Mises called these ‘good’ sequences Kollektivs; they have both properties, existence of the limiting relative frequency and randomness.
Another part of R. von Mises’s motivation for the randomness requirement is his understanding that any probability statement relates to an aggregate phenomenon rather than to an individual isolated event (e.g. some sequence with a given fixed ordering, considered without the other orderings and the respective subsequences). Here is another quote from him:
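The reordering of the integers described in the parentheses above can be checked directly. This is just a sketch; the function names are mine.

```python
from itertools import count, islice

def reordered_integers():
    """Yield 1, 3, 5, 2, 7, 9, 11, 4, ...: three odd numbers, then one even."""
    odds = count(1, 2)
    evens = count(2, 2)
    while True:
        for _ in range(3):
            yield next(odds)
        yield next(evens)

def even_frequency(numbers):
    """Relative frequency of even numbers in a finite list."""
    return sum(1 for x in numbers if x % 2 == 0) / len(numbers)

natural = list(range(1, 10_001))
reordered = list(islice(reordered_integers(), 10_000))
print(even_frequency(natural), even_frequency(reordered))  # 0.5 0.25
```

Every positive integer still appears exactly once in the reordered sequence, yet the relative frequency of the even numbers has dropped from 1/2 to 1/4.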
“The probability of dying within the coming year may be $p_1$ for a certain individual if he is considered as a man of age 42, and it may be $p_2 \neq p_1$ if he is considered as a man between 40 and 45, and $p_3$ if he is considered as belonging to the class of all men and women in the United States. In each case, the probability value is attached to the appropriate group of people, rather than to the individual”
In fact, with the randomness requirement, von Mises tries to capture the independence of successive tosses directly, without invoking the product rule! At this point a natural question may arise: why do we need Kollektivs at all? Why isn’t it sufficient to use the distribution (as in effect happens in Kolmogorov’s theory we will mention later) instead of the unwieldy formalism of Kollektivs? The answer is that Kollektivs are a necessary consequence of the frequency interpretation, in the sense that if one interprets probability as limiting relative frequency, then infinite series of outcomes will exhibit Kollektiv-like properties. Therefore, if one wants to axiomatise the frequency interpretation, these properties have to be built in.
Yet the next logical question is: what are the allowed ways to select subsequences in a Kollektiv? Or, equivalently, what is a truly random sequence?
For von Mises, the property we call randomness can be explicated – or even defined – by the impossibility of devising a successful system of gambling:
“A boy repeatedly tossing a dime supplied by the U. S. mint is quite sure that his chance of throwing “heads” is ½. But he knows more than that. He also knows that if, in tossing the coin under normal circumstances, he disregards the second, fourth, sixth, …, turns, his chance of “heads” among the outcomes of the remaining turns is still ½. He knows – or else he will soon learn – that in playing with his friends, he cannot improve his chance by selecting the turns in which he participates. His chance of “heads” remains unaltered even if he bets on “heads” only after “tails” has shown up three times in a row, etc. This particular feature of the sequence of experiments appearing here and in similar examples is called randomness.”
The gambling-system criterion is somewhat related to the understanding of probability as a subjective quantity, i.e. not something related to sequences, etc., but a degree of belief. Subjective probabilities are traditionally analyzed in terms of betting behavior. The reason is that if one wants to define a degree of belief, one good try is the following:
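The boy's experience can be imitated in a small simulation. This is only a sketch: a seeded pseudo-random generator stands in for the coin (it is of course not a Kollektiv in von Mises' sense), and the three selection rules and their names are my own illustrations of the "systems" mentioned in the quote.

```python
import random

def heads_frequency(tosses, rule):
    """Frequency of heads (1) among the tosses picked by `rule`.
    `rule(tosses, i)` may look only at outcomes before position i."""
    picked = [tosses[i] for i in range(len(tosses)) if rule(tosses, i)]
    return sum(picked) / len(picked)

def every_toss(tosses, i):
    return True

def odd_turns_only(tosses, i):
    # Disregard the second, fourth, sixth, ... turns (i is 0-based).
    return i % 2 == 0

def after_three_tails(tosses, i):
    # Bet only when the previous three outcomes were all tails (0).
    return i >= 3 and tosses[i - 3:i] == [0, 0, 0]

rng = random.Random(0)  # pseudo-random stand-in for a fair coin
tosses = [rng.randint(0, 1) for _ in range(200_000)]
for rule in (every_toss, odd_turns_only, after_three_tails):
    print(rule.__name__, round(heads_frequency(tosses, rule), 3))
```

All three frequencies should come out close to ½. The important design point is that each rule decides whether to bet on toss i using only the outcomes before i.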
Your degree of belief in E is p iff p units of utility is the price at which you would buy or sell a bet that pays 1 unit of utility if E, 0 if not E.
But still, can we allow all possible subsequences when requiring the limiting relative frequency to stay the same? Apparently not, because for any given binary sequence we can always select the subsequence of those positions where we have 1s. The relative frequency of the 1s in this particular subsequence won’t be the same as in the original sequence (unless we don’t have any 0s in the initial sequence). Thus, we cannot really construct a Kollektiv explicitly, because once we have it, one can create such a ‘bad’ subsequence. To this objection, von Mises answered that Kollektivs are new mathematical objects, not constructible from previously defined objects, i.e. they are not to be thought of as known objects such as numbers.
Interestingly, Richard von Mises had a brother called Ludwig von Mises, who proposed his own theory of probability (see the paper by Crovelli in the references), but it didn’t become that popular.
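The ‘bad’ selection is trivial to write down once the sequence is known, which is exactly the problem. A tiny sketch, with an arbitrary example sequence of my own:

```python
def ones_only_subsequence(seq):
    """Select exactly those positions of a given 0/1 sequence that hold a 1.
    The rule is tailored to the sequence after the fact: it looks at the
    outcome itself, not just at earlier outcomes."""
    return [x for x in seq if x == 1]

seq = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]  # any fixed binary sequence
sub = ones_only_subsequence(seq)
print(sum(seq) / len(seq), sum(sub) / len(sub))  # 0.5 1.0
```

Compare this with a gambling system, where the decision to bet on a toss may depend only on what happened before that toss.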
There are important differences between Richard and Ludwig von Mises’s respective views about randomness, or “indeterminism.” Richard von Mises was heavily influenced by the work of Heisenberg, whose work was interpreted by Richard to have established the basic indeterminism of the world at both the macrophysical and microphysical levels. This view of the world as inherently indeterministic allows Richard to take the position that probabilities are objective “physical properties” of things in the world:
“The probability of a 6 is a physical property of a given die and is a property analogous to its mass, specific heat, or electrical resistance. Similarly, for a given pair of dice (including of course the total setup) the probability of a ‘double 6’ is a characteristic property, a physical constant belonging to the experiment as a whole and comparable with all its other physical properties”
Ludwig von Mises, on the other hand, does not follow his brother down this indeterministic road. In the first place, Ludwig was a determinist, who held that everything that occurs in the world has a prior cause.
Going back to the ideas of R. von Mises, he held that in order to understand Kollektivs, one should always bear in mind the analogy with the idealized objects in geometry. Indeed, just as the point, the line, the circle, etc. are idealised objects, so are the Kollektivs. For example, just as in practice you cannot have 2 points in the plane that are exactly at some given distance d, in the same way you cannot point out a sequence that is an exact Kollektiv!
But if such idealisations are permissible, can we then use an even more idealised notion of probability that will make our life easier? Could we not just have objective chances: idealised quantities assigned to any physical situation or experiment that show the tendency (or ‘propensity’) of a certain outcome to be observed? This is what is now called the ‘propensity view’ of probability. As Sir Karl Popper stated (he came up with a propensity theory independently of Charles Peirce, who was first), the outcome of a physical experiment is produced by a certain set of “generating conditions”. When we repeat an experiment, as the saying goes, we really perform another experiment with a (more or less) similar set of generating conditions. Thus, we may look at chances as quantities related to physical experiments that have objective existence in the world.
Such a view needed a framework, and here came the measure-theoretic framework of Kolmogorov. The main contribution of the framework is that it looks at random quantities (variables) as measurable functions, the theory of which was developed ~30 years prior to Kolmogorov by Borel and Lebesgue. The book by Diaconis and Skyrms cites some very interesting words of Mark Kac, a renowned probabilist, who said that back in 1933-34 he was wondering what exactly the random quantities were that he read about in a work by A. Markov. Kac knew everything about measure theory, but in the 1930s people still hadn’t internalised the connection.
The mathematical object that Kolmogorov used to study probability is the one we are all familiar with from our undergrad classes. It is a triple $(\Omega, \mathcal{F}, P)$
related to an experiment, where $\Omega$ is the set of possible outcomes, $\mathcal{F}$ is a set of subsets of $\Omega$ (those things that have probabilities) and $P$ is a non-negative real-valued function on $\mathcal{F}$.
This triple is called a probability space, and $\mathcal{F}$ and $P$ have some additional properties: $\mathcal{F}$ is closed under taking complements and under countably many unions and intersections, and $P$ is countably additive, with $P(\Omega) = 1$. So the random quantities (or variables) are no longer mysterious objects: they are just measurable functions!
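For a finite experiment the whole triple fits in a few lines. The sketch below is a toy illustration under my own assumptions (one roll of a fair die, with every subset counted as an event); in a finite space measurability is automatic, so it only shows the bookkeeping, not the measure-theoretic subtleties.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical experiment: one roll of a fair die.
omega = frozenset(range(1, 7))

# The family of events: here simply all subsets of omega (the power set).
events = [frozenset(c) for r in range(len(omega) + 1)
          for c in combinations(sorted(omega), r)]

def P(event):
    """Uniform probability measure on omega."""
    return Fraction(len(event), len(omega))

def X(outcome):
    """A random variable is just a function on omega: 1 if the roll is even."""
    return 1 if outcome % 2 == 0 else 0

# P(X = 1) is the probability of the event {w in omega : X(w) = 1}.
x_is_one = frozenset(w for w in omega if X(w) == 1)
print(P(omega), P(x_is_one))  # 1 1/2
```

The point of the last two lines is that a statement about the random variable X is translated into a statement about an event, and only events carry probabilities.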
To modern eyes, Kolmogorov’s axioms look very simple, and one may well wonder why it took such a long time for probability theory to mature. One reason appears to be that probability was considered to be a branch of mathematical physics (this is how Hilbert presented it), so it was not immediately apparent which part of the real world should be incorporated in the axioms.
Another main contribution of Kolmogorov here was that he properly formalized conditional probability.
The Geneva conference
In 1937, the University of Geneva organized a conference on the theory of probability, where the focal point of the discussion was von Mises’ axiomatisation of probability theory, and especially its relation to the newly published axiomatisation by Kolmogorov. An excellent reading here is the paper by van Lambalgen (see the references), where the arguments of the 2 sides are explained in much more detail.
In summary, as a result of this conference, the framework of Kolmogorov was established as the standard framework for considering probabilities. A crucial role in the debate was played by an example of Jean Ville, who showed that the law of the iterated logarithm cannot be derived within the theory of von Mises.
However, there are still people who are proponents of the frequentist view and it seems that taking a position in this debate is also a matter of philosophical preferences!
- P. Diaconis, B. Skyrms, “Ten Great Ideas about Chance”
- M. Crovelli, “A Challenge to Ludwig von Mises’s Theory of Probability”, https://mises-media.s3.amazonaws.com/-2-23_2.pdf
- M. van Lambalgen, “Randomness and Foundations of Probability: Von Mises’ Axiomatisation of Random Sequences”, https://pdfs.semanticscholar.org/853a/5cdd7c2e443f898dca230d31ac4556970d76.pdf
- A. Hájek, “Interpretations of Probability”, The Stanford Encyclopedia of Philosophy, https://plato.stanford.edu/entries/probability-interpret/#CriAdeForIntPro