Spitzer’s Formula

Spitzer’s formula is a remarkable result giving the precise joint distribution of the maximum and terminal value of a random walk in terms of the marginal distributions of the process. I have already covered the use of the reflection principle to describe the maximum of Brownian motion, and the same technique can be used for simple symmetric random walks which have a step size of ±1. What is remarkable about Spitzer’s formula is that it applies to random walks with any step distribution.

We consider partial sums

$\displaystyle S_n=\sum_{k=1}^n X_k$

for an independent identically distributed (IID) sequence of real-valued random variables X₁, X₂, …. This ranges over index n = 0, 1, … starting at S₀ = 0 and has running maximum

$\displaystyle R_n=\max_{k=0,1,\ldots,n}S_k.$

Spitzer’s theorem is typically stated in terms of characteristic functions, giving the distributions of (R_n, S_n) in terms of the distributions of the positive and negative parts, S_n⁺ and S_n^–, of the random walk.

Theorem 1 (Spitzer) For α, β ∈ ℝ,

$\displaystyle \sum_{n=0}^\infty \phi_n(\alpha,\beta)t^n=\exp\left(\sum_{n=1}^\infty w_n(\alpha,\beta)\frac{t^n}{n}\right)$ (1)

where ϕ_n, w_n are the characteristic functions

$\displaystyle \begin{aligned} \phi_n(\alpha,\beta)&={\mathbb E}\left[e^{i\alpha R_n+i\beta(R_n-S_n)}\right],\\ w_n(\alpha,\beta)&={\mathbb E}\left[e^{i\alpha S_n^++i\beta S_n^-}\right]\\ &={\mathbb E}\left[e^{i\alpha S_n^+}\right]+{\mathbb E}\left[e^{i\beta S_n^-}\right]-1. \end{aligned}$

As characteristic functions are bounded by 1, the infinite sums in (1) converge for |t|< 1. However, convergence is not really necessary to interpret this formula, since both sides can be considered as formal power series in indeterminate t, with equality meaning that coefficients of powers of t are equated. Comparing powers of t gives

$\displaystyle \begin{aligned} \phi_1&=w_1,\\ \phi_2&=\frac{w_1^2}2+\frac{w_2}2,\\ \phi_3&=\frac{w_1^3}6+\frac{w_1w_2}2+\frac{w_3}3, \end{aligned}$

(2)

and so on.
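The coefficient comparison in (2) can be carried out mechanically. Writing a_n for the coefficients on the left of (1) and b_n for those in the exponent, differentiating both sides in t gives the recurrence n a_n = Σ_{k=1}^n b_k a_{n-k}. The following Python sketch applies this with exact rational arithmetic; the function name `exp_series_coefficients` and the test values standing in for w₁, w₂, w₃ are illustrative choices, not taken from the post.

```python
from fractions import Fraction

def exp_series_coefficients(b, n_max):
    """Coefficients a_0..a_n_max of exp(sum_{n>=1} b[n] t^n / n).

    b is a sequence with b[0] unused, so that b[n] matches b_n in the text.
    Uses n*a_n = sum_{k=1}^n b_k a_{n-k}, obtained by differentiating in t.
    """
    a = [Fraction(1)]
    for n in range(1, n_max + 1):
        a.append(sum(Fraction(b[k]) * a[n - k] for k in range(1, n + 1)) / n)
    return a

# Arbitrary test values standing in for w_1, w_2, w_3 (not characteristic functions).
w = [None, Fraction(2), Fraction(3), Fraction(5)]
a = exp_series_coefficients(w, 3)

# Compare with the closed forms listed in (2).
assert a[1] == w[1]
assert a[2] == w[1] ** 2 / 2 + w[2] / 2
assert a[3] == w[1] ** 3 / 6 + w[1] * w[2] / 2 + w[3] / 3
```

The same recurrence works for any ring of coefficients, so in principle it also produces ϕ_n from w₁, …, w_n at fixed (α, β).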

Spitzer’s theorem in the form above describes the joint distribution of the nonnegative random variables (R_n, R_n - S_n) in terms of the nonnegative variables (S_n⁺, S_n^–). While this does have a nice symmetry, it is often more convenient to look at the distribution of (R_n, S_n) in terms of (S_n⁺, S_n), which is achieved by replacing α with α + β and β with –β in (1). This gives a slightly different, but equivalent, version of the theorem.

Theorem 2 (Spitzer) For α, β ∈ ℝ,

$\displaystyle \sum_{n=0}^\infty \tilde\phi_n(\alpha,\beta)t^n=\exp\left(\sum_{n=1}^\infty \tilde w_n(\alpha,\beta)\frac{t^n}{n}\right)$ (3)

where ϕ̃_n, w̃_n are the characteristic functions

$\displaystyle \begin{aligned} \tilde\phi_n(\alpha,\beta)&={\mathbb E}\left[e^{i\alpha R_n+i\beta S_n}\right],\\ \tilde w_n(\alpha,\beta)&={\mathbb E}\left[e^{i\alpha S_n^++i\beta S_n}\right]. \end{aligned}$

Taking β = 0 in either (1) or (3) gives the distribution of R_n in terms of S_n⁺,

$\displaystyle \sum_{n=0}^\infty {\mathbb E}\left[e^{i\alpha R_n}\right]t^n=\exp\left(\sum_{n=1}^\infty {\mathbb E}\left[e^{i\alpha S_n^+}\right]\frac{t^n}n\right)$

(4)

I will give a proof of Spitzer’s theorem below. First, though, let’s look at some consequences, starting with the following strikingly simple result for the expected maximum of a random walk.

Corollary 3 For each n ≥ 0,

$\displaystyle {\mathbb E}[R_n]=\sum_{k=1}^n\frac1k{\mathbb E}[S_k^+].$ (5)

Proof: As R_n ≥ S₁⁺ = X₁⁺, if X_k⁺ have infinite mean then both sides of (5) are infinite. On the other hand, if X_k⁺ have finite mean then so do S_n⁺ and R_n. Using the fact that the derivative of the characteristic function of an integrable random variable at 0 is just i times its expected value, compute the derivative of (4) at α = 0,

$\displaystyle \begin{aligned} \sum_{n=0}^\infty i{\mathbb E}[R_n]t^n &=\exp\left(\sum_{n=1}^\infty \frac{t^n}n\right)\sum_{n=1}^\infty i{\mathbb E}[S_n^+]\frac{t^n}n\\ &=\exp\left(-\log(1-t)\right)\sum_{n=1}^\infty i{\mathbb E}[S_n^+]\frac{t^n}n\\ &=(1-t)^{-1}\sum_{n=1}^\infty i{\mathbb E}[S_n^+]\frac{t^n}n\\ &=(1+t+t^2+\cdots)\sum_{n=1}^\infty i{\mathbb E}[S_n^+]\frac{t^n}n. \end{aligned}$

Equating powers of t gives the result. ⬜
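Identity (5) is easy to test by brute force on small cases. The sketch below is only a sanity check under the assumption that each step X_k is uniform on a short tuple of integers (the names `walk_expectations` and `check_corollary` are hypothetical). It enumerates every path and compares both sides of (5) exactly.

```python
from fractions import Fraction
from itertools import product

def walk_expectations(n, steps):
    """Return (E[R_n], E[S_n^+]) when each X_k is uniform on `steps`.

    Computed exactly by enumerating all len(steps)**n equally likely paths.
    """
    total_max = 0
    total_pos = 0
    count = 0
    for path in product(steps, repeat=n):
        s = 0  # current partial sum S_k
        r = 0  # running maximum R_k, starting from S_0 = 0
        for x in path:
            s += x
            r = max(r, s)
        total_max += r
        total_pos += max(s, 0)
        count += 1
    return Fraction(total_max, count), Fraction(total_pos, count)

def check_corollary(n_max, steps):
    """Check E[R_n] = sum_{k<=n} E[S_k^+]/k for n = 0..n_max."""
    rhs = Fraction(0)
    for n in range(0, n_max + 1):
        e_max, e_pos = walk_expectations(n, steps)
        if n >= 1:
            rhs += e_pos / n
        if e_max != rhs:
            return False
    return True

print(check_corollary(8, (-1, 1)))     # simple symmetric walk
print(check_corollary(6, (-2, 1, 3)))  # an asymmetric integer step distribution
```

Enumeration grows exponentially in n, so this is only practical for short walks, but it does illustrate that the identity does not depend on the step distribution being symmetric.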

The expression for the distribution of R_n in terms of S_n⁺ might not be entirely intuitive at first glance. Sure, it describes the characteristic functions and, hence, determines the distribution. However, we can describe it more explicitly. As suggested by the evaluation of the first few terms in (2), each ϕ_n is a convex combination of products of the w_n. As is well known, the characteristic function of the sum of random variables is equal to the product of their characteristic functions. Also, if we select a random variable at random from a finite set, then its characteristic function is a convex combination of those of the individual variables with coefficients corresponding to the probabilities in the random choice. So, (3) expresses the distribution of (R_n, S_n) as a random choice of sums of independent copies of (S_k⁺, S_k).

In fact, expressions such as (1,3) are common in many branches of maths, such as zeta functions associated with curves over finite fields. We have a power series which can be expressed in two different ways,

$\displaystyle \sum_{n=0}^\infty a_nt^n = \exp\left(\sum_{n=1}^\infty b_n\frac{t^n}n\right).$

The left hand side is the generating function of the sequence a_n. The right hand side is a kind of zeta function associated with the sequence b_n, and is sometimes referred to as the combinatorial zeta function. The logarithmic derivative gives Σ_n b_n t^{n-1}, which is the generating function of b_{n+1}.

Tomaszewski’s Conjecture

In a 1986 article in The American Mathematical Monthly written by Richard Guy, the following question was asked, and attributed to Bogusław Tomaszewski: Consider n real numbers a₁, …, a_n such that Σ_ia_i² = 1. Of the 2ⁿ expressions |±a₁±⋯±a_n|,

can there be more with value > 1 than with value ≤ 1?

A cursory attempt to find such real numbers a_i where more of the absolute signed sums have value > 1 than have value ≤ 1 should be enough to convince you that it is, in fact, impossible. The answer was therefore expected to be no, it is not possible. This claim has since been known as Tomaszewski’s conjecture, and there have been many proofs of weaker versions over the years until, finally, in 2020, it was proved by Keller and Klein in the paper Proof of Tomaszewski’s Conjecture on Randomly Signed Sums.

An alternative formulation is in terms of Rademacher sums

$\displaystyle Z=a_1X_1+a_2X_2+\cdots+a_nX_n$

(1)

where X₁, …, X_n are independent ‘random signs’. That is, they have the Rademacher distribution ℙ(X_i = 1) = ℙ(X_i = -1) = 1/2. Then, Z has variance Σ_ia_i² and each of the 2ⁿ values ±a₁±⋯±a_n occurs with equal probability. So, Tomaszewski’s conjecture is the statement that

$\displaystyle {\mathbb P}(\lvert Z\rvert\le1)\ge{\mathbb P}(\lvert Z\rvert > 1)$

(2)

for unit variance Rademacher sums Z. It is usually stated in the equivalent, but more convenient form

$\displaystyle {\mathbb P}(\lvert Z\rvert\le1)\ge1/2.$

(3)
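For a short coefficient sequence, the probability in (3) can be computed exactly by enumerating all 2ⁿ sign choices. The following Python sketch does this without normalising the coefficients, by comparing (±a₁±⋯±a_n)² against Σ_ia_i². The function name `prob_abs_at_most_one` and its `strict` flag are illustrative, and integer or rational coefficients are assumed so that the comparison is exact.

```python
from fractions import Fraction
from itertools import product

def prob_abs_at_most_one(a, strict=False):
    """P(|Z| <= 1), or P(|Z| < 1) if strict, for Z = (a . X) / ||a||_2.

    Avoids square roots by testing (a . X)^2 against ||a||_2^2 directly.
    """
    norm_sq = sum(x * x for x in a)
    hits = 0
    for signs in product((-1, 1), repeat=len(a)):
        z = sum(s * x for s, x in zip(signs, a))
        if z * z < norm_sq or (not strict and z * z == norm_sq):
            hits += 1
    return Fraction(hits, 2 ** len(a))

for a in [(1, 1), (1, 1, 1), (1, 1, 1, 1), (3, 2, 1)]:
    print(a, prob_abs_at_most_one(a), prob_abs_at_most_one(a, strict=True))
```

With four equal coefficients the non-strict probability is 7/8 while the strict one is 3/8, which shows why the inequality in (3) cannot be made strict.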

I will discuss Tomaszewski’s conjecture and the ideas central to the proof given by Keller and Klein. I will not give a full derivation here. That would get very tedious, as evidenced both by the length of the quoted paper and by its use of computer assistance. However, I will prove the ‘difficult’ cases, which make use of the tricks essential to Keller and Klein’s proof, with all remaining cases being, in theory, provable by brute force. In particular, I give a reformulation of the inductive stopping time argument that they used. This is a very ingenious trick that was introduced by Keller and Klein, and describing this is one of the main motivations for this post. Another technique also used in the proof is based on the reflection principle, in addition to some tricks discussed in the earlier post on Rademacher concentration inequalities.

To get a feel for Rademacher sums, some simple examples are shown in figure 1. I use the notation a = (a₁, …, a_n) to represent the sequence with first n terms given by a_i, and any remaining terms equal to zero. The plots show the successive partial sums for each sequence of values of the random signs (X₁, X₂, …), with the dashed lines marking the ±1 levels.

The examples demonstrate that ℙ(|Z| ≤ 1) can achieve the bound of 1/2 in some cases, and be strictly more in others. The top-left and bottom-right plots show that, for certain coefficients, |Z| has a positive probability of being exactly equal to 1 and, furthermore, the claimed bound fails for ℙ(|Z|< 1). So, the inequality is optimal in a couple of ways. These examples concern a small number of nonzero coefficients. In the other extreme, for a large number of small coefficients, the central limit theorem says that Z is approximately a standard normal and ℙ(|Z| ≤ 1) is close to Φ(1) – Φ(-1) ≈ 0.68.

Rademacher Concentration Inequalities

Concentration inequalities place lower bounds on the probability of a random variable being close to a given value. Typically, they will state something along the lines that a variable Z is within a distance x of value μ with probability at least p,

$\displaystyle {\mathbb P}(\lvert Z-\mu\rvert\le x)\ge p.$

(1)

Although such statements can be made in more general topological spaces, I only consider real valued random variables here. Clearly, (1) is the same as saying that Z is greater than distance x from μ with probability no more than q = 1 – p. We can express concentration inequalities either way round, depending on what is convenient. Also, the inequality signs in expressions such as (1) may or may not be strict. A very simple example is Markov’s inequality,

$\displaystyle {\mathbb P}(\lvert Z\rvert\ge x)\le\frac{{\mathbb E}\lvert Z\rvert}{x}.$

In the other direction, we also encounter anti-concentration inequalities, which place lower bounds on the probability of a random variable being at least some distance from a specified value, so take the form

$\displaystyle {\mathbb P}(\lvert Z-\mu\rvert\ge x)\ge p.$

(2)

An example is the Paley-Zygmund inequality,

$\displaystyle {\mathbb P}(\lvert Z\rvert > x)\ge\frac{({\mathbb E}\lvert Z\rvert-x)^2}{{\mathbb E}[Z^2]}$

which holds for all 0 ≤ x ≤ 𝔼|Z|.

While the examples given above of the Markov and Paley-Zygmund inequalities are very general, applying whenever the required moments exist, they are also rather weak. For restricted classes of random variables much stronger bounds can often be obtained. Here, I will be concerned with optimal concentration and anti-concentration bounds for Rademacher sums. Recall that these are of the form

$\displaystyle Z=a\cdot X=\sum_{n=1}^\infty a_nX_n$

for IID random variables X = (X₁, X₂, …) with the Rademacher distribution, ℙ(X_n = 1) = ℙ(X_n = -1) = 1/2, and a = (a₁, a₂, …) is a square-summable sequence. This sum converges to a limit with zero mean and variance

$\displaystyle {\mathbb E}[Z^2]=\lVert a\rVert_2^2=\sum_na_n^2.$

I discussed such sums at length in the posts on Rademacher series and the Khintchine inequality, and have been planning on making this follow-up post ever since. In fact, the L⁰ Khintchine inequality was effectively the same thing as an anti-concentration bound. It was far from optimal as presented there, and relied on the rather inefficient Paley-Zygmund inequality for the proof. Recently, though, a paper was posted on arXiv claiming to confirm conjectured optimal anti-concentration bounds which I had previously mentioned on mathoverflow. See Tight lower bounds for anti-concentration of Rademacher sums and Tomaszewski’s counterpart problem by Lawrence Hollom and Julien Portier.

While the form of the tight Rademacher concentration and anti-concentration bounds may seem surprising at first, being piecewise constant and jumping between rather arbitrary looking rational values at seemingly arbitrary points, I will explain why this is so. It is actually rather interesting and has been a source of conjectures over the past few decades, some of which have now been proved and some which remain open. Actually, as I will explain, many tight bounds can be proven in principle by direct computation, although it would be rather numerically intensive to perform in practice. In fact, some recent results — including those of Hollom and Portier mentioned above — were solved with the aid of a computer to perform the numerical legwork.

Anti-Concentration Bounds

For a Rademacher sum Z of unit variance, recall from the post on the Khintchine inequality that the anti-concentration bound

$\displaystyle {\mathbb P}(\lvert Z\rvert\ge x)\ge(1-x^2)^2/3.$

(3)

holds for all non-negative x ≤ 1. This followed from applying Paley-Zygmund to Z², together with the simple Khintchine inequality 𝔼[Z⁴] ≤ 3. However, this is sub-optimal and is especially bad in the limit as x increases to 1, where the bound tends to zero whereas, as we will see, the optimal bound remains strictly positive.

In the other direction if, for positive integer n, we choose coefficients a ∈ ℓ² with a_k = 1/√n for k ≤ n and zero elsewhere then, by the central limit theorem, Z = a·X tends to a standard normal distribution as n becomes large. Hence

$\displaystyle {\mathbb P}(\lvert Z\rvert \ge x)\rightarrow2\Phi(-x)$

where Φ is the cumulative normal distribution function. So, any anti-concentration bound must be no more than this.

The optimal anti-concentration bounds have been open conjectures for a while, but are now proved and described by theorem 1 below, as plotted in figure 1. They are given by a piecewise constant function and, as is clear from the plot, lie strictly between the simple Paley-Zygmund bound and Gaussian probabilities.

Theorem 1 The optimal lower bound p for the inequality ℙ(|Z| ≥ x) ≥ p for Rademacher sums Z of unit variance is,

$\displaystyle p=\begin{cases} 1,&{\rm for\ }x=0,\\ 1/2,&{\rm for\ }0 < x\le1/\sqrt7,\\ 29/64,&{\rm for\ }1/\sqrt7 < x \le1/\sqrt5,\\ 3/8,&{\rm for\ }1/\sqrt5 < x\le1/\sqrt3,\\ 1/4,&{\rm for\ }1/\sqrt3 < x\le2/\sqrt6,\\ 7/32,&{\rm for\ }2/\sqrt6 < x\le 1,\\ 0,&{\rm for\ }1 < x. \end{cases}$
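To see numerically how the three quantities compare, the following Python sketch evaluates the Paley-Zygmund bound (3), the piecewise constant value p of theorem 1, and the Gaussian limit 2Φ(-x) on a grid of points in (0, 1]. It is a floating point illustration only; the function names are hypothetical, and 2Φ(-x) is computed as `erfc(x/√2)` from the standard library.

```python
from math import erfc, sqrt

def paley_zygmund_bound(x):
    """The bound (1 - x^2)^2 / 3 from (3), valid for 0 <= x <= 1."""
    return (1 - x * x) ** 2 / 3

def optimal_bound(x):
    """The piecewise constant value p from theorem 1, for x >= 0."""
    if x == 0:
        return 1.0
    if x <= 1 / sqrt(7):
        return 1 / 2
    if x <= 1 / sqrt(5):
        return 29 / 64
    if x <= 1 / sqrt(3):
        return 3 / 8
    if x <= 2 / sqrt(6):
        return 1 / 4
    if x <= 1:
        return 7 / 32
    return 0.0

def gaussian_tail(x):
    """2 * Phi(-x), i.e. P(|N| >= x) for a standard normal N."""
    return erfc(x / sqrt(2))

# On a grid in (0, 1], the optimal bound sits strictly between the other two.
for i in range(1, 1001):
    x = i / 1000
    assert paley_zygmund_bound(x) < optimal_bound(x) < gaussian_tail(x)
```

The grid points avoid the irrational jump locations, so rounding in the comparisons `x <= 1 / sqrt(7)` and so on does not affect the check.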

At first sight, this result might seem a little strange. Why do the optimal bounds take this discrete set of values, and why does it jump at these arbitrary seeming values of x? To answer that, consider the distribution of a Rademacher sum. When all coefficients are small it approximates a standard normal and the anti-concentration probabilities approach those indicated by the ‘Gaussian bound’ in figure 1. However, these are not optimal, and the minimal probabilities are obtained at the opposite extreme with a small number n of relatively large coefficients and the remaining being zero. In this case, the distribution is finite with probabilities being multiples of 2^–n, and the bound jumps when x passes through the discrete levels.

The values of a ∈ ℓ² for which the stated bounds are achieved are not hard to find. For convenience, I use (a₁, a₂, …, a_n) to represent the sequence starting with the stated values and remaining terms being zero, a_k = 0 for k > n. Also, if c is a numeral then c_k will denote repeating this value k times.

Lemma 2 The optimal lower bound stated by theorem 1 for Rademacher sum Z = a·X is achieved with

$\displaystyle a=\begin{cases} (1),&{\rm for\ }x=0,\\ (1_2)/\sqrt2,&{\rm for\ }0 < x\le1/\sqrt7,\\ (1_7)/\sqrt7,&{\rm for\ }1/\sqrt7 < x\le1/\sqrt5,\\ (1_5)/\sqrt5,&{\rm for\ }1/\sqrt5 < x\le1/\sqrt3,\\ (1_3)/\sqrt3,&{\rm for\ }1/\sqrt3 < x\le2/\sqrt6,\\ (1_6)/\sqrt6,&{\rm for\ }2/\sqrt6 < x\le1,\\ (1),&{\rm for\ }1 < x.\\ \end{cases}$

This is straightforward to verify by simply counting the number of sign values of (X₁, …, X_n) for which |a·X| ≥ x and multiplying by 2^–n. It does however show that it is impossible to do better than theorem 1 so that, if the bounds hold, they must be optimal. Also, as ℙ(|Z| ≥ x) is decreasing in x, to establish the result it is sufficient to show that the bounds hold at the values of x where it jumps. This reduces theorem 1 to the following finite set of inequalities.

Theorem 3 A Rademacher sum Z of unit variance satisfies,

$\displaystyle \begin{aligned} &{\mathbb P}(\lvert Z\rvert\ge1/\sqrt7)\ge1/2,\\ &{\mathbb P}(\lvert Z\rvert\ge1/\sqrt5)\ge29/64,\\ &{\mathbb P}(\lvert Z\rvert\ge1/\sqrt3)\ge3/8,\\ &{\mathbb P}(\lvert Z\rvert\ge2/\sqrt6)\ge1/4,\\ &{\mathbb P}(\lvert Z\rvert\ge1)\ge7/32. \end{aligned}$
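The sequences (1_n)/√n of lemma 2 attain equality in each line of theorem 3, and this can be confirmed by the counting argument described above. In the Python sketch below, the sum X₁ + ⋯ + X_n takes value k = 2m - n with probability C(n, m)/2ⁿ, and |Z| ≥ x is tested as k² ≥ x²n so that everything stays rational. The name `tail_prob` and the table `CASES` are illustrative; the trivial cases x = 0 and x > 1 are left out.

```python
from fractions import Fraction
from math import comb

def tail_prob(n, x_sq, strict=False):
    """P(|Z| >= x), or P(|Z| > x) if strict, for Z = (X_1+...+X_n)/sqrt(n).

    x_sq is x^2, passed as a Fraction to keep the comparison exact.
    """
    hits = 0
    for plus in range(n + 1):
        k = 2 * plus - n  # value of X_1 + ... + X_n with `plus` signs equal to +1
        if k * k > x_sq * n or (not strict and k * k == x_sq * n):
            hits += comb(n, plus)
    return Fraction(hits, 2 ** n)

# (n, lower endpoint squared, upper endpoint squared, claimed bound p)
CASES = [
    (2, Fraction(0), Fraction(1, 7), Fraction(1, 2)),
    (7, Fraction(1, 7), Fraction(1, 5), Fraction(29, 64)),
    (5, Fraction(1, 5), Fraction(1, 3), Fraction(3, 8)),
    (3, Fraction(1, 3), Fraction(2, 3), Fraction(1, 4)),
    (6, Fraction(2, 3), Fraction(1), Fraction(7, 32)),
]
for n, lo_sq, hi_sq, p in CASES:
    # Equality at the upper endpoint, and no atom of |Z| strictly inside the interval.
    assert tail_prob(n, hi_sq) == p
    assert tail_prob(n, lo_sq, strict=True) == p
```

Checking both the upper endpoint and the strict probability at the lower endpoint confirms that the value is constant across each interval of lemma 2.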

The last of these has been an open conjecture for years since it was mentioned in a 1996 paper by Oleszkiewicz. I asked about the first one in a 2021 mathoverflow question, and also mentioned the finite set of values x at which the optimal bound jumps, hinting at the full set of inequalities, which was the desired goal. Finally, in 2023, a preprint appeared on arXiv claiming to prove all of these. While I have not completely verified the proof and the computer programs used myself, it does look likely to be correct.

Although the bounds given above are all for anti-concentration about 0, a simple trick shows that they will also hold about any real value.

Lemma 4 Suppose that the inequality ℙ(|Z| ≥ x) ≥ p holds for all Rademacher sums Z of unit variance. Then the anti-concentration bound

$\displaystyle {\mathbb P}\left(\lvert Z-\mu\rvert\ge x\right)\ge p$

also holds about every value μ.

Proof: If Z = a·X and, independently, Y is a Rademacher random variable, then Z + μY is a Rademacher sum of variance 1 + μ². This follows from the fact that it has the same distribution as b·X where b = (μ, a₁, a₂, …). So,

$\displaystyle \begin{aligned} p &\le{\mathbb P}\left(\lvert Z+\mu Y\rvert/\sqrt{1+\mu^2}\ge x\right)\\ &=\frac12{\mathbb P}\left(\lvert Z-\mu\rvert\ge x\sqrt{1+\mu^2}\right)+\frac12{\mathbb P}\left(\lvert Z+\mu\rvert\ge x\sqrt{1+\mu^2}\right) \end{aligned}$

By symmetry of the distribution of Z, both probabilities on the right are equal giving,

$\displaystyle {\mathbb P}\left(\lvert Z-\mu\rvert\ge x\sqrt{1+\mu^2}\right)\ge p$

implying the result, since x√(1 + μ²) ≥ x. ⬜

You have probably noted already that all of the coefficients described in lemma 2 are of the form (1_n)/√n. That is, they are a finite sequence of equal values with the remaining terms being zero. In fact, it has been conjectured that all optimal concentration and anti-concentration bounds (about 0) for Rademacher sums can be attained in this way. This is attributed to Edelman dating back to private communications in 1991. If true, it would turn the process of finding and proving such optimal bounds into a straightforward calculation but, unfortunately, in 2012, the conjecture was shown to be false by Pinelis for some concentration inequalities.

Before moving on, let’s mention how such bounds can be discovered in the first place. Running a computer simulation for randomly chosen finite sequences of coefficients very quickly converges on the optimal values. As soon as we randomly select values close to those given by lemma 2, we obtain the exact bounds and any further simulations only serve to verify that these hold. Running additional simulations with coefficients chosen randomly in the neighbourhood of points where the bound is attained, and near certain ‘critical’ values where the bound looks close to being broken, further strengthens the belief that they are indeed optimal. At least, this is how I originally found them before asking the mathoverflow question, although this is still far from a proof.

Concentration Bounds

Non-Measurable Sets

Probability and measure theory relies on the concept of measurable sets. On the real numbers ℝ, in particular, there are several different sigma-algebras which are commonly used, and a set is said to be measurable if it lies in the one under consideration. Probabilities and measures are only defined for events lying in a specific sigma-algebra, so it is essential to know if sets are measurable. Fortunately, most simply constructed events will indeed be measurable, but this is not always the case. In fact, once we start working with more complex setups, such as continuous-time stochastic processes observed at random times, non-measurable sets occur more commonly than might be expected. To avoid such issues, it is usual to enlarge the underlying sigma-algebra defining a probability space as much as possible.

The Borel sets form the smallest sigma-algebra containing the open sets or, equivalently, containing all intervals. This is denoted as ${\mathcal B({\mathbb R})}$, which I will also shorten to ${\mathcal B}$. An explicit construction of a non-Borel set was given by Lusin in 1927. Every irrational real number can be expressed uniquely as a continued fraction

$\displaystyle x = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cfrac{1}{\ddots\,}}}}$

where a₀ is an integer and a_i are positive integers for i ≥ 1. Lusin considered the set of irrationals whose continued fraction coefficients contain a subsequence a_{k₁}, a_{k₂}, … such that each term is a divisor of the subsequent term.

Other examples can be given along similar lines to Lusin’s. Every real number has a binary expansion

$\displaystyle x=a_0.a_1a_2a_3\ldots$

where a₀ is an integer and a_i is in {0, 1} for each i ≥ 1. Consider the set of reals having a binary expansion for which there is an infinite sequence of positive integers k₁, k₂, …, each term strictly dividing the next, such that a_{k_i} = 1 for each i. I will give proofs that these examples are non-Borel in this post.

There is a general method of enlarging sigma-algebras known as the completion. Consider a measure μ defined on a measurable space ${(X,\mathcal E)}$ consisting of sigma-algebra ${\mathcal E}$ on set X. The completion ${\mathcal E_\mu}$ consists of all subsets S ⊆ X which can be sandwiched between sets in ${\mathcal E}$ whose difference has zero measure. That is, A ⊆ S ⊆ B for ${ A,B\in\mathcal E}$ with μ(B ∖ A) = 0. It can be shown that ${\mathcal E_\mu}$ is a sigma-algebra containing ${\mathcal E}$, and μ uniquely extends to a measure on this by taking μ(S) = μ(A) = μ(B) for S, A, B as above.

The Lebesgue measure λ is uniquely defined on the Borel sets by specifying its value on intervals as λ((a, b)) = b – a. The completion ${\mathcal B_\lambda}$ is the Lebesgue sigma-algebra, which I will denote by ${\mathcal L}$. Usually, when saying that a subset of the reals is measurable without further qualification, it is understood to mean that it is in ${\mathcal L}$. The non-Borel set constructed by Lusin can be shown to be measurable (in fact, its complement has zero Lebesgue measure).

While the Lebesgue measure extends uniquely to ${\mathcal L}$, this is not true for all measures defined on the Borel sigma-algebra. In particular, it will not be true for singular measures, which assign a positive value to some Lebesgue-null sets. An example is the uniform probability measure (or, Haar measure) on the Cantor middle-thirds set C. This has zero Lebesgue measure, so every subset of C is in ${\mathcal L}$, but the uniform measure on C cannot be extended uniquely to all subsets. For this reason, universal completions are often used. For a measurable space ${(X,\mathcal E)}$, the universal completion ${\bar{\mathcal E}}$ consists of the subsets of X which lie in the completion of ${\mathcal E}$ with respect to every possible sigma-finite measure.

$\displaystyle \bar{\mathcal E}=\bigcap_\mu\mathcal E_\mu.$

The intersection is taken over all sigma-finite measures μ on ${\mathcal E}$. It is enough to take the intersection over finite or, even, probability measures, since every sigma-finite measure is equivalent to such. The universal completion is a bit tricky to understand in an explicit way but, by construction, all sigma-finite measures on a sigma-algebra extend uniquely to its universal completion. It can be shown that Lusin’s example of a non-Borel set does lie in the universal completion ${\bar{\mathcal B}}$.

Finally, the power set ${\mathcal P(X)}$ consisting of all subsets of a set X is a sigma-algebra. For uncountable sets such as the reals, this is often too large to be of use and measures cannot be extended in any unique way. However, we have four common sigma-algebras of the real numbers,

$\displaystyle \mathcal B\subseteq\bar{\mathcal B}\subseteq\mathcal L\subseteq\mathcal P.$

(1)

In this post, I show that each of these inclusions is strict. That is, there are subsets of the reals which are not Lebesgue measurable, there are Lebesgue sets which are not in the universal completion ${\bar{\mathcal B}}$, and there are sets in ${\bar{\mathcal B}}$ which are not Borel. Lusin’s construction is an example of the latter. The strictness of the other two inclusions does depend crucially on the axiom of choice so, unlike Lusin’s set, the examples demonstrating that these are strict are not explicit.

Probability Space Extensions and Relative Products

According to Kolmogorov’s axioms, to define a probability space we start with a set Ω and an event space consisting of a sigma-algebra F on Ω. A probability measure ℙ on this gives the probability space (Ω, F , ℙ), on which we can define random variables as measurable maps from Ω to the reals or other measurable space.

However, it is common practice to suppress explicit mention of the underlying sample space Ω. The values of a random variable X: Ω → ℝ are simply written as X, rather than X(ω) for ω ∈ Ω. It is intuitively thought of as a real number which happens to be random, rather than a function. For one thing, we usually do not really care what the sample space is and, instead, only care about events and their probabilities, and about random variables and their expectations. This philosophy has some benefits. Frequently, when performing constructions, it can be useful to introduce new supplementary random variables to work with. It may be necessary to enlarge the sample space and add new events to the sigma-algebra to accommodate these. If the underlying space is not set in stone then this is straightforward to do, and we can continue to work with these new variables as if they were always there from the start.

Definition 1 An extension π of a probability space (Ω, F , ℙ) to a new space (Ω′, F ′, ℙ′),

$\displaystyle \pi\colon(\Omega',\mathcal F',{\mathbb P}')\rightarrow(\Omega,\mathcal F,{\mathbb P}),$

is a probability preserving measurable map π: Ω′ → Ω. That is, ℙ′(π^-1E) = ℙ(E) for events E ∈ F .

By construction, events E ∈ F pull back to events π^-1E ∈ F ′ with the same probabilities. Random variables X defined on (Ω, F , ℙ) lift to variables π^∗X with the same distribution defined on (Ω′, F ′, ℙ′), given by π^∗X(ω) ≡ X(π(ω)). I will use the notation X^∗ in place of π^∗X for brevity although, in applications, it is common to reuse the same symbol X and simply note that we are now working with respect to an enlarged probability space if necessary.

$\displaystyle \arraycolsep=4pt\begin{array}{rcl} \Omega'&\xrightarrow{\displaystyle\ \pi\ }&\Omega\medskip\\ & \hspace{-2em}{}_{{}_{\displaystyle X^*}}\hspace{-0.6em}\searrow&\Big\downarrow X\medskip\\ &&\,{\mathbb R} \end{array}$

The extension can be thought of in two steps. First, the enlargement of the sample space, π: Ω′ → Ω on which we induce the sigma algebra π^∗F consisting of events π^-1E for E ∈ F , and the measure ℙ′(π^-1E) = ℙ(E). This is essentially a no-op, since events and random variables on the initial space are in one-to-one correspondence with those on the enlarged space (at least, up to zero probability events). Next, we enlarge the sigma-algebra to F ′ ⊇ π^∗F and extend the measure ℙ′ to this. It is this second step which introduces new events and random variables.

Since we may want to extend a probability space more than a single time, I look at how these combine. Consider an extension π of the original probability space, and then a further extension ρ of this.

$\displaystyle (\Omega'',\mathcal F'',{\mathbb P}'')\xrightarrow{\rho} (\Omega',\mathcal F',{\mathbb P}')\xrightarrow{\pi} (\Omega,\mathcal F,{\mathbb P}).$

These can be combined into a single extension ϕ = π○ρ of the original space,

$\displaystyle \phi\colon(\Omega'',\mathcal F'',{\mathbb P}'')\rightarrow(\Omega,\mathcal F,{\mathbb P}).$

Lemma 2 The composition ϕ = π○ρ is itself an extension of the probability space.

Proof: As compositions of measurable maps are measurable, it is sufficient to check that ϕ preserves probabilities. This is straightforward,

$\displaystyle {\mathbb P}''(\phi^{-1}E)={\mathbb P}''(\rho^{-1}\pi^{-1}E)={\mathbb P}'(\pi^{-1}E)={\mathbb P}(E)$

for all E ∈ F . ⬜

So far, so simple. The main purpose of this post, however, is to look at the situation with two separate extensions of the same underlying space. Both of these will add in some additional source of randomness, and we would like to combine them into a single extension.

Separate probability spaces can be combined by the product measure, which is the measure on the product space for which the projections onto the original spaces preserve probability, and for which the sigma-algebras generated by these projections are independent. Recall that a pair of sigma-algebras F and G defined on a probability space are independent if ℙ(A ∩ B) = ℙ(A)ℙ(B) for all sets A ∈ F and B ∈ G.

Combining extensions of probability spaces will, instead, make use of relative independence.

Definition 3 Let (Ω, F , ℙ) be a probability space. Two sub-sigma-algebras G , H ⊆ F are relatively independent over a third sigma-algebra K ⊆ G ∩ H if

$\displaystyle {\mathbb P}(A\cap B) = {\mathbb E}\left[{\mathbb P}(A\vert\mathcal K){\mathbb P}(B\vert\mathcal K)\right]$ (1)

for all A ∈ G and B ∈ H .

It can be shown that the following properties are each equivalent to this definition:

𝔼[XY|K ] = 𝔼[X|K ]𝔼[Y|K ] for all bounded G -measurable random variables X and H -measurable Y.
𝔼[X|G ] = 𝔼[X|K ] for all bounded H -measurable X.
𝔼[X|H ] = 𝔼[X|K ] for all bounded G -measurable X.

Once a probability measure is specified separately on G and H then its extension to the sigma-algebra generated by G ∪ H , if it exists, is uniquely determined by relative independence. This is a consequence of the pi-system lemma, since (1) defines it on the events {A ∩ B: A ∈ G , B ∈ H }, which is a pi-system generating the same sigma-algebra.

Now consider two separate extensions π₁ and π₂ of the same underlying probability space,

$\displaystyle (\Omega_1,\mathcal F^1,{\mathbb P}_1)\xrightarrow{\pi_1} (\Omega,\mathcal F,{\mathbb P})\xleftarrow{\pi_2} (\Omega_2,\mathcal F^2,{\mathbb P}_2)$

As maps between sets, these can both be embedded into a single extension known as the pullback or fiber product. This is the set Ω′= Ω₁ ×_Ω Ω₂ defined by

$\displaystyle \Omega_1\times_{\Omega}\Omega_2=\left\{(\omega_1,\omega_2)\in\Omega_1\times\Omega_2\colon\pi_1(\omega_1)=\pi_2(\omega_2)\right\}.$

Defining projection maps ρ_i: Ω′ → Ω_i by

$\displaystyle \rho_1(\omega_1,\omega_2)=\omega_1,\ \rho_2(\omega_1,\omega_2)=\omega_2$

results in a commutative square with ϕ ≡ π₁○ρ₁ = π₂○ρ₂,

$\displaystyle \arraycolsep=1.4pt\begin{array}{rcl} \Omega'\ &\xrightarrow{\displaystyle\ \rho_1\ }&\Omega_1\medskip\\ {\rho_2}\Big\downarrow\,\ &\searrow^{\hspace{-0.3em}\displaystyle\phi}&\,\Big\downarrow{\pi_1}\medskip\\ \Omega_2\,&\xrightarrow{\displaystyle\ \pi_2\ }&\,\Omega \end{array}$

In fact, Ω′ is exactly the cartesian product Ω₁ × Ω₂ restricted to the subset on which π₁○ρ₁ and π₂○ρ₂ agree.

This constructs an extension ϕ of the sample space containing π₁ and π₂ as sub-extensions. However, it still needs to be made into a probability space. Use the smallest sigma-algebra F ′ on Ω′ making ρ₁, ρ₂ into measurable maps, which is generated by ρ₁^∗F ¹ ∪ ρ₂^∗F ². The probability measure ℙ′ on (Ω′, F ′) is uniquely determined on each of the sub-sigma-algebras by the requirement that ρ_i preserve probabilities,

$\displaystyle {\mathbb P}'(\rho_i^{-1}A)={\mathbb P}_i(A)$

for i = 1, 2 and A ∈ F ⁱ. These necessarily agree on ϕ^∗F ⊆ ρ₁^∗F ¹ ∩ ρ₂^∗F ²,

$\displaystyle {\mathbb P}'(\phi^*A)={\mathbb P}'(\rho_i^{-1}\pi_i^{-1}A)={\mathbb P}_i(\pi_i^{-1}A)={\mathbb P}(A)$

for A ∈ F . The natural way to extend ℙ′ to all of F ′ is to use relative independence over ϕ^∗F .

Definition 4 The relative product of the extensions π₁ and π₂ is the extension

$\displaystyle \phi\colon(\Omega',\mathcal F',{\mathbb P}')\rightarrow(\Omega,\mathcal F,{\mathbb P})$

with ϕ, Ω′, F ′ constructed as above, and with ℙ′ the unique probability measure for which the projections ρ₁, ρ₂ preserve probabilities, and for which ρ₁^∗F ¹ and ρ₂^∗F ² are relatively independent over ϕ^∗F .
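To make the construction concrete, here is a minimal sketch of the relative product for finite sample spaces, assuming every sigma-algebra is the full power set so that measures are determined by point masses. The function name `relative_product`, the helper `marginal`, and the example spaces are hypothetical and chosen only for illustration. In this setting, relative independence over the base space amounts to ℙ′(ω₁, ω₂) = ℙ₁(ω₁)ℙ₂(ω₂)/ℙ(ω) whenever π₁(ω₁) = π₂(ω₂) = ω.

```python
from fractions import Fraction
from itertools import product

def relative_product(P, P1, pi1, P2, pi2):
    """Point masses of the relative product of two finite extensions.

    P, P1, P2 map sample points to probabilities. pi1 and pi2 map points of
    the two extensions down to points of the base space.
    """
    P_prime = {}
    for w1, w2 in product(P1, P2):
        w = pi1[w1]
        # Keep only pairs in the fiber product, i.e. lying over the same base point.
        if w == pi2[w2] and P[w] > 0:
            # Given the base point w, the two coordinates are independent.
            P_prime[(w1, w2)] = P1[w1] * P2[w2] / P[w]
    return P_prime

def marginal(P_prime, index):
    """Push the measure forward along the projection rho_1 (index 0) or rho_2 (index 1)."""
    out = {}
    for pair, p in P_prime.items():
        out[pair[index]] = out.get(pair[index], 0) + p
    return out

# Illustrative base space with two points.
P = {"a": Fraction(1, 3), "b": Fraction(2, 3)}
# Two extensions; each point records its base point as the first coordinate.
P1 = {("a", 0): Fraction(1, 6), ("a", 1): Fraction(1, 6),
      ("b", 0): Fraction(1, 2), ("b", 1): Fraction(1, 6)}
P2 = {("a", "x"): Fraction(1, 3), ("b", "x"): Fraction(1, 3),
      ("b", "y"): Fraction(1, 3)}
pi1 = {w1: w1[0] for w1 in P1}
pi2 = {w2: w2[0] for w2 in P2}

P_prime = relative_product(P, P1, pi1, P2, pi2)
```

Pushing `P_prime` forward along either projection recovers `P1` and `P2`, which is the requirement that ρ₁ and ρ₂ preserve probabilities.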


Independence of Normals

A well-known fact about joint normal random variables is that they are independent if and only if their covariance is zero. In one direction, this statement is trivial. Any independent pair of random variables has zero covariance (assuming that they are integrable, so that the covariance has a well-defined value). The strength of the statement is in the other direction. Knowing the value of the covariance does not tell us a lot about the joint distribution so, in the case that they are joint normal, the fact that we can determine independence from this is a rather strong statement.

Theorem 1 A joint normal pair of random variables are independent if and only if their covariance is zero.

Proof: Suppose that X,Y are joint normal, such that ${X\overset d= N(\mu_X,\sigma^2_X)}$ and ${Y\overset d=N(\mu_Y,\sigma_Y^2)}$ , and that their covariance is c. Then, the characteristic function of ${(X,Y)}$ can be computed as

$\displaystyle \begin{aligned} {\mathbb E}\left[e^{iaX+ibY}\right] &=e^{ia\mu_X+ib\mu_Y-\frac12(a^2\sigma_X^2+2abc+b^2\sigma_Y^2)}\\ &=e^{-abc}{\mathbb E}\left[e^{iaX}\right]{\mathbb E}\left[e^{ibY}\right] \end{aligned}$

for all ${(a,b)\in{\mathbb R}^2}$ . It is standard that the joint characteristic function of a pair of random variables is equal to the product of their characteristic functions if and only if they are independent which, in this case, corresponds to the covariance c being zero. ⬜

To demonstrate necessity of the joint normality condition, consider the example from the previous post.

Example 1 A pair of standard normal random variables X,Y which have zero covariance, but ${X+Y}$ is not normal.

As their sum is not normal, X and Y cannot be independent. This example was constructed by setting ${Y={\rm sgn}(\lvert X\rvert -K)X}$ for some fixed ${K > 0}$ , which is standard normal whenever X is. As explained in the previous post, the intermediate value theorem ensures that there is a value for K making the covariance ${{\mathbb E}[XY]}$ equal to zero, and this value is unique because the covariance is strictly decreasing in K.
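As a rough numerical illustration of Example 1, the value of K can be located by bisection. This sketch relies on the closed form ${{\mathbb E}[XY]=3-4\Phi(K)+4K\varphi(K)}$ , which follows from ${{\mathbb E}[XY]=1-2{\mathbb E}[X^2;\lvert X\rvert < K]}$ and integration by parts. The function names and the bracketing interval are illustrative choices, not taken from the post.

```python
import math

def std_normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def covariance(K):
    """E[XY] for Y = sgn(|X| - K) X with X standard normal."""
    return 3.0 - 4.0 * std_normal_cdf(K) + 4.0 * K * std_normal_pdf(K)

def find_K(lo=0.0, hi=10.0, tol=1e-12):
    """Bisection: covariance(K) decreases continuously from 1 at K = 0 towards -1."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if covariance(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

K = find_K()
```

The root lies between 1.5 and 1.6, and `covariance(K)` vanishes up to rounding error.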

Multivariate Normal Distributions

I looked at normal random variables in an earlier post, but what does it mean for a sequence of real-valued random variables ${X_1,X_2,\ldots,X_n}$ to be jointly normal? We could simply require each of them to be normal, but this says very little about their joint distribution and is not much help in handling expressions involving more than one of the ${X_i}$ at once. In the case that the random variables are independent, the following result is a very useful property of the normal distribution. All random variables in this post will be real-valued, except where stated otherwise, and we assume that they are defined with respect to some underlying probability space ${(\Omega,\mathcal F,{\mathbb P})}$ .

Lemma 1 Linear combinations of independent normal random variables are again normal.

Proof: More precisely, if ${X_1,\ldots,X_n}$ is a sequence of independent normal random variables and ${a_1,\ldots,a_n}$ are real numbers, then ${Y=a_1X_1+\cdots+a_nX_n}$ is normal. Let us suppose that ${X_k}$ has mean ${\mu_k}$ and variance ${\sigma_k^2}$ . Then, the characteristic function of Y can be computed using the independence property and the characteristic functions of the individual normals,

$\displaystyle \begin{aligned} {\mathbb E}\left[e^{i\lambda Y}\right] &={\mathbb E}\left[\prod_ke^{i\lambda a_k X_k}\right] =\prod_k{\mathbb E}\left[e^{i\lambda a_k X_k}\right]\\ &=\prod_ke^{-\frac12\lambda^2a_k^2\sigma_k^2+i\lambda a_k\mu_k} =e^{-\frac12\lambda^2\sigma^2+i\lambda\mu} \end{aligned}$

where we have set ${\mu=\sum_ka_k\mu_k}$ and ${\sigma^2=\sum_ka_k^2\sigma_k^2}$ . This is the characteristic function of a normal random variable with mean ${\mu}$ and variance ${\sigma^2}$ . ⬜
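The computation in the proof can be checked numerically. The following sketch multiplies the characteristic functions of the individual terms and compares the result with the characteristic function of a normal with mean ${\sum_ka_k\mu_k}$ and variance ${\sum_ka_k^2\sigma_k^2}$ . The coefficients, means, variances and the value of λ are arbitrary illustrative choices.

```python
import cmath

def normal_cf(lam, mu, var):
    """Characteristic function E[exp(i*lam*X)] of X ~ N(mu, var)."""
    return cmath.exp(1j * lam * mu - 0.5 * lam * lam * var)

def combination_params(a, mus, variances):
    """Mean and variance of sum_k a_k X_k for independent X_k ~ N(mu_k, sigma_k^2)."""
    mu = sum(ak * mk for ak, mk in zip(a, mus))
    var = sum(ak * ak * vk for ak, vk in zip(a, variances))
    return mu, var

# Illustrative parameters.
a = [2.0, -1.0, 0.5]
mus = [1.0, 0.0, -3.0]
variances = [1.0, 4.0, 0.25]
mu, var = combination_params(a, mus, variances)

lam = 0.7
# Left side: product over k of E[exp(i*lam*a_k*X_k)], using independence.
lhs = 1.0 + 0j
for ak, mk, vk in zip(a, mus, variances):
    lhs *= normal_cf(lam * ak, mk, vk)
# Right side: characteristic function of N(mu, var) at lam.
rhs = normal_cf(lam, mu, var)
```

Here `lhs` and `rhs` agree up to floating-point rounding.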

The definition of joint normal random variables will include the case of independent normals, so that any linear combination is also normal. We use this result as the defining property for the general multivariate normal case.

Definition 2 A collection ${\{X_i\}_{i\in I}}$ of real-valued random variables is multivariate normal (or joint normal) if and only if all of its finite linear combinations are normal.


The Riemann Zeta Function and Probability Distributions

Phi and Psi densities — Figure 1: Probability densities used to extend the zeta function

The famous Riemann zeta function was first introduced by Riemann in order to describe the distribution of the prime numbers. It is defined by the infinite sum

$\displaystyle \begin{aligned} \zeta(s) &=1+2^{-s}+3^{-s}+4^{-s}+\cdots\\ &=\sum_{n=1}^\infty n^{-s}, \end{aligned}$ (1)

which is absolutely convergent for all complex s with real part greater than one. One of the first properties of this is that, as shown by Riemann, it extends to an analytic function on the entire complex plane, other than a simple pole at ${s=1}$ . By the theory of analytic continuation this extension is necessarily unique, so the importance of the result lies in showing that an extension exists. One way of doing this is to find an alternative expression for the zeta function which is well defined everywhere. For example, it can be expressed as an absolutely convergent integral, as performed by Riemann himself in his original 1859 paper on the subject. This leads to an explicit expression for the zeta function, scaled by an analytic prefactor, as the integral of ${x^s}$ multiplied by a function of x over the range ${ x > 0}$ . In fact, this can be done in a way such that the function of x is a probability density function, and hence expresses the Riemann zeta function over the entire complex plane in terms of the generating function ${{\mathbb E}[X^s]}$ of a positive random variable X. The probability distributions involved here are not the standard ones taught to students of probability theory, so may be new to many people. Although these distributions are intimately related to the Riemann zeta function they also, intriguingly, turn up in seemingly unrelated contexts involving Brownian motion.
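For real part of s greater than one, the defining sum can be evaluated directly. As a small sketch, the partial sums at ${s=2}$ can be compared with Euler's value ${\zeta(2)=\pi^2/6}$ . The function name and the number of terms are illustrative choices.

```python
import math

def zeta_partial_sum(s, n_terms):
    """Sum of k^(-s) for k = 1..n_terms, which converges to zeta(s) when Re(s) > 1."""
    return sum(k ** (-s) for k in range(1, n_terms + 1))

approx = zeta_partial_sum(2, 100_000)
exact = math.pi ** 2 / 6  # Euler's value of zeta(2)
```

The remaining tail of the sum is of order 1/N after N terms, so convergence is slow. It fails altogether once the real part of s drops to one or below, which is why an alternative expression is needed for the extension.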

In this post, I derive two probability distributions related to the extension of the Riemann zeta function, and describe some of their properties. I also show how they can be constructed as the sum of a sequence of gamma distributed random variables. For motivation, some examples are given of where they show up in apparently unrelated areas of probability theory, although I do not give proofs of these statements here. For more information, see the 2001 paper Probability laws related to the Jacobi theta and Riemann zeta functions, and Brownian excursions by Biane, Pitman, and Yor.

Manipulating the Normal Distribution

The normal (or Gaussian) distribution is ubiquitous throughout probability theory for various reasons, including the central limit theorem, the fact that it is realistic for many practical applications, and because it satisfies nice properties making it amenable to mathematical manipulation. It is, therefore, one of the first continuous distributions that students encounter at school. As such, it is not something that I have spent much time discussing on this blog, which is usually concerned with more advanced topics. However, normal distributions have many nice properties, and there are methods that can be applied to them, greatly simplifying the manipulation of expressions in which they are involved. While it is usually possible to ignore these, and instead just substitute in the density function and manipulate the resulting integrals, that approach can get very messy. So, I will describe some of the basic results and ideas that I use frequently.

Throughout, I assume the existence of an underlying probability space ${(\Omega,\mathcal F,{\mathbb P})}$ . Recall that a real-valued random variable X has the standard normal distribution if it has a probability density function given by,

$\displaystyle \varphi(x)=\frac1{\sqrt{2\pi}}e^{-\frac{x^2}2}.$

For it to function as a probability density, it is necessary that it integrates to one. While it is not obvious that the normalization factor ${1/\sqrt{2\pi}}$ is the correct value for this to be true, it is the one fact that I state here without proof. Wikipedia does list a couple of proofs, which can be referred to. By symmetry, ${-X}$ and ${X}$ have the same distribution, so that they have the same mean and, therefore, ${{\mathbb E}[X]=0}$ .

The derivative of the density function satisfies the useful identity

$\displaystyle \varphi^\prime(x)=-x\varphi(x).$ (1)

This allows us to quickly verify that standard normal variables have unit variance, by an application of integration by parts.

$\displaystyle \begin{aligned} {\mathbb E}[X^2] &=\int x^2\varphi(x)dx\\ &= -\int x\varphi^\prime(x)dx\\ &=\int\varphi(x)dx-[x\varphi(x)]_{-\infty}^\infty=1 \end{aligned}$

Another identity satisfied by the normal density function is,

$\displaystyle \varphi(x+y)=e^{-xy - \frac{y^2}2}\varphi(x).$ (2)

This enables us to prove the following very useful result. In fact, it is difficult to overstate how helpful this result can be. I make use of it frequently when manipulating expressions involving normal variables, as it significantly simplifies the calculations. It is also easy to remember, and simple to derive if needed.

Theorem 1 Let X be standard normal and ${f\colon{\mathbb R}\rightarrow{\mathbb R}_+}$ be measurable. Then, for all ${\lambda\in{\mathbb R}}$ ,

$\displaystyle \begin{aligned} {\mathbb E}[e^{\lambda X}f(X)] &={\mathbb E}[e^{\lambda X}]{\mathbb E}[f(X+\lambda)]\\ &=e^{\frac{\lambda^2}{2}}{\mathbb E}[f(X+\lambda)]. \end{aligned}$ (3)
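As a quick sanity check of (3), both sides can be approximated by numerical integration against the density φ. The sketch below uses a plain trapezoidal rule on a truncated range with the test function ${f(x)=x^2}$ , for which ${{\mathbb E}[f(X+\lambda)]=1+\lambda^2}$ . The helper names, the truncation range, the grid size and the value of λ are illustrative assumptions.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def expect(g, lo=-12.0, hi=12.0, n=24_000):
    """Approximate E[g(X)] for standard normal X by the trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (g(lo) * phi(lo) + g(hi) * phi(hi))
    for k in range(1, n):
        x = lo + k * h
        total += g(x) * phi(x)
    return total * h

lam = 0.8

def f(x):
    """A nonnegative measurable test function."""
    return x * x

# Left side of (3): E[exp(lam * X) f(X)].
lhs = expect(lambda x: math.exp(lam * x) * f(x))
# Right side of (3): exp(lam^2 / 2) E[f(X + lam)].
rhs = math.exp(0.5 * lam * lam) * expect(lambda x: f(x + lam))
# For f(x) = x^2 the right side has the closed form exp(lam^2 / 2) (1 + lam^2).
closed_form = math.exp(0.5 * lam * lam) * (1.0 + lam * lam)
```

All three quantities agree to within the accuracy of the quadrature.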


Quantum Coin Tossing


Let me ask the following very simple question. Suppose that I toss a pair of identical coins at the same time, then what is the probability of them both coming up heads? There is no catch here, both coins are fair. There are three possible outcomes, both tails, one head and one tail, and both heads. Assuming that it is completely random so that all outcomes are equally likely, then we could argue that each possibility has a one in three chance of occurring, so that the answer to the question is that the probability is 1/3.

Of course, this is wrong! A fair coin has a probability of 1/2 of showing heads and, by independence, standard probability theory says that we should multiply these together for each coin to get the correct answer of ${\frac12\times\frac12=\frac14}$ , which can be verified by experiment. Alternatively, we can note that the outcome of one tail and one head, in reality, consists of two equally likely possibilities. Either the first coin can be a head and the second a tail, or vice-versa. So, there are actually four equally likely possible outcomes, only one of which has both coins showing heads, again giving a probability of 1/4.
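The counting argument can be spelled out by enumerating the ordered outcomes of the two coins and then grouping them by what is observed when the coins are not distinguished. This is only a small illustrative sketch, and the variable names are arbitrary.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The four equally likely ordered outcomes: (first coin, second coin).
outcomes = list(product("HT", repeat=2))
prob = Fraction(1, len(outcomes))

# Probability that both coins show heads.
p_both_heads = sum(prob for o in outcomes if o == ("H", "H"))

# Group by unordered result: "HH", "HT" (one of each) and "TT".
unordered = Counter("".join(sorted(o)) for o in outcomes)
p_unordered = {result: count * prob for result, count in unordered.items()}
```

The three unordered results have probabilities 1/4, 1/2 and 1/4, rather than 1/3 each.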
