Bayesian Statistics by Example

Who is this lecture for?

I begin every course/lecture with a brief description of the hypothetical student I had in mind when designing the course. This is in part to set the audience's expectations, but also to keep me true to the intended spirit of the course as I write it. This course is intended for those with only some experience with statistics. You are probably vaguely familiar with simple problems like coin flips, t tests, and the monty hall problem...but you approach statistics as you would that guy you once said hi to in the elevator that one time (i.e. awkwardly).

However, you WANT to know more. You are interested not only in applying statistical methodologies to problems, but actually understanding where the methodologies come from and the philosophic paradigms from which they were spawned. You understand that applying math can be as much an art as a science, and not, as many think, a game of rock-paper-scissors where one need only guess the right method for the right situation.

There are many more advanced and thorough treatments of bayesian statistics out there. Unfortuantely, they are often written with a more advanced student body in mind. This is rather tragic, as the beauty of Bayesian statistics is that it presents a philosophy of statistics which is intuitive, explainable, and simple. This lecture sets out to prove that Bayesian statistics, even in it's most complex and powerful forms, can be explained in english through a set of simple, common problems any practitioner of the dark probablistics arts may encounter. At times we will delve deep into the math behind our concepts, but the course is (hopefully) structed such that you can skip most of the more tedious proofs/derivations and simply focus on concepts and practice.

DO NOT read this whole lecture in one sitting. Read a section and let it sit and ferment in your head. The point of these examples isn't to memorize them, but to really understand them. Bayesian statistics is about building intuition around data and quantifying that intuition in a clear and powerful way...not memorizing formulas. Try to take an active part in the lessons by tweaking the code, changing parameters, or making up your own examples for the concepts!

While I have shaped and ammended much of the content shown here, almost 100% of this lecture is taken from other source material (which I reference when possible). Nothing here is new, I've just molded it into a more palatable form than you might find elsewhere. If you enjoy this course, please keep an eye out for my other courses and blog posts at!

I have mostly drawn material in this course from Thomas Wiecki, John Kruschke, and Andrew Gelman and encourage you to look up their works on bayesian statistics. I would also like to thank Mack Sweeney, Jason Wittenbach, Josh Touyz, and Keegan Hines without whom I wouldn't have nearly the grasp of Bayes' Theorem that I have today.

What is statistics?

At the risk of immediately losing your attention, I start with a purely philosophical question: what is statistics? I'm not asking for a dictionary definition...we all know that statisitics is a branch of mathmatics. I'm really asking: what is the goal or point of using statistics?

I believe that the purpose of statistics is to quanitfy our beliefs. To not only state that something has or will occur, but also the certainty that it has or will occur. We want to state these beliefs in a way that is standardized and logical so that we can convince others to share our beliefs or at least have a reasonable dialogue as to why they might not share our beliefs. To this end, our beliefs become only as powerful as the medium through which we express ourselves to convince others of the truth of our beliefs. In other words, academia often demands that we put our math where our mouth we should be good at math.

There have been many tries to create a universally accepted paradigm for expressing opinions in statistics. We will discuss the two most prominent ones: the frequency based philosophy and the belief based philosophy...aka Frequentist statistics and Bayesian Statitistics.

For those strict adherants to the cults of Frequentism and Bayesianism...I apologize in advance for breaking down what are nuanced and complex methodologies into heuristics and crude analogies, but I'll reiterate that this is not meant to be a rigorous course on statistics, only an illustrative pass through the philosophy behind Bayes' Theorem.

The Frequentist Philosophy

You're probably quite familiar with frequentist statistics, even if you don't know it. It's the philosophy which guides most high school and college stats courses. This philosophy has inspired such famous methodologies as the student t test and bootstrap sampling.

Unfortuantely, you're most likely more familiar with the methodologies produced by this philosophy than the philosophy itself. But we don't want to take methodologies on blind faith; we want to know where they come from! Buried in the assumptions of every frequentist model of the world are a few guiding principles.

To explain these principles, I'll borrow (and bastardize) an analogy from Plato. Let's say that we're looking at the wall of a cave. Behind us is a fire lighting up the cave. Someone places an object behind us and in front of the fire such that we can't see the object but we can see it's shadow. We carefully take notes on the characteristics of it's shadow over time as the flame flickers and dances about. We might note the frequncy with which the shadows take certain shapes and sizes. This provides us with information about the nature of the object behind us, but also some ability to predict the nature of shadows that the object might cast were we to continue to sit and watch the wall.

In this example the shadows are samples that we observe through testing. The object represents some 'true' distribution of data that is fixed but that we can not directly observe. By collecting samples over time (observing data) we can gather information about the frequency of various events.

Probability, under this paradigm, becomes an expression of relative frequency (how often an event occurs comapred to other events).

Let's work through an example! Let's say that we have a coin of unknown bias. This means that we don't know the probability of it coming up heads or tails. First, we'll try to answer a very simple question: "what is the probability of getting heads on our next flip?"

The subtext in this example is that when we ask the question "what is the probability of getting heads?", we are actually asking "given the observed relative frequency of getting heads, how likely am I to observe heads if I flip again?".

We define this probability as the reatively frequency with which we see heads during experimentation as the number of samples observed goes to infinity.

$P(heads) = \lim_{n\to\infty} (n_{heads}/n$)
$n_{heads}$ = # of trials where we got heads
n = number of trials total

Now we can't realistically perform an infinite number of flips. Let's say we take a single sample of 10 observations (1 = heads, 0 = tails):


The relative frequency of seeing heads is 5/10 (you'll note that this is merely the mean of the observations or the 'sample mean'). Our expectation for the next flip being heads is 50%. We'll assume for now that consecutive coin flips do not affect each other (they are independant).

However, the point estimate of our probability (50%) holds no information about how certain we are that we have captured the 'true' probability of getting heads in our next flip. It provides no information as to how certain we are of our prediction. What if we haven't conducted enough trials? What if there is noise in our experiment that is unaccounted for?

We could try to come up with some range of values wherein we think the 'true' value lies. The magnitude of this range is proportional to the uncertainty within our sample. Well, here we start to get into the math a bit. Feel free to skip the derivations if you're not interested!

We've said that the probability of seeing an event is proportional to the relative frequency with which we see that event. Let's write that down in math.

$P(heads|p)=p^{\sum_1^n x_i}(1-p)^{\sum_1^n1-x_i}= p^{x}(1-p)^{n-x}$
p = relative frequency of seeing heads
$x_i$ represents an individual trial
n = the total number of trials

This equation gives the probability of seeing heads in n trials given the relative frequency of seeing heads that we have observed. We call this the liklihood. The probability of getting heads if we only flip the coin once, given that the relative frequency of heads we previously observed was 50%, is:


Just as we would expect. It's neat to note that the liklihood above is maximized when p = x / n (test this by setting the derivative of the log liklihood with respect to p to zero and solving for p).

We take the log of our liklihood (the logliklihood):


The first derivative of the logliklihood:


And the second derivative of the logliklihood:


We use these to calculate the Fisher information of our parameter. Without going into too much detail, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends.

$I(p)= \sum_{i=1}^{n} \dfrac{1}{p(1-p)} =\dfrac{n}{p(1-p)}$

Each observation carries a certain amount of information (dive into the fisher information). The sum of this information gives us the cummulative amount of information contained in the sample about a particular unknown variable (in this case the 'true' mean). We could say that the inverse of the information in the sample is the variance of the sample.



This makes sense given that our variance/uncertainty will decrease as the number of observations increases.

Now that we have a measure of uncertainty/variance, let's define our range of values in which we think the 'true' mean may lie. We design this range such that it increases with our uncertainty. We set a tuning parameter Z depending on how sensitive we want our range to be to capturing the 'true' parameter.

$\hat{p}\pm Z \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
$\hat{p}$ = sample mean
n = number of trials
Z = z score constant

We typically call this a confidence interval and set Z to 1.96 (95% confidence interval). This interval roughly captures 95% of the observations within our sample. For a more thorough treatment of binomial confidence intervals, see here.

Now the point estimate was easy to interpret. it represented the relative frequency of seeing a particular event. How do we interpret this range of values?

We can interprete this interval as follows: were this test to be repeated on numerous samples, the fraction of calculated confidence intervals (which would differ for each sample) that encompass the true population parameter would tend toward 95%.

At least for me, this is unintuitive and disappointing because, as you'll note if you read carefully, we are not able to say much about our certainty of a specific outcome in a specific sample. We are quantifying our uncertinty across samples, rather than of seeing any particular sample. The probability that any particular sample confidence interval captures the 'true' mean is either zero or one (and we don't know which)!

At this point you might justly respond, "So what if 95 out of 100 theoretical experiments yield a confidence interval that includes the true value? I don't care about 99 experiments that I DIDN'T DO; I care about the experiment I DID DO. Additionally, this methodology allows 5 out of the 100 theoretical experiments to yield complete nonsense as long as the other 95 are correct; that's ridiculous." If you look carefully at our equaltion for capturing the Fisher information, you'll note that for small sample sizes there is a very real possibility of capturing an outcome that pushes the lower bound of the confidence interval below zero (realistically impossible) or to have a confidence interal of zero width (again, unrealistic).

We can mitigate the risk of these events occuring at low sample sizes by amendeding our calculation of the expectation from p = x/n to:

$p = \dfrac{x + 0.5}{n + 1}$

This is the same as saying we observed half a success and half a failure and serves to keeps the confidence interval from becoming degenerate. If that makes you feel uncomfortable...good! We've imposed some strange prior belief (that we expect the coin to be fair) upon our experiment even before we take into account our observations.

Well, let's take one more shot at quantifying the uncertainty in our prediction. Let's assume that the distribution of outcomes in our sample is representative of the 'true' population. This doesn't seem like a horrible assumption given that if the sample wasn't representative of the 'true' population there would be no point in using it to make predictions at all!

We could sub-sample our sample and calculate the variance among our sub-samples! Let's say we take twenty sub-samples (with replacement) of five from our ten observations and calculate the variance of the means of these sub-samples. We get a standard deviation of 14.8%. This metric seems a bit odd. For one thing, it is independant of the number of samples. I could have performed this methodology on a sample with one observation! We call this method Bootstrap Sampling

Okay, so expressing probability completely in terms of frequencies is a bit of a let down. What's the alternative?

The Bayesian Philosophy

Bayesians reject the Frequentist idea that there is a 'true' population from which we pluck samples with some probability. Bayesian statistics tries to set up a framework for evaluating how well sets of hypothesis describe the process by which observed data was generated.

Rather than defining probability as sets of relative frequencies, bayesians define probability as plausibility. This is a huge difference! A Frequentist can't say that they are X% sure of their prediction, only that X% of experiments of similar methodology would produce conclusions that would help arrive at accurate predictions. A Bayesian could say that they are X% sure of their prediction!...but would you believe them?

First, let's write out the Bayesian philosophy of probability in math jargon (as we did with the frequentist philosophy).

We start with a very basic axiom of probability, the joint probability:


This says that the joint probability of A and B (the probability of both events occuring together) is equal to the probability of event A occuring mutliplied by the probability of event B occuring given the fact that A has occured. That is pretty fundamental.

Let's say that A is the probaility of my mother visiting. B is the probability that I am eating Indian food (my mother is Indian if that helps). My mother randomly visits once a month (1/30). My mother insists on eating Indian food with me every other visit (1/2). The probability that my mother is visting AND that I am eating Indian food is therefore (1/30)*(1/2)=1/60.

From the joint probability, we can show that:


and so:


Dividing both sides by P(B) gives us:


We call P(A) our prior belief. It represents our belief about the outcome before we see any test data. P(B|A) is our liklihood. It represents the probability of seeing our test data given some outcome is true. P(B) is our evidence. It represents the probability of seeing the data that we saw. Finally, P(A|B) is out posterior. It is our belief in the outcome given our prior beliefs and the data that we observed. This is called Bayes Theorem and is the under pinning of Bayesian Statistics.

To illustrate how to use this theorem, let's try calculating the probability that a card (from a standard deck) is a King given that it was a face card (jack, queen, king).


Let's try this theorem out with our coin flip problem. To spice things up and keep you interested, let's assume that we just saw 10 heads in a row! Wow, now that is weird.

First, let's define our prior assumptions. I don't know about you, but almost every coin I've encountered has been a fair coin (50% chance of heads). I therefore create a strong prior belief that the coin is fair (remember, I haven't flipped the coin yet! This is my belief with no data). However...I have this naging feeling that the coin might be unfair. Perhaps it dropped from a magician's pocket or it has some manufacturing imperfection. I therefore say that I am 97.9% certain that the probability of getting heads is 50%, I am 1% certain that the coin fell from a magicians pocket and is perfectly biased (always heads or tails), and that I am .1% sure that it has some other biasing imperfection.

$\delta$ is the dirac delta function

Next we calculate the liklihood. This is rather simple as it's the product of the outcome of each of our observations.

$p(B|A)=A^{n} $
n = # of trials

Finally we can put the evidence, P(B), in terms of the prior and the liklihood by integrating over every posible value of A. This is like calculating the probability of see the data that we did given every possible bias of the coin.

$P(B)={\int_0^1 p(B,A)}dA={\int_0^1 p(B|A)p(A)dA}={\int_0^1 A^{10}p(A)dA}$

Let's put it all together:

$p(A|B)=\frac{p(B|A)p(A)}{\int_0^1 p(B|A)p(A)dA} = \frac{A^{10}p(A)}{\int_0^1 A^{10}p(A)dA}.$

And we can now solve for our belief:

$\begin{multline} \int_0^1 A^{10}p(A)dA=0.979\cdot0.5^{10}+0.01\cdot 1^{10}+0.01 \cdot 0^{10}

  • 0.001\int_0^1 A^{10}dA = 9.56\cdot 10^{-4} + 0.01 + 9.09\cdot 10^{-5}=0.0110 \end{multline}$


We now have our posterior belief. We are 90.5% sure that the coin will always return heads. We are 8.7% sure that the coin is fair and the data was just coincidence. And we are 0.8% sure that the coin has some other bias.

Now our posterior belief describes how certain we are that the coin is baised. How do we find the probability that the next flip will be heads? That part is simple, we calcualte the probability of getting heads given our data and belief that the coin is biased:

$\begin{multline} p(Heads|B) = \int_0^1 p(Heads|B,A)p(A|B)dA = \int_0^1 A \, p(A|B)dA = 0.087\cdot 0.5+0.905\cdot 1+0.091\cdot \int_0^1A^{11}dA = 0.956. \end{multline}$

After seeing 10 consecutive heads, we are 90.5% sure that the probability of getting heads is 100%. And after seeing 10 consecutive heads, we are 95.6% certain that we will see heads on the next flip. That is a lot of useful information!

Now these conclusions are heavily dependant on our prior beliefs. A Frequentist might challenge us to justify why our result should be impacted by our intuitions before observing data. It is up to us to identify a prior that makes sense and that we can defend infront of our peers.

In this case I used a very strong prior centered around 50% that had a reasonable impact on our posterior (look at that 8.7% belief), but I could have engineered a weak prior (such as a uniform distribution between 0 and 1) that would have had very little impact on the posterior. I rather like the prior that I used given that in real life it would take more than 10 consecutive observations of heads to make me believe that a coin was completely biased.

It's also worth noting that the more observations that we have, the more the liklihood will influence the posterior versus the prior. This makes sense. The more data we see, the more it will over power our previous beliefs.

It's interesting to consider how the frequentist would have reacted to the ten consecutive heads. The frequency of getting heads was 100%. The variance around that estimate was 0%. The Frequentist might say that they are 100% sure that they will get heads if they flip again. Does that seem right?

This problem illustrates a weakness in describing probability in terms of frequencies. As the number of trials decreases, the assumptions of Frequentist methodologies tend to break down. Now, to make sure I'm giving Frequentists a fair shake, it's worth noting that there are strategies for accounting for the degeneracies of low sample sizes (mentioned briefly in the last section) ...but in my opinion, they suspiciously resemble the injection of prior beliefs into the Fisher Information characteristic, so why not go all in on Bayes anyway?

Reference: Example by Roman Cheplyaka

Bayesians vs Frequentists (Cookie Example)

We just covered a lot of material. We dove deep into the Frequentist and Bayesian philosophies of probability. Additionally, we ran through simple coin flipping exercises to provide tagible examples of how each philosophy might be applied in practice. To make sure we really understand both approachs, we're going to run through another example.

My girlfriend loves chocolate-chip cookies! Who doesn't? She orders cookies from a particular baker that sells cookies in four different packages: A,B,C,D. Each package type contains cookies with different amounts of chocolate-chips. My girlfriend orders a new package every time we run out of cookies. Each package always has 100 cookies.

Over time, I collect data on the expectation of getting X number of chocolate-chips when I pull a random cookie out of a particular package type.

P(chips | package) A B C D
0 1 12 13 27
1 1 19 20 70
2 70 24 0 1
3 28 20 0 1
4 0 25 67 1
Total 100% 100% 100% 100%

The A package, as an example, has 70 cookies with two chips each, and no cookies with four chips or more. The D package has 70 cookies with one chip each. Since each column sums to one, each column is a probability mass function (pmf)for a particular package!

Now when my girlfriend brings home a new package of cookies I randomly take one cookie, count the chips, and try to express my uncertainty (at the 70% level) about what package she bought. I am trying to guess the identity of the package, so the package identity is the parameter being estiamted. The number of chips in the cookie I sampled is the outcome/observation.

Now the frequentist in me went straight to calculating 70% confidence intervals. Remember that chip count is my observable, so I construct my confidence intervals along the rows in my table. >=70% of the pmf of each column should be covered within my confidence intervals.

P(chips | package) A B C D
0 1 12 13 27
1 1 [19 20 70]
2 [70 24] 0 1
3 28 [20] 0 1
4 0 [25 67] 1
Total 100% 100% 100% 100%

For example, if the number of chips in the cookie I draw is 1, my confidence interval will be {B,C,D}. If the number is 4, my confidence interval will be {B,C}. Notice that since each column sums to 70% or greater, then no matter which column we are truly in (no matter which package my girlfriend bought), the interval resulting from this procedure will include the 'correct' jar with at least 70% probability based on the historical data.

Notice also that the procedure I followed in constructing the intervals had some discretion. In the column for type-B, I could have just as easily made sure that the intervals that included B would be 0,1,2,3 instead of 1,2,3,4. That would have resulted in 75% coverage for type-B jars (12+19+24+20), still meeting the lower bound of 70%.

The bayesian in me thought that this was rediculous. I was completely ignoring my girlfriend as part of the system! I should assume that the identity of the package is a random variable and that she chooses randomly when selecting packages (I have a prior belief that she doesn't care about chocolate chip density and so chooses randomly). Assuming she picks the package randomly, let's look at the joint probability of getting a particular package and pulling out a cookie wih a certain number of chips.

P(chips,package) A B C D Total
0 1/4 12/4 13/4 27/4 13.25%
1 1/4 19/4 20/4 70/4 27.50%
2 70/4 24/4 0/4 1/4 23.75%
3 28/4 20/4 0/4 1/4 12.24%
4 0/4 25/4 67/4 1/4 23.25%
Total 25% 25% 25% 25%

Before, each column was a probability mass function. Now, the whole table is the probability mass function! This happens because we now consider the probability of choosing a particular package as well as picking a particualr cookie.

I was previously looking at the conditional probability of the number of chips given the package. But what I really cared about was the conditional probability of which package she bought given the number of chips in the cookie I sampled.

Let's say that we found 3 chips in our cookie. We can ignore all rows except the 3 chip row and treat that particular row as the pmf. First, we rescale each row so the pmf sums to 100%.

P(package | chips) A B C D Total
0 1.9 22.6 24.5 50.9 100%
1 0.9 17.3 18.2 63.6 100%
2 73.7 25.3 0.0 1.1 100%
3 57.1 40.8 0.0 2.0 100%
4 0.0 26.9 72.0 1.1 100%

We've flipped the conditional probabilities we started out with! Now we're considering the probability that my girlfriend bought a particular package given the chips in the sampled cookie instead of the probability of the number of chips given a particular package.

We look at each row and highlight an interval representing 70% of that row's pmf. We call these 'credibility intervals' (the bayesian measure of uncertainty as opposed to frequentist confidence interval).

P(package | chips) A B C D Credibility
0 1.9 22.6 [24.5 50.9] 75%
1 0.9 17.3 [18.2 63.6] 82%
2 [73.7] 25.3 0.0 1.1 74%
3 [57.1 40.8] 0.0 2.0 98%
4 0.0 26.9 [72.0] 1.1 72%

Okay. Let's pause and consider the overlap of our confidence and credibility intervals.

Confidence A B C D Credibility
0 1 12 13 27 0%
1 1 [19 20 70] 99%
2 [70 24] 0 1 99%
3 28 [20] 0 1 41%
4 0 [25 67] 1 99%
Coverage 70% 88% 87% 70%
Credibility A B C D Credibility
0 1 12 [13 27] 75%
1 1 19 [20 70] 82%
2 [70] 24 0 1 74%
3 [28 20] 0 1 89%
4 0 25 [67] 1 72%
Coverage 98% 20% 100% 97%

Those confidence intervals are looking pretty crazy. For example, we don't even have a sensible prediction if we draw a cookie with no chips. Also consider the 3 chip scenario, our 70% confidence interval is only 'correct' 41% of the time, given past data...I wouldn't call that 70% confidence.

But the credibility intervals are a bit odd as well. The confidence intervals are right 70% of the time, given past data, agnostic of what package my girlfriend buys. Look at our credibility interval for package B. Our interval is wrong, based on past data, 80% of the time!

Under the bayesian approach, our mistakes will be correlated with the type of jar. If I send out 100 'Bayesian' robots to assess what type of package I have, each robot sampling one cookie, on type B days, I will expect 80 of the robots to get the wrong answer, each having >73% belief in its incorrect conclusion! That's troublesome, especially if I want most of the robots to agree on the right answer.

Plus we had to make the assumption that my girlfriend randomly picks packages! How well do I know that?

But if my assumption is correct that she does pick randomly, then type B packages happen only 25% of the time. Poor predictions for type B are balanced out by my good coverage of type A, C, and D packages. And, unlike the frequentist approach, I never publish nonsense if I pull out a zero chip cookie.

On the other hand, while it's true that my confidence interval does perform poorly when I've drawn a cookie with zero chips, so what? Chipless cookies happen, at most, 27% of the time in the worst case (a type D package). I can afford to give nonsense for this outcome because NO jar will result in a wrong answer more than 30% of the time, based on historical data.

Hopefully this example has helped explain some of the differences between the Bayesian and Frequentist approaches. If not, try re-reading the examples. These are complicated philosophies that become even more convoluted in practice. Take your time, take deep breathes, and try working through everything yourself step by step.

Reference: Example from Keith Winstein

Author's Note: Taking Sides

So right about now you might be asking, who is right? Since this course is called Bayes by Example, you might assume that I'm taking the Bayesian side. No! ... well, kind of. I definitely prefer the Bayesian approach. I'm skeptical of theoretical 'true' populations, embedded assumptions (as opposed to clearly stated priors), and unintuitive uncertainty measurements. However, Bayesian and Frequentist approaches will often yield very similar results and conclusions (if executed and interpreted correctly).

Additionally, as a physics major in a former life, I'm quite confortable with the inverse problem where we are given some data and asked to hypothesize the process by which it was generated. This slides in comfortably next to the Bayesian philosophy of hypothesizing sets (possibly infinite sets) of parameters and quantifying the probability that those parameters generated our data. We can visualize this conceptual difference below:

In machine learning the difference between the methodologies widens a bit. The bayesian philosophy has lead to radically different ways of thinking about model training and optimization, the bread and butter of machine learning, as well as ways of quantifying the uncertainties of model predictions.

At this point we drop the Frequentist point of view (with a few exceptions) and put on our Bayesian thinking caps. Throughout the rest of the course, we're going to go through a set of examples of how Bayesian statistics affects modern machine learning and data science.

In [ ]: