Up until now, we have discussed parametric models, meaning that we have always made some assumptions on the data generating distribution.
In particular, by assuming that the distribution belongs to a particular family, indexed by a parameter, we turn the difficult problem of estimating the distribution into the simpler problem of estimating a parameter.
Parametric vs nonparametric inference
Sometimes, we may not have enough information to assume a specific parametric form for our data generating distribution.
Alternatively, we may want to obtain results that do not depend on the specific distribution that we assume.
In such cases, we can rely on nonparametric techniques, such as the bootstrap and permutations.
The Bootstrap
The Surprising Adventures of Baron Munchausen (1781)
Frequentist estimates
When we think about an estimator in statistics, we think about a function of the data that has some desirable properties in a frequentist sense.
For instance, the distribution of an estimator can be interpreted in this way: if we repeat the same experiment an infinite number of times we will get different estimate values, which taken together give us an idea of how variable our estimate is.
Frequentist estimates
If we know the data generating distribution \(F(x)\), we can compute the standard deviation of the estimator by repeatedly sampling new data \(x\).
In practice this is impossible because we do not know the distribution \(F\).
The (nonparametric) Bootstrap
The main idea behind the nonparametric bootstrap is to substitute an estimate \(\hat{F}_n(x)\) for the unknown \(F(x)\) and then estimate the standard deviation by sampling multiple times from \(\hat{F}_n(x)\).
This is done by creating a bootstrap sample \[
x^* = (x^*_1, x^*_2, \ldots, x^*_n),
\] from the original data \(x = (x_1, x_2, \ldots, x_n)\), where each \(x^*_i\) is drawn randomly with equal probability and with replacement from \(\{x_1, x_2, \ldots, x_n\}\).
The (nonparametric) Bootstrap
Given a sample of \(n\) observations, the bootstrap proceeds as follows:
We independently generate a large number \(B\) of bootstrap samples;
We compute for each bootstrap sample the estimate of the parameter, \(\hat \theta^*_b\), for \(b=1, \ldots, B\).
We compute the empirical standard deviation of the \(B\) values \(\hat \theta^*_b\): \[
\widehat{se}_B = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left( \hat\theta^*_b - \bar\theta^* \right)^2}, \qquad \bar\theta^* = \frac{1}{B} \sum_{b=1}^{B} \hat\theta^*_b.
\]
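A minimal sketch of this procedure in Python (the data, the choice of the median as the statistic \(t\), and the value of \(B\) are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data and statistic: any sample x and estimator t() would do.
x = rng.normal(loc=5.0, scale=2.0, size=50)
t = np.median

B = 2000                          # number of bootstrap samples
n = len(x)
theta_star = np.empty(B)
for b in range(B):
    # Draw n observations with equal probability and with replacement.
    x_star = rng.choice(x, size=n, replace=True)
    theta_star[b] = t(x_star)

# Empirical standard deviation of the B bootstrap estimates.
se_boot = theta_star.std(ddof=1)
print(se_boot)
```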
We notice that to obtain our estimate of \(\theta\), we proceed in two steps:
\(x\) is generated by sampling independently from the data generating distribution \(F(x);\)
\(\hat \theta\) is calculated from \(x\) according to a certain function \(t\).
Schematically: \[
F \to x \to \hat\theta.
\]
Motivation
We don’t know \(F\), but we can estimate it by the empirical distribution of the observed data, \(\hat{F}_n\), which puts probability \(1/n\) on each observed point \(x_i\):
\(x^*\) is generated by sampling independently from the empirical distribution \(\hat{F}_n(x);\)
\(\hat \theta^*\) is calculated from \(x^*\) according to the same function \(t\).
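Schematically, mirroring the real-world diagram above: \[
\hat{F}_n \to x^* \to \hat\theta^*.
\]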
In the real world we get to see only one draw \(x\) from \(F\) and hence only one value \(\hat\theta\).
In the bootstrap world we can generate as many draws as we want from \(\hat{F}_n\) and observe the distribution of \(\hat\theta^*.\)
Example: Haplotype frequencies
A haplotype is a collection of alleles (DNA sequence variants) that are spatially adjacent on a chromosome, are usually inherited together (recombination tends not to disconnect them), and thus are genetically linked. In this case we are looking at linked variants on the Y chromosome.
We are interested in the frequencies of particular Y-haplotypes that consist of a set of different short tandem repeats (STRs). The combinations of STRs at the specific locations used in DNA forensics are labeled by the number of repeats at those positions.
For instance, haplotype H1 has 14 repeats at position DYS19, 12 repeats at position DXYS156Y, etc.
Suppose we want to find the underlying proportion \(\theta\) of a particular haplotype in a population of interest, by haplotyping \(n=300\) men; and suppose we found H1 in \(y = 40\) of them.
Parametric approach
We can assume that the data generating distribution is \(Y \sim \mathcal{Bin}(n, \theta)\).
By now we know that the maximum likelihood estimate of \(\theta\) is \(\hat{\theta} = 40/300 \approx 0.133\).
We can compute its standard error \[
se(\hat\theta) = \sqrt{\hat\theta(1-\hat\theta)/n} \approx 0.0196
\]
The bootstrap distribution gives us a natural way to estimate a confidence interval: we simply look at its empirical quantiles.
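A minimal sketch of this idea for the haplotype example, assuming we encode the \(n = 300\) men as a 0/1 vector with \(y = 40\) ones (the percentile interval shown is one common choice of bootstrap interval):

```python
import numpy as np

rng = np.random.default_rng(0)

n, y = 300, 40
x = np.array([1] * y + [0] * (n - y))    # 1 = carries haplotype H1

B = 5000
theta_star = np.array([rng.choice(x, size=n, replace=True).mean()
                       for _ in range(B)])

se_boot = theta_star.std(ddof=1)                          # close to 0.0196
ci_low, ci_high = np.percentile(theta_star, [2.5, 97.5])  # percentile interval
print(se_boot, ci_low, ci_high)
```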
The statistical methods discussed so far are known as frequentist.
The frequentist point of view is based on the following ideas:
Probabilities are objective properties of the real world: they are defined as the limit of relative frequencies in infinitely many trials.
Parameters are fixed, unknown constants.
As a consequence, frequentist methods are designed to have good long-run frequency properties.
Example
A \(95\%\) confidence interval is defined so that it includes the true value of the parameter \(95\%\) of the time if we repeat the experiment an infinite number of times.
Note that it is the interval, not the parameter, that is random!
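A small simulation can illustrate this long-run property; the sketch below repeatedly draws binomial data with a known \(\theta\) (all the numbers are arbitrary choices) and checks how often the normal-approximation interval covers it:

```python
import numpy as np

rng = np.random.default_rng(0)

theta_true, n, reps = 0.3, 200, 10_000
covered = 0
for _ in range(reps):
    y = rng.binomial(n, theta_true)
    theta_hat = y / n
    se = np.sqrt(theta_hat * (1 - theta_hat) / n)
    # Normal-approximation 95% confidence interval.
    covered += (theta_hat - 1.96 * se <= theta_true <= theta_hat + 1.96 * se)

print(covered / reps)   # should be close to 0.95
```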
The Bayesian Philosophy
The Bayesian approach takes a very different point of view:
Probabilities describe a degree of belief: they are not objective properties of the world, but rather subjective propositions.
Parameters are random variables: there is no such thing as the true parameter, all we know is the probability distribution of its values.
As a consequence, inference in a Bayesian setting takes the form of an “updated” distribution of the parameter: we can then extract point estimates and intervals from this distribution.
Model specification
In frequentist inference, we need a model for the data generating distribution (i.e., the likelihood).
In the Bayesian framework, in addition to the likelihood, we need to specify a prior distribution for the parameter(s).
This prior distribution expresses our prior belief about what the value of the parameter should be.
Inference is carried out by updating the prior distribution with the new knowledge that we observed in the data, to form the posterior distribution.
Bayes’ Theorem
Recall Bayes’ Theorem as a way to “invert” the conditional probability: \[
P(A | B) = \frac{P(B | A) P (A)}{P(B)}
\]
Example: births in Paris
Laplace was the first to formulate a statistical problem in Bayesian terms.
The question he posed was whether or not the probability of a female birth is lower than 0.5.
Data: in Paris, from 1745 to 1770 there were 493,472 births, of which 241,945 were girls.
A frequentist solution
We can assume a binomial model and get a frequentist estimate for the probability of a female birth as \(\hat\theta = \frac{241,945}{493,472} = 0.4903\)
We can also compute a confidence interval \([0.4889, 0.4917]\)
Finally, we can test the null hypothesis \(H_0: \theta = 0.5\) and reject it as the obtained p-value is close to 0.
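As a quick numerical check of these figures (a sketch using the normal approximation; scipy is assumed to be available):

```python
import numpy as np
from scipy.stats import norm

n, y = 493_472, 241_945
theta_hat = y / n                                     # about 0.4903
se = np.sqrt(theta_hat * (1 - theta_hat) / n)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)   # about (0.4889, 0.4917)

# Two-sided test of H0: theta = 0.5 via the normal approximation.
z = (theta_hat - 0.5) / np.sqrt(0.5 * 0.5 / n)
p_value = 2 * norm.sf(abs(z))
print(theta_hat, ci, p_value)                         # p-value is essentially 0
```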
A Bayesian solution
Laplace argues that he has no reason to a priori believe that any value of \(\theta\) is more likely than others.
Hence, the prior distribution is uniform: \(\theta \sim \mathcal{U}(0, 1)\).
We can prove that in this case the posterior distribution is \(\theta | y \sim \mathcal{Beta}(y + 1, n - y + 1)\).
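A sketch of the resulting computation (assuming scipy): under the uniform prior the posterior is \(\mathcal{Beta}(y + 1, n - y + 1)\), and the posterior probability that \(\theta \ge 0.5\) turns out to be vanishingly small.

```python
from scipy.stats import beta

n, y = 493_472, 241_945
posterior = beta(y + 1, n - y + 1)   # posterior under a uniform prior

print(posterior.mean())              # posterior mean, about 0.4903
print(posterior.sf(0.5))             # P(theta >= 0.5 | data), essentially 0
```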
In statistics, we can rethink Bayes’ Theorem in terms of the data and the parameter: \[
p(\theta | y) = \frac{p(\theta) \, p(y | \theta)}{p(y)},
\] where:
\(p(\theta)\) is the prior distribution of the parameter
\(p(y | \theta)\) is the distribution of the data given a particular value of the parameter (i.e., the likelihood)
\(p(y)\) is the marginal distribution of the data (across all possible values of the parameter)
Note
Much of Bayesian statistics deals with ways to compute/approximate \(p(y)\).
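Concretely, for a continuous parameter the marginal is obtained by integrating the numerator over all values of \(\theta\): \[
p(y) = \int p(\theta) \, p(y | \theta) \, d\theta,
\] and it is precisely this integral that often has no closed form and must be approximated.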
Example: Haplotype frequencies
A haplotype is a collection of alleles (DNA sequence variants) that are spatially adjacent on a chromosome, are usually inherited together (recombination tends not to disconnect them), and thus are genetically linked. In this case we are looking at linked variants on the Y chromosome.
We are interested in the frequencies of particular Y-haplotypes that consist of a set of different short tandem repeats (STRs). The combinations of STRs at the specific locations used in DNA forensics are labeled by the number of repeats at those positions.
For instance, haplotype H1 has 14 repeats at position DYS19, 12 repeats at position DXYS156Y, etc.
Suppose we want to find the underlying proportion \(\theta\) of a particular haplotype in a population of interest, by haplotyping \(n=300\) men; and suppose we found H1 in \(y = 40\) of them.
Frequentist approach
We can assume that the data generating distribution is \(Y \sim \mathcal{Bin}(n, \theta)\), with \(\theta\) fixed and unknown.
By now we know that the maximum likelihood estimate of \(\theta\) is \(\hat{\theta} = 40/300 \approx 0.133\).
We can compute confidence intervals and test hypotheses on the true value \(\theta\).
Bayesian approach
In a Bayesian setting, \(\theta\) is not fixed, and we need to assume a prior probability distribution for the parameter as well.
In principle we could use any distribution.
It is convenient to specify a Beta distribution, i.e.
\[
p_{\alpha, \beta}(x) = \frac{x^{\alpha-1} (1-x)^{\beta-1}}{B(\alpha, \beta)},
\] where \(B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)}\) is called the beta function.
Prior distribution
We can see how the prior distribution depends on two parameters, in this case \(\alpha\) and \(\beta\).
These are known as hyper-parameters and are typically chosen according to some prior information.
The choice of a Beta distribution in this case is mathematically convenient, as one can show that the posterior distribution will be again a Beta distribution.
We say that the Beta distribution is the conjugate prior for the binomial model.
How to choose the hyperparameters
The hyperparameters are our way to express our prior knowledge of the problem.
An intuitive way is to think about the prior mean and variance: the mean expresses the value we expect before observing the data, and the variance expresses how confident we are in that expectation.
The Beta distribution is quite flexible in that it includes the uniform distribution (absence of prior information) as well as very concentrated distributions.
How to choose the hyperparameters
If \(X \sim \mathcal{Beta}(\alpha, \beta)\), then \[
E[X] = \frac{\alpha}{\alpha + \beta}
\] and \[
Var[X] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.
\]
We have no prior knowledge on the haplotype frequencies in the population: we can choose a uniform distribution.
We know from the literature that one in five men should have haplotype H1, but we are not very confident.
We are very confident that one in five men has haplotype H1.
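As a sketch of what such choices could look like, the \((\alpha, \beta)\) pairs below are purely illustrative (they are not prescribed by the example), but they show how the prior mean and standard deviation reflect each of the three scenarios above:

```python
from scipy.stats import beta

# Illustrative hyperparameter choices for the three scenarios above.
priors = {
    "no prior knowledge":          (1, 1),    # uniform on (0, 1)
    "1 in 5, not very confident":  (2, 8),    # mean 0.2, fairly spread out
    "1 in 5, very confident":      (20, 80),  # mean 0.2, concentrated
}

for label, (a, b) in priors.items():
    d = beta(a, b)
    print(f"{label}: mean = {d.mean():.2f}, sd = {d.std():.3f}")
```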
Example: Prior distribution
The posterior distribution
If the prior is Beta and the likelihood is Binomial, the posterior distribution is still Beta, but with updated parameters: \[
\theta | y \sim \mathcal{Beta}(\alpha + y, \beta + n - y).
\]
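A minimal sketch of this update for the haplotype data, taking a uniform \(\mathcal{Beta}(1, 1)\) prior as an illustrative choice:

```python
from scipy.stats import beta

n, y = 300, 40
a0, b0 = 1, 1                           # illustrative uniform prior

posterior = beta(a0 + y, b0 + n - y)    # conjugate update: Beta(41, 261)
print(posterior.mean())                 # posterior mean, about 0.136
print(posterior.interval(0.95))         # 95% equal-tailed credible interval
```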
Many of the concepts that we have seen in this example extend to many other distributions!
For instance if we have Gaussian data, a Gaussian prior on the mean will lead to a Gaussian posterior.
This is true for the linear model too!
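As a sketch of the Gaussian case: assuming data \(y_1, \ldots, y_n \sim \mathcal{N}(\mu, \sigma^2)\) with \(\sigma^2\) known and a prior \(\mu \sim \mathcal{N}(\mu_0, \tau_0^2)\), the posterior is again Gaussian, with mean a precision-weighted average of the prior mean and the sample mean: \[
\mu | y \sim \mathcal{N}\left( \frac{\mu_0 / \tau_0^2 + n \bar{y} / \sigma^2}{1 / \tau_0^2 + n / \sigma^2}, \; \left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right)^{-1} \right).
\]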
Not just conjugate priors
There are many different choices of prior distributions, and when closed form solutions do not exist to compute the posterior, we need to use numerical approximations or simulation approaches.
Markov Chain Monte Carlo (MCMC) and Variational Approximations are perhaps the two most popular techniques.
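As a toy sketch of the MCMC idea, the random-walk Metropolis sampler below targets the Beta–Binomial posterior from the haplotype example; MCMC is of course not needed there (the posterior is available in closed form), but the conjugate case lets us check the sampler against the exact answer.

```python
import numpy as np

rng = np.random.default_rng(0)

n, y = 300, 40
a0, b0 = 1, 1                      # uniform Beta(1, 1) prior, as before

def log_posterior(theta):
    """Unnormalized log posterior: Beta(a0, b0) prior times Binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return -np.inf
    return (a0 + y - 1) * np.log(theta) + (b0 + n - y - 1) * np.log(1 - theta)

draws = np.empty(20_000)
theta = 0.5                                         # arbitrary starting value
for i in range(draws.size):
    proposal = theta + rng.normal(scale=0.05)       # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal                            # accept; otherwise keep theta
    draws[i] = theta

sample = draws[5_000:]                              # discard burn-in
# Should be close to the exact Beta(41, 261) posterior: mean about 0.136.
print(sample.mean(), np.percentile(sample, [2.5, 97.5]))
```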