Bayesian Modelling - Revision
This is a summary of Bayesian probabilistic modelling; I'm reading from Bishop's Pattern Recognition and Machine Learning.
Outline
- Max Likelihood Estimate (= least Sq.)
- Bayes -> MAP (= regularised least sq.)
- Bayes -> Predictive distribution
Notation
$$ p(X=x) = p(x) $$
Models
Suppose we have some data consisting of corresponding pairs of $x$ and $y$ values. We might try to model it with a function like this:
$$ y = mx + \epsilon $$
Items in our dataset may not satisfy this equation exactly, so $\epsilon$ is a random variable that represents the error. The parameter $m$ remains to be determined. This is a probabilistic model, and by leaving $m$ undefined we create a space of linear functions.
Least Squares
One approach to choosing $m$ is to minimise the total squared error. The choice of squaring is not really justified at this point; in principle other norms could be used.
$$ m = \text{argmin}_m \sum_i (y_i - mx_i)^2 $$
We can solve this analytically by finding a zero of the derivative.
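As a minimal sketch of this (my own illustration, using NumPy and synthetic data), the derivative of the squared error is $-2\sum_i x_i(y_i - mx_i)$, and setting it to zero gives $m = \sum_i x_i y_i / \sum_i x_i^2$ for this intercept-free model:

```python
import numpy as np

def least_squares_slope(x, y):
    # Zero of the derivative of sum_i (y_i - m*x_i)^2 with respect to m:
    # m = sum(x_i * y_i) / sum(x_i^2)   (no intercept term in this model)
    return np.sum(x * y) / np.sum(x ** 2)

# Synthetic data: y = 2x plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(scale=0.1, size=x.shape)

print(least_squares_slope(x, y))  # close to 2.0
```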
Maximum Likelihood Estimate (MLE)
We can take a more probability-oriented approach like this:
$$ m = \text{argmax}_m \prod_i p(y_i | x_i) $$
or equivalently
$$ m = \text{argmin}_m -\sum_i \ln p(y_i|x_i) $$
If we choose a Gaussian distribution for $\epsilon$, the probabilistic model gives us:
$$ p(y_i|x_i) = N(mx_i, \sigma^2) $$
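Taking the negative log of this Gaussian likelihood makes the connection to least squares explicit ($N$ is the number of data points):
$$ -\sum_i \ln p(y_i|x_i) = \frac{1}{2\sigma^2} \sum_i (y_i - mx_i)^2 + \frac{N}{2}\ln(2\pi\sigma^2) $$
The second term does not depend on $m$.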
And further algebra, finding a zero of the derivative, leads to the same analytical solution as least squares. A pessimist might say we haven't gotten very far, but this probabilistic treatment isn't useless.
Distribution over Parameters
Probability can offer us more than this. Rather than having a point estimate of $m$, can we get a distribution over $m$? This could be more informative. Fundamentally, probabilistic modelling relates model parameters $w$ (here $m$) to data $D$ (here the $x,y$ pairs). In this general notation, what we maximised before was the likelihood of the data:
$$ \prod_i p(y_i | x_i) = p(D|w) $$
Bayes rule allows us to change the subject of the expression:
$$ p(w|D) = \frac{p(D|w)p(w)}{p(D)} \propto p(D|w)p(w) $$
Maximum a Posteriori Estimate (MAP)
Instead, we can maximise the right-hand side to find $m$, thereby avoiding having to work out the normalisation constant $p(D)$.
$$ m = \text{argmax}_m \big( \prod_i p(y_i | x_i)\big)p(m) $$
We can choose a Gaussian "prior" $p(m)$; this will cause the model to treat values of $m$ closer to zero as more likely. Pursuing the same algebra as before (negative log likelihood, and so on), you will find there is an extra term. When performing the optimisation this term penalises large values of the weights.
This is equivalent to performing least squares while including what is known as a regulariser in the objective function, which prevents overfitting by discouraging excessively large parameter values.
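As a worked example, writing the Gaussian prior as $\mathcal{N}(m \mid 0, \sigma_m^2)$ (the variance symbol is my own notation), the negative log posterior is, up to constants that don't depend on $m$:
$$ \frac{1}{2\sigma^2} \sum_i (y_i - mx_i)^2 + \frac{m^2}{2\sigma_m^2} $$
Setting the derivative to zero gives $m = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sigma^2/\sigma_m^2}$, which is the regularised least squares solution with regularisation strength $\lambda = \sigma^2/\sigma_m^2$.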
Still, we've only achieved another result attainable via least squares. But now two seemingly unrelated least squares techniques, squared error and regularisation, have been neatly explained using a fully probabilistic approach.
Predictive Distribution
Using the general notation again, if we can normalise $p(D|w)p(w)$ to obtain $p(w|D)$, we are in a position to keep propagating probabilities through our modelling, all the way to our prediction. Rather than using a point estimate of the parameter $m$, we can keep $m$ as a distribution and try to derive $p(y'|x',D)$, where $y'$ and $x'$ together form a new data point.
Applying the sum and product rules:
$$ p(y'|x',D) = \int p(y'|x',w)p(w|D) dw $$
Now, in the notation for our model:
$$ p(y'|x',D) = \int p(y'|x',m)p(m|D) dm $$
What this represents is a predictive distribution which does not come from just one unique value of $m$. Instead, contributions to the distribution have been weighted according to how likely each value of $m$ is given the data. This has distinct advantages over point estimating the model parameters.
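Here is a sketch of what the integral works out to in this conjugate Gaussian case (my own illustration; the function names and default variances are arbitrary, and both the noise variance and the prior variance are assumed known). Both the posterior over $m$ and the predictive for a new $x'$ are Gaussian with closed-form parameters:

```python
import numpy as np

def posterior_over_m(x, y, noise_var=0.01, prior_var=1.0):
    """Gaussian posterior p(m|D) for y = m*x + eps, eps ~ N(0, noise_var),
    with prior m ~ N(0, prior_var). Returns (mean, variance)."""
    precision = np.sum(x ** 2) / noise_var + 1.0 / prior_var
    mean = (np.sum(x * y) / noise_var) / precision
    return mean, 1.0 / precision

def predictive(x_new, x, y, noise_var=0.01, prior_var=1.0):
    """Predictive p(y'|x',D), obtained by integrating the likelihood
    against the posterior over m. Returns (mean, variance)."""
    m_mean, m_var = posterior_over_m(x, y, noise_var, prior_var)
    mean = m_mean * x_new
    var = noise_var + (x_new ** 2) * m_var  # noise plus parameter uncertainty
    return mean, var
```

The predictive variance is the noise variance plus a term that grows with the remaining uncertainty in $m$, which is the point made in the next section.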
Advantages of Distributions
Distributions give you uncertainty estimates: if the model fits the data poorly, the prediction for a new value will have a large variance. The predictive distribution can also be more accurate than a point estimate. An exercise for the reader, or for myself in a month's time: can you demonstrate it?
Further Bayesian Modelling
We did not introduce any non-linearity into the model in this article. Those familiar with the topic will know that you can "augment" the input $x$ values with non-linear functions of $x$, enabling non-linear modelling.
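As a sketch of that idea, replacing the scalar input with a vector of basis functions $\boldsymbol{\phi}(x)$ (polynomial or RBF features, for example) gives
$$ y = \mathbf{w}^\top \boldsymbol{\phi}(x) + \epsilon $$
which is non-linear in $x$ but still linear in the weights $\mathbf{w}$, so the derivations above carry over unchanged.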
This construction leads naturally toward Gaussian Processes, particularly when using a squared exponential (RBF) kernel.