Introduction
The loglikelihood function \(l\) is a function such that:
\[\mathcal l(\theta; x) = ln \; p_{\mathcal M}(x\theta)\]… where p is the generalized density under a model, evaluated on the dataset.
The loglikelihood is an interesting beast. It not only captures the complete model law and the fed dataset (or a sufficient statistic of it), but it also has deep connections with the ideas of information entropy. It’s exceptionally wellbehaved (numerically) and many modern model comparison methods such as the evidence measure its value or an adjusted (bias? corrected) expectation of it. Typically with i.i.d. experiments, it’s approximately normally distributed with an expectation equal to the model entropy.
Loss Functions
It’s quite easy to see why the loglikelihood enforces a model. While optimizing literally anything, one usually has a “loss” containing a function of data and parameters, and another that’s a function only of the parameters. Everything else disappears when a derivative is calculated, and this raises an interesting question. If every loss can be written as:
\[ \mathcal L(\theta; x) = f(x;\theta) + g(\theta)\]… then what really is the difference between a loss function and a negative log posterior? As far as I’ve been able to tell, there’s no difference (unless we’re talking about a decisiontheoretic loss, which is a whole entire matter).
Case in point: Regularization
Adding a penalty of:
\[k  \underline \beta ^p\]… to the regression framework leads to “insignificant” parameters being forced down/close to zero (this actually introduces bias, but biased estimators can have a lower MSE than unbiased estimators). From an optimization perspective, \(k\) is a Lagrangian multiplier; for each value it takes, a different curve centered on \((0, 0, ...)\) is enforced and a value on the likelihood g is found such that the boundary enforcement is met:
… but this can be interpret as placing strong Laplacian/Gaussian priors on the parameters and obtaining a MAP solution.
Model Laws
By constructing logical model laws, one can obtain amazing results with probability theory. Examples:

We can reason about the probability of failure in a repeated Bernoulli experiment even if no failures are observed! Assume that a process represented by \(X \sim Bin(n, p)\) produces a set of \(n\) successes. We can still construct a posterior by writing out the likelihood of this model. This would obviously include 0 (and the highest mass would be there) but an interval with a 5% error rate would be \([0, 3/n]\).

When dealing with censored data, we can reframe the problem and derive a likelihood. Here’s my attempt at that:
Let a random variable \(\tau\) have a cumulative distribution function \(F_{\tau}\).
Now, let \(X\) be a random variable such that:
\[X = \begin{cases} \tau, & \text{if $\tau \leq T$} \\ T, & \text{if $\tau \geq T$} \end{cases}\]Then, it’s easy to see that the cumulative distribution function of \(X\) is:
\[F_X(x) = \begin{cases} F_{\tau}(x), & \text{if $x < T$} \\ 1, & \text{if $x \geq T$} \end{cases}\]Let \(\nu = l + \delta_T\) be a reference measure that is the sum of the Lebesgue measure \(l\) and the Dirac measure \(\delta_T\) on the set \(\{T\}\). Then, the density on \(X\) will be a function \(h\) (the RadonNikodym derivative) such that:
\[F_X(x) = \int_{\infty}^x h d\nu = \int_{\infty}^x h(u) du * \mathcal I_{x < T} + h(T) \mathcal I_{x = T}\]A function that satisfies the form above is:
\[h(u) = \frac{dF_{\tau}}{du}(u) \mathcal I_{u < T} + (1  F_{\tau}(T)) \mathcal I_{u = T}\]
Dimensionality
Volumes in High Dimensions
A lot of probability theory uses the idea of volumes. For example, the probability and expectation are usually defined as:
\[\mathbb P_{\mu} (\mathcal A) = \int_A d \mu \;\;\; ; \;\;\; \mathbb E_{\mu} (f(X)) = \int_{\chi} f(x) d \mu(x)\]… where the differential element is usually a volume, \(d \mu(x) = p(x) dx_1 ... dx_d\), and p is a density, but volumes are really weird in higher dimensions.
The main idea is that our ideas of distance no longer make sense. Consider the following: I simulated a bunch of random normal points in many dimensions and plotted the average ratio of the distance to nearest point to the distance to farthest point:
This ratio tends to one, implying that the nearest and farthest points are very “close” together / all points “look alike”.
On the same subject, consider a cube of unit lengths containing a sphere of unit radius in higher dimensions:
The edges of the cube keep more and more volume away from the sphere. In fact  the volume of a sphere starts decreasing after 5 dimensions.
All of this discussion has a huge implication on computational statistics, as all of our sampling methods rely on a density defined on some volume. Moreover, this has many other areas of effect, for example, in optimization and cluster analysis as points start getting indistinguishable as we move into higher dimensions.
Effect on Likelihoods
This discussion leads to one very interesting observation: a probability is not a good measure of typicality outside of the first dimension.
Consider the following: In one dimension, if I draw from a normal distribution, numbers close to zero would be the most reasonable guess at what the draw could be. In higher dimensions, this isn’t really the case  if I draw from a 1000dimensional normal, it’s ridiculously unlikely that I’d pick a vector that (0, 0, 0, …); a more reasonable guess would be “a vector of points where the mean radius is ye much”.
A very good analogy I’ve read on CrossValidated is: let’s say that the weights, heights, eyesight, athleticism, intelligence, … of people is normally distributed. If I picked a random guy on the street and asked him his weight, it would be reasonable to expect that it’d be around the average. Instead, if I asked the person about their height, weight, eyesight …, it’s very unlikely to meet a person who’s average at everything. I’m a bit more heavy, averagely tall, more booksey … than other people, for example.
Now consider the likelihood of an isolated repeated experiment:
\[L(\theta  x) = \prod_i^n p(x_i\theta)\]This is a function of many random variables (which collectively have a dimension n). This will have it’s own sampling distribution, and the position where it’s maxmized w.r.t. any of the x’s isn’t a typical value of the likelihood at all.
Example:
The maximum possible value of the loglikelihood in this case is 91.9, although it’s always much lower than that. Loglikelihood values around 140 for this experiment are typical.
We call values that lie in this region of typicality the typical set and it’s geometry is expected to be weird. From the other results, we expect it to be a thin layer of mass lying on a shell somewhere in a high dimensional space.
From Andrew Gelman’s blog: Interpolating between any two values of the typical set is rather unlikely to result in another typical sample; this is apparently why optimizers and “ensemble methods” (ones which initialize a bunch of points and move them around till they look like a sample from a density) are a bad way to sample and fail in higher dimensional spaces. This is why HMC is said to be so powerful; to sample from an arbitrary density, we can’t use grid or ensemble based methods, so a logical alternative is to start with a sample and obtain another sample, which leads logically to the idea of Markov Chain Monte Carlo, at which the Hybrid/Hamiltonian method is rather efficient.
Michael Betancourt has an even more interesting example; instead of sampling, Betancourt first reparameterizes a target density and integrates out all of the nuisance parameters to obtain a marginal distribution of a set of normals w.r.t. their radius. See the booklet “Logic, Probability, and Bayesian Inference” by Betancourt & the Stan Dev Team.
I believe that the sampling distribution of a model’s loglikelihood can be very interesting as a comparison / validation tool. I believe that there’s a link between MacKay’s Bayesian Occam Razor and these sampling distributions.
R Code for Dimensionality Image
2021
Efficient Gaussian Process Computation
I’ll try to give examples of efficient gaussian process computation here, like the vec trick (Kronecker product trick), efficient toeliptz and circulant matrix computations, RTS smoothing and Kalman filtering using state space representations, and so on.
Gaussian Processes in MGCV
I lay out the canonical GP interpretation of MGCV’s GAM parameters here. Prof. Wood updated the package with stationary GP smooths after a request. I’ve run through the predict.gam
source code in a debugger, and mainly, the computation of predictions follows:
Random Projects
Random Projects
Photogrammetry
I wanted to see how easy it was to do photogrammetry (create 3d models using photos) using PyTorch3D by Facebook AI Research.
Dead Code & Syntax Trees
This post was motivated by some R code that I came across (over a thousand lines of it) with a bunch of ifstatements that were never called. I wanted an automatic way to get a minimal reproducing example of a test from this file. While reading about how to do this, I came across Dead Code Elimination, which kills unused and unreachable code and variables as an example.
2020
Astrophotography
I used to do a fair bit of astrophotography in university  it’s harder to find good skies now living in the city. Here are some of my old pictures. I’ve kept making rookie mistakes (too much ISO, not much exposure time, using a slow lens, bad stacking, …), for that I apologize!
Probabilistic PCA
I’ve been reading about PPCA, and this post summarizes my understanding of it. I took a lot of this from Pattern Recognition and Machine Learning by Bishop.
Modelling with Spotify Data
The main objective of this post was just to write about my typical workflow and views rather than come up with a great model. The structure of this data is also outside my immediate domain so I thought it’d be fun to write up a small diary on making a model with it.
Random Stuff
Random Stuff
Morphing with GPs
The main aim here was to morph space inside a square but such that the transformation preserves some kind of ordering of the points. I wanted to use it to generate some random graphs on a flat surface and introduce spatial deformation to make the graphs more interesting.
SEIR Models
I had a go at a few SEIR models, this is a rough diary of the process.
Speech Synthesis
The initial aim here was to model speech samples as realizations of a Gaussian process with some appropriate covariance function, by conditioning on the spectrogram. I fit a spectral mixture kernel to segments of audio data and concatenated the segments to obtain the full waveform. Partway into writing efficient sampling code (generating waveforms using the Gaussian process state space representation), I realized that it’s actually quite easy to obtain waveforms if you’ve already got a spectrogram.
Sparse Gaussian Processes
Minimal Working Examples
2019
Gaussian Process Middle C
First of my experiments on audio modeling using Gaussian processes. Here, I construct a GP that, when sampled, plays middle c the way a grand piano would.
An IsingLike Model
… using Stan & HMC
Stochastic Bernoulli Probabilities
Consider: