Introduction
The Gaussian Process is an incredibly interesting object.
Paraphrasing Rasmussen & Williams (2006): a Gaussian Process is a stochastic process (a collection of random variables, indexed on some set) any finite collection of which has a multivariate normal distribution.
Sample paths of GPs can represent functions, and GPs admit several complementary views, e.g. as a distribution specified by a covariance function or as a weighted combination of basis functions.
An MVN is completely characterized by its mean and covariance, and in the case of a GP, we can use the covariance to specify what sort of function space we want to work in. For example, to create a representation of a continuous function, let \(y\) be a vector of function values corresponding to domain points \(x_1, \dots, x_n\). To enforce continuity, all we need is:
\[\mathrm{Cor}(y_i, y_j) \to 1 \; \text{as} \; \|x_i - x_j\| \to 0.\]Covariance between points is usually specified using kernels; for example, this is the RBF kernel:
\[\mathrm{Cov}(y_i, y_j) = f(\|x_i - x_j\|) = \exp \left[ -\frac{1}{2} \sum_m \left( x_{m, i} - x_{m, j} \right)^2 \right]\]This can be multiplied by a variance parameter (hence scaling the function along the y-axis), and a scale (the length-scale) can be applied to the domain variables. The function enforced by this kernel is infinitely differentiable, because the kernel itself is infinitely differentiable.
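Concretely, here is a minimal sketch of this kernel with the two scale parameters (my own helper, not part of the original post; the simulations below use sklearn's RBF instead):

import numpy as np
def scaled_rbf(xi, xj, sigma2 = 1.0, lengthscale = 1.0):
    # sigma2 scales the function along the y-axis; lengthscale rescales the domain
    d2 = np.square(xi.reshape(-1, 1) - xj.reshape(1, -1))  # pairwise squared distances (1-D)
    return sigma2 * np.exp(-d2 / (2.0 * lengthscale ** 2))

With the default arguments, this coincides with sklearn's RBF kernel used below.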
Knowing a few (“training”) points, one can predict the function at points in between (i.e. smoothing) by using the fact that, if \(x_2\) denotes the values at the training points, the conditional distribution of the predicted values is also multivariate normal:
\[\begin{bmatrix}x_1\\x_2\end{bmatrix} \sim \mathcal N \left(\begin{bmatrix}\mu_1\\\mu_2\end{bmatrix}, \begin{bmatrix}\Sigma_{11} & \Sigma_{12}\\\Sigma_{21} & \Sigma_{22}\end{bmatrix} \right)\] \[\Rightarrow x_1|x_2 \sim \mathcal N \left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)\]In this case, \(\Sigma_{ij} = k({\bf x_i}, {\bf x_j})\). Interestingly, since differentiation is a linear operator, the derivative of a GP is also a GP (assuming that the mean and covariance functions are differentiable), in which case the joint distribution is:
\[\begin{bmatrix}x\\\frac{dx}{dt} \end{bmatrix} \sim \mathcal N \left(\begin{bmatrix}\mu\\ \frac{\partial \mu}{\partial t} \end{bmatrix}, \begin{bmatrix} K(t_i, t_j) & \frac{\partial K(t_i, t_j)}{\partial t_j} \\ \frac{\partial K(t_i, t_j)}{\partial t_i} & \frac{\partial^2 K(t_i, t_j)}{\partial t_i \partial t_j} \end{bmatrix} \right)\]
Simulations
Prior Draw from a Gaussian Process
GP Prior Draw Code
# python / sklearn used to calculate the covariance matrix for efficiency
library(ggplot2)
library(mvtnorm)
library(reticulate) # lets one use Python within R
k <- import("sklearn.gaussian_process.kernels")
x <- seq(-1, 1, length.out = 100)
dim(x) <- c(100, 1) # necessary to specify that the dimension of each x is 1
S <- k$RBF()(x, x) # RBF kernel with default length scale, 1., evaluated at x
y <- rmvnorm(1, sigma = S) # sample
x <- as.numeric(x) # drop the matrix dimensions for plotting
y <- as.numeric(y) # drop the matrix dimensions for plotting
qplot(x, y)

Posterior of a Gaussian Process
We fit a GP with an RBF kernel and length-scale 1.0 to the \((x, y)\) points \((-0.5, -1)\), \((0, 0.5)\), \((0.5, 0)\).
GP Posterior Code
# python / sklearn used to calculate the covariance matrix for efficiency
library(ggplot2)
library(mvtnorm)
library(reticulate) # lets one use Python within R
k <- import("sklearn.gaussian_process.kernels")
x_test <- seq(-1, 1, length.out = 97) # prediction points
x_train <- c(-0.5, 0, 0.5) # training points
x <- c(x_test, x_train) # concatenate them according to the convention above
dim(x) <- c(length(x_test) + length(x_train), 1) # tell sklearn that the dimension of each x is 1
S <- k$RBF()(x, x) # kernel evaluated at all pairs of the 100 points, giving a 100x100 covariance matrix
# following the convention of the formulae above:
S11 <- S[1:length(x_test), 1:length(x_test)]
S12 <- S[1:length(x_test), (length(x_test) + 1):length(x)]
S21 <- S[(length(x_test) + 1):length(x), 1:length(x_test)]
S22 <- S[(length(x_test) + 1):length(x), (length(x_test) + 1):length(x)]
y_obs <- matrix(c(-1, 0.5, 0), 3, 1) # this is x_2 in the equation above
mu <- 0 + S12 %*% solve(S22) %*% (y_obs - 0)
sigma <- S11 - S12 %*% solve(S22) %*% S21
mu <- as.numeric(mu)
sigma <- sqrt(diag(sigma)) # obtain point-wise posterior standard deviations
qplot(x = x_test, y = mu, geom = "line", color = "predicted") +
geom_point(aes(x, y, color = "training"), data = data.frame(x = x_train, y = as.numeric(y_obs))) +
geom_ribbon(aes(ymin = mu - 1.96*sigma, ymax = mu + 1.96*sigma), alpha = 0.25) + # 95% band
labs(x = "x", y = "y")

Derivatives of the GP
I’ve written a piece of code to calculate the derivative kernels efficiently, but I’m still testing it:
GP Derivative Covariances (1-D)
#!/usr/bin/env python3
from autograd import numpy as np
from autograd import elementwise_grad as egrad

def K(xi, xj):
    """Calculates the kernel matrix.

    The domain values needn't be the same.
    Currently, the only supported dimension of each value of x is one.

    Args:
        xi: Sequence of domain values corresponding to the points changing over rows.
        xj: Sequence of domain values corresponding to the points changing over columns.

    Returns:
        numpy.array corresponding to the kernel values applied over the grid (xi, xj).
    """
    if type(xi) is not float and type(xj) is not float:  # if vectorization is possible:
        xi = xi.reshape(-1, 1)  # reshape as a column vector
        xj = xj.reshape(1, -1)  # reshape as a row vector
    return np.exp(-np.square(xi - xj))  # RBF kernel evaluated over the grid

def dK(xi, xj):
    """Returns the first kernel derivative matrix w.r.t. the second index.

    Args:
        xi: Sequence of domain values corresponding to the points changing over rows.
        xj: Sequence of domain values corresponding to the points changing over columns.

    Returns:
        numpy.array corresponding to the kernel derivative values applied over the grid (xi, xj).
    """
    gradient = np.vectorize(egrad(K, 1))  # element-wise gradient along the second index
    xi = xi.reshape(-1, 1)  # xi must be a column vector
    xj = xj.reshape(1, -1)  # xj must be a row vector
    return gradient(xi, xj)

def dKt(xi, xj):
    """Returns the transpose of the kernel derivative matrix by explicit computation.

    Args:
        xi: Sequence of domain values corresponding to the points changing over rows.
        xj: Sequence of domain values corresponding to the points changing over columns.

    Returns:
        numpy.array corresponding to the kernel derivative values applied over the grid (xj, xi).
    """
    gradient = np.vectorize(egrad(K, 0))  # element-wise gradient along the first index
    xi = xi.reshape(-1, 1)  # xi must be a column vector
    xj = xj.reshape(1, -1)  # xj must be a row vector
    return gradient(xi, xj)

def d2K(xi, xj):
    """Returns the second (mixed) kernel derivative matrix.

    Args:
        xi: Sequence of domain values corresponding to the points changing over rows.
        xj: Sequence of domain values corresponding to the points changing over columns.

    Returns:
        numpy.array corresponding to the second kernel derivative applied over the grid (xi, xj).
    """
    gradient = np.vectorize(egrad(egrad(K, 1), 0))  # mixed derivative w.r.t. both indices
    xi = xi.reshape(-1, 1)  # xi must be a column vector
    xj = xj.reshape(1, -1)  # xj must be a row vector
    return gradient(xi, xj)
This seems to work, and I’ve used it (previous versions of it, coupled with TensorFlow’s Adam and Stan’s HMC/L-BFGS) to solve differential equations such as:
\(\frac{dy}{dx} = \sin(xy); \quad y(0) = 1\)
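As a cross-check of the autograd version (a minimal sketch with my own variable names, using the closed-form derivatives of \(k(s, t) = \exp[-(s - t)^2]\)), the joint covariance from the derivative-GP distribution above can be assembled and sampled directly:

import numpy as np
t = np.linspace(-1.0, 1.0, 50)
D = t.reshape(-1, 1) - t.reshape(1, -1)   # pairwise differences t_i - t_j
E = np.exp(-np.square(D))                 # K(t_i, t_j)
K_fd = 2 * D * E                          # dK/dt_j, the closed form of dK above
K_df = -2 * D * E                         # dK/dt_i, the closed form of dKt above
K_dd = (2 - 4 * np.square(D)) * E         # d2K/dt_i dt_j, the closed form of d2K above
joint = np.block([[E, K_fd], [K_df, K_dd]])
L = np.linalg.cholesky(joint + 1e-8 * np.eye(2 * len(t)))  # small jitter for stability
z = L @ np.random.normal(size = 2 * len(t))
f, df = z[:len(t)], z[len(t):]            # a joint draw of the GP and its derivative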

Modeling SDE equivalents of GPs
There’s some great literature out there about modeling GPs as solutions of differential equations with a random component, but before I encountered it, the following was a brute-force attempt to model the functions \(\mu(X_t), \sigma(X_t)\), where \(X_t\) is a continuous-time stochastic process and \(W_t\) is the standard Wiener process:
\[dX_t = \mu(X_t)dt + \sigma(X_t)dW_t\]When \(X_t\) is a Gaussian process, equating the Euler-Maruyama discretization of the SDE above with the GP expressed as \(LZ\), where \(L\) is the Cholesky factor of the covariance matrix, shows that the random normal vector \(Z\) is exactly the random part of the SDE: \(dW_t = \sqrt{\Delta t}\, Z\).
Hence, modeling the functions \(\mu, \sigma\) and minimizing the distance between \(Z\) and \(dW_t / \sqrt{\Delta t}\) is a way to obtain those functions without solving the SDE. When this matching is mathematically infeasible, the algorithm fails.
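A rough sketch of that brute-force matching (illustrative only: mu_fn, sigma_fn, the kernel argument, and the alignment of \(Z\) with the increments are my own choices here, not the original code):

import numpy as np
def whitening_loss(mu_fn, sigma_fn, x, t, kernel):
    # Distance between the GP's whitened noise Z and the SDE's scaled increments.
    dt = t[1] - t[0]  # assumes a uniform time grid
    L = np.linalg.cholesky(kernel(t) + 1e-8 * np.eye(len(t)))
    z = np.linalg.solve(L, x)  # Z recovered from x = L Z
    # Euler-Maruyama: x_{k+1} - x_k = mu(x_k) dt + sigma(x_k) dW_k
    dW = (x[1:] - x[:-1] - mu_fn(x[:-1]) * dt) / sigma_fn(x[:-1])
    return np.sum((z[:-1] - dW / np.sqrt(dt)) ** 2)  # since dW_t = sqrt(dt) Z

Minimizing this over parametric forms of mu_fn and sigma_fn with any gradient-based optimizer is the brute-force version of the idea above.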
Simo Särkkä has shown that GPs can be written as state space models, which are models of the form:
\[\dot {\bf x} = A {\bf x} + B {\bf u} \\ f = C {\bf x} + D {\bf u}\]For the centered Matérn GP:
\[A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -\lambda^3 & -3 \lambda^2 & -3 \lambda \end{bmatrix}, B = \begin{bmatrix} 0 \\0 \\ 1 \end{bmatrix} \\ C = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}, D = \begin{bmatrix} 0 \end{bmatrix} \\ u \sim \mathcal N, \; \nu = 2.5, \; \lambda = \sqrt{2\nu}/l\]
Python code for Matérn SSM-GP Simulation
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import StateSpace
plt.ion(); plt.style.use("ggplot")
##### MATERN GAUSSIAN PROCESS PARAMETERS
l = np.sqrt(2 * 2.5) / 1.0 # lambda = sqrt(2 * nu) / lengthscale, with nu = 2.5 and lengthscale 1.0
a = np.array([[0, 1, 0], [0, 0, 1], [-l**3, -3*l**2, -3*l]])
b = np.array([[0], [0], [1]])
c = np.array([[1, 0, 0]])
d = np.array([[0]])
#### STATE SPACE MODEL INPUTS
u = np.random.normal(size = 1000)
t = np.linspace(0, 10, 1000)
x0 = np.random.normal(size = 3) # initial state of the three-dimensional SSM
#### SIMULATE SSM
model = StateSpace(a, b, c, d)
t, f, x = model.output(u, t, X0 = x0)
#### PLOT
plt.plot(t, f) # f = Cx is the GP sample path
I suspect that the RBF GP can’t (easily) be represented this way because it’s infinitely differentiable; the Matérn GP isn’t, so a derivative of high enough order appears random (this is what \(u(t)\) is set to), and integrating it that many times over leads to a smooth function. So, it’s probably unsurprising that the RBF kernel needs to be written in its infinite series representation before it can be represented as an SSM.
There’s also interesting literature out there about Gaussian process convolution models, which, in some sense, are the moving-average counterparts of the autoregressive approach above.
2020
Astrophotography
I used to do a fair bit of astrophotography in university - it’s harder to find good skies now, living in the city. Here are some of my old pictures. I kept making rookie mistakes (too high an ISO, too little exposure time, using a slow lens, bad stacking, …), for which I apologize!
Probabilistic PCA
I’ve been reading about PPCA, and this post summarizes my understanding of it. I took a lot of this from Pattern Recognition and Machine Learning by Bishop.
Modelling with Spotify Data
The main objective of this post was just to write about my typical workflow and views rather than come up with a great model. The structure of this data is also outside my immediate domain so I thought it’d be fun to write up a small diary on making a model with it.
Random Stuff
Random Projects
Morphing with GPs
The main aim here was to morph the space inside a square in a way that preserves some kind of ordering of the points. I wanted to use this to generate random graphs on a flat surface and introduce spatial deformation to make the graphs more interesting.
SEIR Models
I had a go at a few SEIR models, this is a rough diary of the process.
Speech Synthesis
The initial aim here was to model speech samples as realizations of a Gaussian process with some appropriate covariance function, by conditioning on the spectrogram. I fit a spectral mixture kernel to segments of audio data and concatenated the segments to obtain the full waveform. Partway into writing efficient sampling code (generating waveforms using the Gaussian process state space representation), I realized that it’s actually quite easy to obtain waveforms if you’ve already got a spectrogram.
Efficient Gaussian Process Computation
I’ll try to give examples of efficient Gaussian process computation here, like the vec trick (Kronecker product trick), efficient Toeplitz and circulant matrix computations, RTS smoothing and Kalman filtering using state space representations, and so on.
Sparse Gaussian Processes
Minimal Working Examples
2019
Gaussian Process Middle C
First of my experiments on audio modeling using Gaussian processes. Here, I construct a GP that, when sampled, plays middle C the way a grand piano would.
An Ising-Like Model
… using Stan & HMC
Stochastic Bernoulli Probabilities
Consider: