## Exact Densities for Non-Monotonic Transformations (draft)

2020 edit: This post is probably all sorts of wrong, haven’t checked it in years. You could achieve the aim of this post using a normalizing flow and it’d be a lot easier than what I wrote here.

This post was inspired by a late night thought that it should be possible to treat non-monotonic functions as piecewise monotonic and use the change of variables technique to obtain densities… and lo, you can:

If two random variables \(X\) and \(Y\) are related as \(Y = g(X)\), where \(g\) is a differentiable non-monotonic (but piecewise monotonic) function, and if the density of \(X\) is known to be \(p_X\), then the density of \(Y\) can be expressed as:

\[p_Y(y) = \sum_{x_i \in g^{-1}(y)} p_X(x_i) / \left| \frac{dg}{dx}(x_i) \right|\]… where the sum is over all values \(x_i\) in the preimage set corresponding to \(y\).
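As a quick sanity check (a standard textbook example, not from the original post): for \(Y = X^2\) with \(X\) standard normal, the preimage of \(y > 0\) is \(\{\sqrt{y}, -\sqrt{y}\}\) and \(|g'(\pm\sqrt{y})| = 2\sqrt{y}\), so

\[p_Y(y) = \frac{p_X(\sqrt{y}) + p_X(-\sqrt{y})}{2\sqrt{y}} = \frac{e^{-y/2}}{\sqrt{2\pi y}}\]

… which is the familiar \(\chi^2_1\) density.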

### Proof of the Exact Density

Following the KMT book, one handwavy way to prove the exact result is as follows:

For a small interval \([y, y + \delta y]\), there are multiple corresponding intervals in the preimage under \(g\): here, \([x_1, x_1 + \delta x_1]\), \([x_2, x_2 + \delta x_2]\) and \([x_3, x_3 + \delta x_3]\). Note that \(\delta x_2\) is negative.

\[P(y \leq Y \leq y + \delta y) = P(x_1 \leq X \leq x_1 + \delta x_1) + P(x_2 + \delta x_2 \leq X \leq x_2) + P(x_3 \leq X \leq x_3 + \delta x_3)\]

\[\Rightarrow p_Y(y) \, \delta y \approx p_X(x_1) \delta x_1 - p_X(x_2) \delta x_2 + p_X(x_3) \delta x_3\]

\[\Rightarrow p_Y(y) = p_X(x_1) \frac{dx}{dy} \Big|_{x_1} - p_X(x_2) \frac{dx}{dy} \Big|_{x_2} + p_X(x_3) \frac{dx}{dy} \Big|_{x_3}\]

Noting that \(x = g^{-1}(y)\) and that the derivative \(dx/dy\) in the second term above is negative, the solution follows.

## R Code for the Graph

The great thing about this method is that one has the density outright - this can then be fed into other MCMC samplers in combination with more complex models (which wouldn't be the case with plain Monte Carlo).

The preimage set can be a pain to obtain, but one possible way to get it is by fitting a spline to the function \(g\), breaking it up into monotonic pieces and inverting them. Both smoothing splines and GPs are suitable here as they are differentiable and derivative values can be obtained explicitly.

### Simulation

The function in this demonstration is:

\[Y = X^3 - 5X + 2\cos(\pi X)\]

Obtaining the density of \(Y\) and using MCMC to sample from it gives us a sample from \(Y\), but we could've used plain Monte Carlo simulation to obtain samples from the distribution as well. Comparing the two methods results in this QQ plot:
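The piecewise inversion and the density formula can be sketched numerically. This is a simplified stand-in for the spline-based inversion described above: it brackets sign changes of \(g(x) - y\) on a fine grid rather than fitting a spline, and all names here are illustrative, not from the original code.

```python
import numpy as np
from scipy import optimize, stats

def g(x):
    return x**3 - 5*x + 2*np.cos(np.pi*x)

def g_prime(x):
    return 3*x**2 - 5 - 2*np.pi*np.sin(np.pi*x)

def preimage(y, lo=-4.0, hi=4.0, n=4001):
    """All solutions of g(x) = y in [lo, hi], found by bracketing
    sign changes of g(x) - y on a fine grid."""
    xs = np.linspace(lo, hi, n)
    vals = g(xs) - y
    roots = []
    for i in range(n - 1):
        if vals[i] == 0.0:
            roots.append(xs[i])
        elif vals[i] * vals[i + 1] < 0:
            roots.append(optimize.brentq(lambda x: g(x) - y, xs[i], xs[i + 1]))
    return np.array(roots)

def p_Y(y, p_X=stats.norm.pdf):
    """Exact density of Y = g(X) via the piecewise change of variables."""
    return sum(p_X(x) / abs(g_prime(x)) for x in preimage(y))
```

Here \(X\) is assumed standard normal, so the bounds \([-4, 4]\) cover essentially all of its mass; `p_Y` can then be evaluated on a grid or handed to a sampler.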

## MCMC Code
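A minimal random-walk Metropolis sampler for an unnormalized log density might look like the following (a sketch, not the original code; names are illustrative):

```python
import numpy as np

def metropolis(log_density, x0, n_samples=20000, step=1.0, seed=0):
    """Random-walk Metropolis: propose x + N(0, step^2) and accept
    with probability min(1, density ratio)."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x = x0
    lp = log_density(x)
    for i in range(n_samples):
        prop = x + step * rng.normal()
        lp_prop = log_density(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples[i] = x
    return samples
```

For example, `metropolis(lambda x: -0.5 * x**2, 0.0)` yields approximate draws from a standard normal; plugging in the log of the exact density of \(Y\) gives the MCMC sample used in the QQ plot.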

… it works!

## The More Interesting Problem

We can use this methodology to answer a really interesting question: *is it possible to sample from \(X\) if the distribution of \(Y\) is known instead?* (Yep!)

My first attempt, before seeing the result displayed on this page, was to run Metropolis-Hastings using just a single term in the density of \(X\) - the density of \(Y\) multiplied by its Jacobian at that point - and surprisingly, this gives very good results anyway, saving the need to calculate the set of inverse values. Perhaps this is due to some kind of averaging effect that the MCMC produces.
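That single-term approximation is easy to write down (a sketch; `log_p_Y` stands for whatever log density is known for \(Y\), and \(g\) is the example function from earlier - the names are illustrative):

```python
import numpy as np

def g(x):
    return x**3 - 5*x + 2*np.cos(np.pi*x)

def g_prime(x):
    return 3*x**2 - 5 - 2*np.pi*np.sin(np.pi*x)

def log_p_X_single_term(x, log_p_Y):
    # keep only the preimage term at x itself: p_X(x) ~ p_Y(g(x)) * |g'(x)|
    return log_p_Y(g(x)) + np.log(abs(g_prime(x)))
```

This log density can be handed directly to any Metropolis-Hastings sampler, with no preimage computation required.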

There’s an inelegant but more or less correct way of solving this problem. The density of \(Y\) at any point is known, and so is derivative information, so for every \(x\) we can calculate \(y = g(x)\) and then invert the function - i.e. calculate the set \(\\{ x_i: g(x_i) = y \\}\). By doing this, one can obtain a set of linear equations for the density of \(X\) in terms of the density of \(Y\):

\[p_Y(g(x)) = \sum_{x_i \in g^{-1}(y)} \hat p_X(x_i) / \left| \frac{dg}{dx}(x_i) \right|\]

The way I’ve modeled this is by treating the CDF corresponding to the density \(p_X\) as an unknown function - a GP, with start and end points forced to \(0\) and \(1\) and its derivative (the density, which will also be a GP) forced to be positive. I then use a fine grid on \(X\) and construct a large matrix of linear combinations according to the equation above:

\[\begin{bmatrix} p_Y(g(x_1)) \\ p_Y(g(x_2)) \\ \vdots \end{bmatrix} = \begin{bmatrix} |g'(x_1)|^{-1} & 0 & \cdots & |g'(x_7)|^{-1} & \cdots \\ 0 & |g'(x_2)|^{-1} & \cdots & 0 & \cdots \\ \vdots & \vdots & & \vdots & \end{bmatrix} \begin{bmatrix} \hat p_X(x_1) \\ \hat p_X(x_2) \\ \vdots \end{bmatrix}\]

… and set the observed difference between the LHS and RHS to zero.
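A rough sketch of the discretized linear system, substituting a plain nonnegative least squares solve for the GP-on-the-CDF model described above (all names are illustrative, and \(g\) is the example function from earlier; the target makes \(p_Y\) uniform over the range of \(g\) on the grid):

```python
import numpy as np
from scipy.optimize import nnls

def g(x):
    return x**3 - 5*x + 2*np.cos(np.pi*x)

def g_prime(x):
    return 3*x**2 - 5 - 2*np.pi*np.sin(np.pi*x)

# fine grid on X
xs = np.linspace(-3, 3, 301)
ys = g(xs)

# target: p_Y uniform over the observed range of g
p_Y_target = np.full_like(xs, 1.0 / (ys.max() - ys.min()))

# row j relates p_Y(g(x_j)) to the preimage of g(x_j), with preimage
# points snapped to the nearest grid index (grid crossings of g - y)
A = np.zeros((len(xs), len(xs)))
for j, y in enumerate(ys):
    diffs = ys - y
    for i in range(len(xs)):
        hit = diffs[i] == 0.0
        crossed = i + 1 < len(xs) and diffs[i] * diffs[i + 1] < 0.0
        if hit or crossed:
            A[j, i] += 1.0 / abs(g_prime(xs[i]))

# nonnegative least squares for the unknown density values on the grid
p_X_hat, residual = nnls(A, p_Y_target)
```

This recovers grid values of \(\hat p_X\) directly; the GP treatment in the post additionally enforces smoothness and the CDF endpoint constraints.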

I’m sure there’s a more elegant way of doing this, but the main takeaway is that this is possible. In more complex models, we now know exactly how to change a likelihood / loss function when dependencies between parameters are known beforehand.

Here’s a simulation; we find a density for \(X\) such that the density of \(Y\) is uniform over its range. The function is the same as the one shown above. Results:

## mat_worker_two.py

## gp_density.R

We can use this method to set informative priors in situations where information is available about a complex transformation of the prior and not the prior itself. Perhaps this method can be used to obtain solutions for Bayesian inverse problems as well.

*If you see any errors in the post or if you’d like to comment, please create an issue on GitHub.*

## 2020

### Astrophotography

I used to do a fair bit of astrophotography in university - it’s harder to find good skies now, living in the city. Here are some of my old pictures. I kept making rookie mistakes (too high an ISO, too little exposure time, using a slow lens, bad stacking, …) - for that, I apologize!

### Probabilistic PCA

I’ve been reading about PPCA, and this post summarizes my understanding of it. I took a lot of this from Pattern Recognition and Machine Learning by Bishop.

### Modelling with Spotify Data

The main objective of this post was just to write about my typical workflow and views rather than come up with a great model. The structure of this data is also outside my immediate domain so I thought it’d be fun to write up a small diary on making a model with it.

### Random Stuff


### Random Projects


### Morphing with GPs

The main aim here was to morph space inside a square such that the transformation preserves some kind of ordering of the points. I wanted to use it to generate some random graphs on a flat surface and introduce spatial deformation to make the graphs more interesting.

### SEIR Models

I had a go at a few SEIR models, this is a rough diary of the process.

### Speech Synthesis

The initial aim here was to model speech samples as realizations of a Gaussian process with some appropriate covariance function, by *conditioning on the spectrogram*. I fit a spectral mixture kernel to segments of audio data and concatenated the segments to obtain the full waveform. Partway into writing efficient sampling code (generating waveforms using the Gaussian process state space representation), I realized that it’s actually quite easy to obtain waveforms if you’ve already got a spectrogram.

### Efficient Gaussian Process Computation

I’ll try to give examples of efficient Gaussian process computation here, like the vec trick (Kronecker product trick), efficient Toeplitz and circulant matrix computations, RTS smoothing and Kalman filtering using state space representations, and so on.

### Sparse Gaussian Processes

## Minimal Working Examples

## 2019

### Gaussian Process Middle C

First of my experiments on audio modeling using Gaussian processes. Here, I construct a GP that, when sampled, plays middle c the way a grand piano would.

### An Ising-Like Model … using Stan & HMC

### Stochastic Bernoulli Probabilities

Consider: