Exact Densities for Non-Monotonic Transformations (draft)

2020 edit: This post is probably all sorts of wrong; I haven't checked it in years. You could achieve the aim of this post using a normalizing flow, and it'd be a lot easier than what I wrote here.

This post was inspired by a late night thought that it should be possible to treat non-monotonic functions as piecewise monotonic and use the change of variables technique to obtain densities… and lo, you can:

If two random variables \(X\) and \(Y\) are related as \(Y = g(X)\), where \(g\) is a differentiable non-monotonic (but piecewise monotonic) function, and if the density of \(X\) is known to be \(p_X\), then the density of \(Y\) can be expressed as:

\[p_Y(y) = \sum_{x_i \in g^{-1}(y)} p_X(x_i) / \left| \frac{dg}{dx}(x_i) \right|\]

… where the sum is over all values \(x_i\) in the preimage set corresponding to \(y\).
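
As a quick sanity check (a minimal sketch of my own, not part of the original statement): take \(g(x) = x^2\) with \(X \sim N(0, 1)\). The preimage of any \(y > 0\) is \(\{-\sqrt{y}, \sqrt{y}\}\) and \(|dg/dx| = 2\sqrt{y}\) at both points, so the formula should reproduce the chi-squared density with one degree of freedom:

### Sanity check of the formula with g(x) = x^2 and X ~ N(0, 1)

y_check <- seq(0.1, 5, by = 0.1)
p_Y <- (dnorm(sqrt(y_check)) + dnorm(-sqrt(y_check))) / (2 * sqrt(y_check))
max(abs(p_Y - dchisq(y_check, df = 1))) # ~ 0, up to floating point error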

References:

  1. Kobayashi, Mark and Turin’s Probability, Random Processes, and Statistical Analysis
  2. Wikipedia

Proof of the Exact Density

From the KMT book, one (handwavy) way to prove the exact result is as follows:

For any given interval \([y, y + \delta y]\), there are multiple intervals in the preimage under \(g\); here, \([x_1, x_1 + \delta x_1]\), \([x_2 + \delta x_2, x_2]\) and \([x_3, x_3 + \delta x_3]\). Note that \(\delta x_2\) is negative, since \(g\) is decreasing on that piece.

\[P(y \leq Y \leq y + \delta y) = P(x_1 \leq X \leq x_1 + \delta x_1) + P(x_2 + \delta x_2 \leq X \leq x_2) + P(x_3 \leq X \leq x_3 + \delta x_3)\] \[\Rightarrow p_Y(y) \delta y \approx p_X(x_1) \delta x_1 - p_X(x_2) \delta x_2 + p_X(x_3) \delta x_3\] \[\Rightarrow p_Y(y) = p_X(x_1) \frac{dx}{dy} \Big|_{x_1} - p_X(x_2) \frac{dx}{dy} \Big|_{x_2} + p_X(x_3) \frac{dx}{dy} \Big|_{x_3}\]

Noting that \(x = g^{-1}(y)\) piecewise, and that the derivative \(\frac{dx}{dy}\big|_{x_2}\) is negative (so its minus sign turns into an absolute value), the result follows.

R Code for the Graph
### SETUP

x <- seq(-5, 5, length.out = 200)

f <- function(x) x^3 - 5*x + 2*cos(pi * x)
y <- Vectorize(f)(x) # Vectorize in case broadcasting fails for more complex functions

### PLOT

plot(x, y, xaxt = "n", yaxt = "n", type = "l")
abline(v = c(-2.34, 0.26, 2.01), h = 0, lty = 3)
text(-2, -13.5, "x1 + δx1")
text(0.5, -13.5, "x2 + δx2")
text(2.1, -13.5, "x3 + δx3")
text(-2.6, 0.3, "y + δy")

The great thing about this method is that one has the density outright; this can then be fed into other MCMC samplers in combination with more complex models (which wouldn't be the case with plain Monte Carlo sampling).

The preimage set can be a pain to obtain, but one possible way to get it is by fitting a spline to the function \(g\), breaking it up into monotonic pieces and inverting each piece. Both smoothing splines and GPs are suitable here, as they are differentiable and derivative values can be obtained explicitly.
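
Another option, sketched below as a minimal root-finding alternative (my own addition, assuming \(g\) is cheap to evaluate on a fine grid), is to bracket the sign changes of \(g(x) - y\) and refine each root with uniroot:

### Root-finding sketch for the preimage set

preimage <- function(g, y, lower = -5, upper = 5, n_grid = 1000){
    grid <- seq(lower, upper, length.out = n_grid)
    h <- g(grid) - y
    flips <- which(diff(sign(h)) != 0) # indices where g(x) - y changes sign
    sapply(flips, function(i) uniroot(function(x) g(x) - y,
                                      c(grid[i], grid[i + 1]))$root)
}

preimage(function(x) x^3 - 5*x + 2*cos(pi * x), y = 0) # roughly the three crossings marked on the plot above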


Simulation

The function in this demonstration is:

\[Y = X^3 - 5X + 2\cos(\pi X)\]

Obtaining the density of \(Y\) and using MCMC to sample from it gives us a sample from \(Y\), but we could've used plain Monte Carlo simulation (pushing samples of \(X\) through \(g\)) to obtain samples from the same distribution as well. Comparing the two methods results in a QQ plot; a sketch of the comparison is given after the MCMC code below.

MCMC Code
### The code is messy at the moment, I will clean it up and vectorize later

library(ggplot2)

qplot(x, y, geom = "line", main = "Example Function")

mcmc <- function(target){

    X <- numeric(10000)

    X[1] <- runif(1)

    for(i in 2:length(X)){
        proposal <- rnorm(1, X[i - 1], 2)
        a <- target(proposal)/(target(X[i - 1]) + 0.00001)
        X[i] <- ifelse(runif(1) <= a, proposal, X[i - 1])
    }

    return(X)
    
}

fun_box <- function(){
    
    model <- smooth.spline(x, y) # fit the spline to the full data set
    derv <- as.data.frame(predict(model, x, deriv = 1)) # get the derivatives to identify monotonic regions
    
    ends <- cumsum(rle(derv$y >= 0)$lengths) # derv$y >= 0 identifies monotonic runs and rle measures their lengths
    starts <- c(1, 1 + ends[-length(ends)]) # starts measures where runs start and ends records where they end
  
    n_segments <- length(ends) # number of monotonic segments in the line
    
    mods_list <- list()
    
    # fit a spline for each segment; y and x are reversed here as we're fitting a spline to the inverse function 
    for(i in 1:n_segments){
        s <- starts[i] # start point
        e <- ends[i] # end point
        mods_list[paste0("m", i)] <- list(spline(y[s:e], x[s:e]))
    }
    
    inv_func <- function(y_prop){

        rec_x <- numeric(n_segments)
        rec_d <- numeric(n_segments)

        # for each segment, predict the inverse and accept it only if the prediction lies within
        # the domain of the fitted segment, e.g. if a segment was fit where x was in [0, 1], the
        # inverse shouldn't extrapolate outside [0, 1].
        for(i in 1:n_segments){
            rec_x[i] <- approxfun(mods_list[[i]])(y_prop)
            if(!is.na(rec_x[i])){
                rec_d[i] <- predict(model, rec_x[i], deriv = 1)$y
            }else{
                rec_x[i] <- 0
                rec_d[i] <- Inf
            }
        }

        return(list(x = rec_x, d = rec_d))
    }

    # finally calculate the density as in the equation above
    target <- function(y_prop){
        inverse_info <- inv_func(y_prop)
        return(sum(dnorm(inverse_info$x) / abs(inverse_info$d)))
    }

    return(mcmc(target))    
    
}

results <- fun_box()

qplot(results)

… it works!
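
For completeness, here's a minimal sketch of the QQ comparison mentioned above; it reuses f and results from the code above and assumes \(X \sim N(0, 1)\), matching the dnorm term in the target:

### QQ comparison: plain Monte Carlo vs the MCMC sample of Y

y_mc <- Vectorize(f)(rnorm(length(results))) # push N(0, 1) draws through g
qqplot(y_mc, results, xlab = "Monte Carlo g(X)", ylab = "MCMC sample of Y")
abline(0, 1, lty = 3)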

The More Interesting Problem

We can use this methodology to answer a really interesting question: Is it possible to sample from \(X\) if the distribution of \(Y\) is known instead? (Yep!)

My first attempt, before seeing the result displayed on this page, was to run Metropolis-Hastings using just a single term in the density of \(X\): the density of \(Y\) multiplied by its Jacobian at that point. Surprisingly, this gives very good results anyway, saving the need to calculate the set of inverse values; a sketch of the shortcut follows below. Perhaps this is due to some kind of averaging effect that the MCMC produces.
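
Here is a minimal sketch of that single-term shortcut (my own reconstruction, assuming the target distribution of \(Y\) is the uniform one used in the simulation below): score each proposal \(x\) by \(p_Y(g(x))\,|g'(x)|\) only, and reuse the Metropolis-Hastings sampler from above.

### Single-term approximation sketch (reuses f, y and mcmc() from above)

dg <- function(x) 3*x^2 - 5 - 2*pi*sin(pi*x) # derivative of the example g

single_term_target <- function(x_prop){
    dunif(f(x_prop), min = min(y), max = max(y)) * abs(dg(x_prop))
}

x_single <- mcmc(single_term_target)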

There's an inelegant but more or less correct way of solving this problem. The density of \(Y\) at any \(g(x)\) is known, and so is the derivative information, so for every \(x\) we can calculate \(y = g(x)\) and then invert the function, i.e. calculate the set \(\\{ x_i : g(x_i) = y \\}\). By doing this, one can obtain a set of linear equations for the density of \(X\) in terms of the density of \(Y\):

\[p_Y(g(x)) = \sum_{x_i \in g^{-1}(y)} \hat p_X(x_i) / \left| \frac{dg}{dx}(x_i) \right|\]

The way I've modeled this is by treating the CDF corresponding to the density \(p_X\) as an unknown function: a GP with its start and end points forced to \(0\) and \(1\), and with its derivative (the density, which is then also a GP) forced to be positive. I then use a fine grid on \(X\) and construct a large matrix of linear combinations according to the equation above:

\[\begin{bmatrix} p_Y(g(x_1)) \\ p_Y(g(x_2)) \\ \vdots \end{bmatrix} = \begin{bmatrix} |g'(x_1)|^{-1} & 0 & \dots & |g'(x_7)|^{-1} & \dots \\ 0 & |g'(x_2)|^{-1} & \dots & 0 & \dots \\ \vdots & \vdots & & \vdots & \ddots \end{bmatrix} \begin{bmatrix} \hat p_X(x_1) \\ \hat p_X(x_2) \\ \vdots \end{bmatrix}\]

… and set the observed difference between the LHS and RHS to zero. (In the first row, \(x_7\) stands for another element of the preimage of \(g(x_1)\), hence the extra \(|g'(x_7)|^{-1}\) entry.)

I'm sure there's a more elegant way of doing this, but the main takeaway is that it's possible. In more complex models, we now know exactly how to change a likelihood / loss function when dependencies between parameters are known beforehand.
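
One simpler route, as a minimal non-Bayesian sketch (my own addition; it reuses x, f and the linear-combination matrix C constructed in gp_density.R below, plus the CRAN package nnls): solve the linear system directly under a nonnegativity constraint and renormalize, skipping the GP/CDF machinery entirely.

### Nonnegative least-squares sketch for the same linear system

library(nnls)

p_y   <- dunif(f(x), min(f(x)), max(f(x))) # target density of Y evaluated at g(x)
fit   <- nnls(C, p_y)                      # nonnegative least squares for p_X on the grid
p_hat <- fit$x / sum(fit$x * diff(x)[1])   # renormalize so the density integrates to one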

Here’s a simulation; we find a density for \(X\) such that the density of \(Y\) is uniform over its range. The function is the same as the one shown above. Results:

mat_worker_two.py
# This is straight from my Gaussian processes post with a small adjustment
# for reticulate (the lists are first formatted as np.array-s)

from autograd import numpy as np
from autograd import elementwise_grad as egrad

def K(xi, xj):
    """ Calculates the kernel matrix

    The domain values needn't be the same.
    Currently, the only supported dimension of each value of x is one.

    Args:
        xi: Sequence of domain values corresponding to the points changing over rows.
        xj: Sequence of domain values corresponding to the points changing over columns.

    Returns:
        numpy.array corresponding to the kernel values applied over the grid (xi, xj).
    """
    if type(xi) is list or type(xj) is list:
        xi = np.array(xi)
        xj = np.array(xj)

    if type(xi) is not float and type(xj) is not float: # if vectorization is possible:
        xi = xi.reshape(-1, 1) # reshape as a column vector
        xj = xj.reshape(1, -1) # reshape as a row vector
    return np.exp(-np.square(xi - xj)) # return kernel calculation

def dK(xi, xj):
    """ Returns the first kernel derivative matrix w.r.t. the second index

    Args:
        xi: Sequence of domain values corresponding to the points changing over rows.
        xj: Sequence of domain values corresponding to the points changing over columns.

    Returns:
        numpy.array corresponding to the kernel derivative values applied over the grid (xi, xj).       
    """
    if type(xi) is list or type(xj) is list:
        xi = np.array(xi)
        xj = np.array(xj)

    gradient = np.vectorize(egrad(K, 1)) # element-wise gradient along the second index
    xi = xi.reshape(-1, 1) # xi must be a column vector
    xj = xj.reshape(1, -1) # xj must be a row vector
    return gradient(xi, xj)

def dKt(xi, xj):
    """ Returns the transpose of Kernel Derivative Matrix by explicit computation

    Args:
        xi: Sequence of domain values corresponding to the points changing over rows.
        xj: Sequence of domain values corresponding to the points changing over columns.

    Returns:
        numpy.array corresponding to the kernel derivative values applied over the grid (xj, xi).
    
    """
    if type(xi) is list or type(xj) is list:
        xi = np.array(xi)
        xj = np.array(xj)

    gradient = np.vectorize(egrad(K, 0)) # element-wise gradient along the first index
    xi = xi.reshape(-1, 1) # xi must be a column vector
    xj = xj.reshape(1, -1) # xj must be a row vector
    return gradient(xi, xj)

def d2K(xi, xj):
    """ Returns the second kernel derivative matrix

    Args:
        xi: Sequence of domain values corresponding to the points changing over rows.
        xj: Sequence of domain values corresponding to the points changing over columns.

    Returns:
        numpy.array corresponding to the second kernel derivative applied over the grid (xi, xj).
    
    """
    if type(xi) is list or type(xj) is list:
        xi = np.array(xi)
        xj = np.array(xj)

    gradient = np.vectorize(egrad(egrad(K, 1), 0)) # two derivatives are taken
    xi = xi.reshape(-1, 1) # xi must be a column vector
    xj = xj.reshape(1, -1) # xj must be a row vector
    return gradient(xi, xj)
gp_density.R
# Need to tidy

library(rstan)
library(ggplot2)
library(reticulate)

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

source_python("mat_worker_two.py") # Contains functions for calculating GP covariances

##### Initial Setup

f <- function(x) x^3 - 5*x + 2*cos(pi * x)   # the weird function from above
df <- function(x) 3*x^2 - 5 - 2*pi*sin(pi*x) # ... and its derivative

x <- seq(-3, 3, length.out = 100)            # note that the domain of x is bounded
y <- f(x)

##### Covariance Calculations

# Here, we calculate the covariance of a GP with lengthscale one
# such that the top half corresponds to the function and the bottom
# half corresponds to its derivative observations. The top half is 
# the CDF and the bottom, its density

S <- matrix(0, length(x) * 2, length(x) * 2)

# Structure of the matrix:
# 
#     | cov(f, f)    cov(f, f')  |
# S = |                          |
#     | cov(f', f)   cov(f', f') |
# 

n <- length(x)

S[      1:n    ,       1:n    ] <- K(x, x)   # cov(f, f)
S[      1:n    , (n + 1):(2*n)] <- dK(x, x)  # cov(f, f')
S[(n + 1):(2*n),       1:n    ] <- dKt(x, x) # cov(f', f)
S[(n + 1):(2*n), (n + 1):(2*n)] <- d2K(x, x) # cov(f', f')

S <- as.matrix(Matrix::nearPD(S)$mat) # nearest positive definite covar matrix

##### Inverse Function Calculations

# Here, we model the function as a set of monotonic segments using splines

model <- smooth.spline(x, y) 
derv <- as.data.frame(predict(model, x, deriv = 1)) # df of x and dg(x)/dx

ends <- cumsum(rle(derv$y >= 0)$lengths) # the index where each monotonic segment ends
starts <- c(1, 1 + ends[-length(ends)]) # the index where each monotonic segment starts

n_segments <- length(ends) # number of monotonic segments

mods_list <- list() # placeholder for the list of monotonic splines

for(i in 1:n_segments){
    s <- starts[i]
    e <- ends[i]

    # flip the x and y here to model the inverse (i.e. for an input y, you'd get
    # a corresponding x = g^(-1)(y) value).
    mods_list[paste0("m", i)] <- list(spline(y[s:e], x[s:e]))
}

inv_func <- function(y_prop){
# This function returns a list of x values and
# derivatives of g(x) where g(x) = y_prop

    rec_x <- numeric(n_segments)
    rec_d <- numeric(n_segments)

    for(i in 1:n_segments){
        # get inverse value of x in each segment
        rec_x[i] <- approxfun(mods_list[[i]])(y_prop)

        if(!is.na(rec_x[i])){
            # for each x in the correct segment, store the derivative at the location (x, y_prop)
            rec_d[i] <- predict(model, rec_x[i], deriv = 1)$y
        }else{
            # if the bounds on each segment were overstepped, approxfun returns NA
            rec_x[i] <- 0
            rec_d[i] <- Inf # we divide non-applicable densities by Inf, so they don't count
        }
    }

    return(list(x = round(rec_x, 3), d = rec_d))
}

##### Linear combination matrix

C <- matrix(0, length(x), length(x))

for(i in 1:length(x)){

    # for every x, we get the list of all other xs that could result in the same y
    temp_list <- inv_func(f(x[i]))

    for(j in 1:n_segments){
        index <- which.min(abs(x - temp_list$x[j])) # index locations of those xs
        C[i, index] <- 1/abs(temp_list$d[j]) # corresponding matrix values are set to 1/dg(x_j)
    }

}

model_string <- "
data {
    int N;
    vector[N] y_dens;   // density of y at g(vector x)
    vector[N] err_obs;  // observed difference between y_dens and C * f, set to zero
    matrix[N, N] C;     // linear combination matrix
    matrix[2*N, 2*N] S; // covariance matrix for the GP
    matrix[N, N] err;   // diagonal matrix of near-zero values for the observed error covariances
}
parameters {
    vector<lower = 0, upper = 1>[N] p; // CDF
    vector<lower = 0>[N] f;            // density
}
transformed parameters {
    vector<lower = 0>[2*N] P; // concatenated (cdf, dens)
    vector<lower = 0>[2*N] mu; // prior mean: 0.5 for the CDF part, 0 for the density part

    for(i in 1:N){
        P[i] = p[i];
        P[i + N] = f[i];
        mu[i] = 0.5;
        mu[i + N] = 0;
    }
}
model {
    P ~ multi_normal(mu, S); // GP for the cdf & density

    // cdf anchored at 0 and 1
    p[1] ~ normal(0, 0.00001);
    p[N] ~ normal(1, 0.00001);

    err_obs ~ multi_normal(y_dens - (C * f), err);
}
"

data <- list(N = length(x), S = S, err_obs = rep(0, length(x)),
    y_dens = dunif(f(x), min(f(x)), max(f(x))),
    C = C, err = diag(length(x)) * 1e-3)

model <- stan(model_code = model_string, iter = 1000, data = data)

results <- extract(model, pars = c("p", "f"))

par(mfrow = c(2, 1))

qplot(x, colMeans(results$p), main = "(Pointwise) Mean CDF", color = "Cumulative Probability",
    geom = "line", ylab = "Cum. Prob. / Density") +
    geom_line(data = data.frame(x = x, dens = colMeans(results$f)),
        mapping = aes(x = x, y = dens, color = "Density"))

# target density for X: pick a random posterior draw of the density and
# interpolate it at u; outside the (bounded) domain of x, return zero
target <- function(u) {
    i <- floor(runif(1, 1, 2000)) # index of a random posterior draw

    if(abs(u) <= 2.99) 
        dens <- approxfun(x, results$f[i,])(u)
    else
        dens <- 0

    return(dens)
}

# simple metropolis hastings sampler
mcmc <- function(target){

    x <- numeric(100000); x[1] <- runif(1)

    for(i in 2:100000){
        proposal <- rnorm(1, x[i - 1], 2)

        a <- target(proposal)/(target(x[i - 1]) + 0.00001)
        x[i] <- ifelse(runif(1) <= a, proposal, x[i - 1])
    }

    return(x)

}

x_pred <- mcmc(target)

qplot(x_pred, fill = "x", xlab = "Sampled X Values", ylab = "Frequency",
    main = "Histogram of Sampled X Values")

qplot(f(x_pred), main = "Histogram of g(Sampled X)", fill = "blue",
    ylab = "Frequency", xlab = "g(Sampled X Values)", alpha = 0.7)

qplot(sort(f(x_pred)), sort(runif(100000, min(f(x)), max(f(x)))),
        main = "QQPlot of Predicted Vs Actual Y = g(X)", ylab = "Predicted",
        xlab = "Sampled through original method", color = 1) +
    geom_abline(intercept = 0, slope = 1)

We can use this method to set informative priors in situations where information is available about a complex transformation of the prior and not the prior itself. Perhaps this method can be used to obtain solutions for Bayesian inverse problems as well.


If you see any errors in the post or if you’d like to comment, please create an issue on GitHub.
