Exact Densities for Non-Monotonic Transformations (draft)

This post was inspired by a late-night epiphany that it should be possible to treat non-monotonic functions as piecewise monotonic and use the change-of-variables technique to obtain densities… and lo, you can:

If two random variables \(X\) and \(Y\) are related as \(Y = g(X)\), where \(g\) is a differentiable, non-monotonic (but piecewise monotonic) function, and if the density of \(X\) is known to be \(p_X\), then the density of \(Y\) can be expressed as:

\[ p_Y(y) = \sum_i \frac{p_X(x_i)}{\lvert g'(x_i) \rvert} \]

… where the sum is over all values \(x_i\) in the preimage set \(\{ x_i : g(x_i) = y \}\) corresponding to \(y\).
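
As a quick sanity check of the formula, consider the textbook case \(Y = X^2\) with \(X \sim \mathcal{N}(0, 1)\): every \(y > 0\) has the two preimages \(x_{1,2} = \pm\sqrt{y}\), and \(\lvert g'(x_i) \rvert = 2\sqrt{y}\), so

\[ p_Y(y) = \frac{p_X(\sqrt{y}) + p_X(-\sqrt{y})}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi y}} e^{-y/2}, \qquad y > 0, \]

which is the \(\chi^2_1\) density.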

References:

  1. Kobayashi, Mark and Turin, Probability, Random Processes, and Statistical Analysis
  2. Wikipedia (change of variables for probability densities)

Proof of the Exact Density

From the KMT book, one (handwavy) way to prove the exact result is as follows:

For any given interval \([y, y + \delta y]\), there are multiple corresponding intervals in the preimage of the function \(g\): here, \([x_1, x_1 + \delta x_1]\), \([x_2, x_2 + \delta x_2]\) and \([x_3, x_3 + \delta x_3]\). Note that \(\delta x_2\) is negative, since \(g\) is decreasing on that segment.

The probability that \(Y\) lands in \([y, y + \delta y]\) is the sum of the probabilities that \(X\) lands in each of the preimage intervals, i.e. \(p_Y(y)\,\delta y \approx \sum_i p_X(x_i)\,\lvert \delta x_i \rvert\). Noting that \(x_i = g^{-1}(y)\) on each monotonic segment, so that \(\delta x_i \approx \delta y / g'(x_i)\) (which is what makes the second term negative), dividing by \(\delta y\) and letting \(\delta y \to 0\) gives the result.
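
The result can also be verified numerically; here is a quick sketch, using the \(Y = X^2\), \(X \sim \mathcal{N}(0, 1)\) case from above (where the preimage is available in closed form) with arbitrary sample and grid sizes:

### Numerical check: exact density vs Monte Carlo for Y = X^2

x_samp <- rnorm(100000)  # Monte Carlo sample of X
y_samp <- x_samp^2       # push it through g(x) = x^2

y_grid <- seq(0.01, 6, length.out = 200)
exact  <- (dnorm(sqrt(y_grid)) + dnorm(-sqrt(y_grid))) / (2 * sqrt(y_grid)) # formula above

hist(y_samp, breaks = 200, freq = FALSE, xlim = c(0, 6),
     main = "Exact density vs Monte Carlo", xlab = "y")
lines(y_grid, exact, lwd = 2)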

R Code for the Graph
### SETUP

x <- seq(-5, 5, length.out = 200)

f <- function(x) x^3 - 5*x + 2*cos(pi * x)
y <- Vectorize(f)(x) # Vectorize in case broadcasting fails for more complex functions

### PLOT

plot(x, y, xaxt = "n", yaxt = "n", type = "l")
abline(v = c(-2.34, 0.26, 2.01), h = 0, lty = 3)
text(-2, -13.5, "x1 + δx1")
text(0.5, -13.5, "x2 + δx2")
text(2.1, -13.5, "x3 + δx3")
text(-2.6, 0.3, "y + δy")

The great thing about this method is that one obtains the density outright - this can then be fed into MCMC samplers in combination with more complex models (which wouldn’t be possible with plain Monte Carlo simulation alone).

The preimage set can be a pain to obtain, but one possible way to get it is by fitting a spline to the function \(g\), breaking it up into monotonic pieces and inverting each piece. Both smoothing splines and GPs are suitable here as they are differentiable and their derivative values can be obtained explicitly.
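
If \(g\) and its derivative are available in closed form, a simpler alternative is to bracket the turning points numerically and run uniroot on each monotonic segment. Here is a rough sketch (preimage and p_y_at are just illustrative names; the function is the one used in the simulation below):

### Preimage via root finding (sketch; assumes g and its derivative are known in closed form)

g  <- function(x) x^3 - 5*x + 2*cos(pi * x)
dg <- function(x) 3*x^2 - 5 - 2*pi*sin(pi * x)

grid <- seq(-5, 5, length.out = 1000)
turns <- grid[which(diff(sign(dg(grid))) != 0)] # approximate turning points of g
breaks <- c(-5, turns, 5)                       # endpoints of the monotonic segments

preimage <- function(y){
    xs <- c()
    for(i in 1:(length(breaks) - 1)){
        lo <- breaks[i]; hi <- breaks[i + 1]
        if((g(lo) - y) * (g(hi) - y) < 0) # a sign change means g crosses y on this segment
            xs <- c(xs, uniroot(function(x) g(x) - y, c(lo, hi))$root)
    }
    xs
}

p_y_at <- function(y){
    # density of Y at a single point, as in the formula above (here X ~ N(0, 1))
    xs <- preimage(y)
    sum(dnorm(xs) / abs(dg(xs)))
}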


Simulation

The function in this demonstration is:

\[ g(x) = x^3 - 5x + 2\cos(\pi x), \qquad X \sim \mathcal{N}(0, 1). \]

Obtaining the density of \(Y\) and using MCMC to sample from it gives us a sample from \(Y\); we could also have used plain Monte Carlo simulation (sampling \(X\) and applying \(g\)) to obtain samples from the same distribution. Comparing the two methods results in this QQ plot:

MCMC Code
### The code is messy at the moment; I will clean it up and vectorize it later

library(ggplot2)

qplot(x, y, geom = "line", main = "Example Function")

mcmc <- function(target){

    X <- numeric(10000)

    X[1] <- runif(1) # arbitrary starting point

    for(i in 2:length(X)){
        proposal <- rnorm(1, X[i - 1], 2) # random walk proposal
        a <- target(proposal)/(target(X[i - 1]) + 0.00001) # acceptance ratio; the small constant guards against division by zero
        X[i] <- ifelse(runif(1) <= a, proposal, X[i - 1])
    }

    return(X)

}

fun_box <- function(){
    
    model <- smooth.spline(x, y) # fit the spline to the full data set
    derv <- as.data.frame(predict(model, x, deriv = 1)) # get the derivatives to identify monotonic regions
    
    ends <- cumsum(rle(derv$y >= 0)$lengths) # runs of constant derivative sign are the monotonic segments; cumsum gives their end indices
    starts <- c(1, 1 + ends[-length(ends)]) # each segment starts right after the previous one ends
  
    n_segments <- length(ends) # number of monotonic segments in the line
    
    mods_list <- list()
    
    # fit a spline for each segment; y and x are reversed here as we're fitting a spline to the inverse function 
    for(i in 1:n_segments){
        s <- starts[i] # start point
        e <- ends[i] # end point
        mods_list[paste0("m", i)] <- list(spline(y[s:e], x[s:e]))
    }
    
    inv_func <- function(y_prop){

        rec_x <- numeric(n_segments)
        rec_d <- numeric(n_segments)

        # for each segment, predict the inverse and accept it only if the prediction lies within
        # the domain of the fitted segment. E.g. if a segment was fit where x was in [0, 1], the
        # inverse shouldn't extrapolate outside [0, 1].
        for(i in 1:n_segments){
            rec_x[i] <- approxfun(mods_list[[i]])(y_prop)
            if(!is.na(rec_x[i])){
                rec_d[i] <- predict(model, rec_x[i], deriv = 1)$y
            }else{
                rec_x[i] <- 0
                rec_d[i] <- Inf
            }
        }

        return(list(x = rec_x, d = rec_d))
    }

    # finally calculate the density as in the equation above
    target <- function(y_prop){
        inverse_info <- inv_func(y_prop)
        return(sum(dnorm(inverse_info$x) / abs(inverse_info$d)))
    }

    return(mcmc(target))    
    
}

results <- fun_box()

qplot(results)

… it works!
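
For completeness, the QQ comparison mentioned earlier can be produced along the lines of the sketch below, where the Monte Carlo sample is obtained by pushing standard normal draws through the function:

### QQ comparison against plain Monte Carlo (sketch)

y_mc <- f(rnorm(length(results))) # Monte Carlo: push X ~ N(0, 1) through g

qplot(sort(results), sort(y_mc),
      main = "QQ Plot: MCMC vs Monte Carlo Samples of Y",
      xlab = "MCMC sample quantiles", ylab = "Monte Carlo sample quantiles") +
    geom_abline(intercept = 0, slope = 1)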

The More Interesting Problem

We can use this methodology to answer a really interesting question: Is it possible to sample from \(X\) if the distribution of \(Y\) is known instead? (Yep!)

My first attempt, before seeing the result displayed on this page, was to run Metropolis-Hastings using just a single term in the density of \(X\) - the density of \(Y\) multiplied by the Jacobian of \(g\) at that point - and surprisingly, this gives very good results anyway, saving the need to calculate the set of inverse values. Perhaps this is due to some kind of averaging effect that the MCMC produces.
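
Concretely, that single-term target looks something like the sketch below; single_term_target is just an illustrative name, and p_Y, g and dg stand for the (assumed known) density of \(Y\), the transformation and its derivative:

### Single-term approximation of the density of X (sketch)

single_term_target <- function(p_Y, g, dg){
    # one term of the full sum: the density of Y at g(x), multiplied by the Jacobian |g'(x)|
    function(x) p_Y(g(x)) * abs(dg(x))
}

# e.g. x_samples <- mcmc(single_term_target(p_Y, g, dg)), using the sampler from above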

There’s an inelegant but more or less correct way of solving this problem. The density of \(Y\) is known for any value of \(g(x)\), and so is the derivative information of \(g\), so for every grid point \(x_j\) we can calculate \(y = g(x_j)\) and then invert the function - i.e. calculate the set \( \{ x_i : g(x_i) = y \} \). By doing this, one can obtain a set of linear equations for the density of \(X\) in terms of the density of \(Y\):

\[ p_Y\bigl(g(x_j)\bigr) = \sum_{x_i \,:\, g(x_i) = g(x_j)} \frac{p_X(x_i)}{\lvert g'(x_i) \rvert}, \qquad j = 1, \dots, N. \]

The way I’ve modeled this is by treating the CDF corresponding to the density \(p_X\) as an unknown function - a GP, with its start and end points forced to \(0\) and \(1\) and its derivative (the density, which will also be a GP) forced to be positive. I then use a fine grid on \(X\) and construct a large matrix of linear combinations according to the equation above:

\[ \mathbf{p}_Y = C \, \mathbf{p}_X, \qquad C_{ji} = \begin{cases} 1 / \lvert g'(x_i) \rvert & \text{if } g(x_i) = g(x_j), \\ 0 & \text{otherwise}, \end{cases} \]

… where \(\mathbf{p}_Y\) and \(\mathbf{p}_X\) collect the values \(p_Y(g(x_j))\) and \(p_X(x_j)\) over the grid, and the observed difference between the LHS and RHS is set to zero.

I’m sure there’s a more elegant way of doing this, but the main takeaway is that this is possible. In more complex models, we now know exactly how to change a likelihood / loss function when dependencies between parameters are known beforehand.

Here’s a simulation; we find a density for \(X\) such that the density of \(Y\) is uniform over its range. The function is the same as the one shown above. Results:
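
Concretely, writing \([y_{\min}, y_{\max}]\) for the range of \(g\) over the grid, the constraint being imposed at every grid point \(x_j\) is:

\[ \sum_{x_i \,:\, g(x_i) = g(x_j)} \frac{p_X(x_i)}{\lvert g'(x_i) \rvert} = \frac{1}{y_{\max} - y_{\min}}. \]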

mat_worker_two.py
# This is straight from my Gaussian processes post with a small adjustment
# for reticulate (the lists are first formatted as np.array-s)

from autograd import numpy as np
from autograd import elementwise_grad as egrad

def K(xi, xj):
    """ Calculates the kernel matrix

    The domain values needn't be the same.
    Currently, the only supported dimension of each value of x is one.

    Args:
        xi: Sequence of domain values corresponding to the points changing over rows.
        xj: Sequence of domain values corresponding to the points changing over columns.

    Returns:
        numpy.array corresponding to the kernel values applied over the grid (xi, xj).
    """
    if type(xi) is list or type(xj) is list:
            xi = np.array(xi)
            xj = np.array(xj)

    if type(xi) is not float and type(xj) is not float: # if vectorization is possible:
        xi = xi.reshape(-1, 1) # reshape as a column vector
        xj = xj.reshape(1, -1) # reshape as a row vector
    return np.exp(-np.square(xi - xj)) # return kernel calculation

def dK(xi, xj):
    """ Returns the first kernel derivative matrix w.r.t. the second index

    Args:
        xi: Sequence of domain values corresponding to the points changing over rows.
        xj: Sequence of domain values corresponding to the points changing over columns.

    Returns:
        numpy.array corresponding to the kernel derivative values applied over the grid (xi, xj).       
    """
    if type(xi) is list or type(xj) is list:
            xi = np.array(xi)
            xj = np.array(xj)

    gradient = np.vectorize(egrad(K, 1)) # element-wise gradient along the second index
    xi = xi.reshape(-1, 1) # xi must be a column vector
    xj = xj.reshape(1, -1) # xj must be a row vector
    return gradient(xi, xj)

def dKt(xi, xj):
    """ Returns the transpose of Kernel Derivative Matrix by explicit computation

    Args:
        xi: Sequence of domain values corresponding to the points changing over rows.
        xj: Sequence of domain values corresponding to the points changing over columns.

    Returns:
        numpy.array corresponding to the kernel derivative values applied over the grid (xj, xi).
    
    """
    if type(xi) is list or type(xj) is list:
            xi = np.array(xi)
            xj = np.array(xj)

    gradient = np.vectorize(egrad(K, 0)) # element-wise gradient along the first index
    xi = xi.reshape(-1, 1) # xi must be a column vector
    xj = xj.reshape(1, -1) # xj must be a row vector
    return gradient(xi, xj)

def d2K(xi, xj):
    """ Returns the second kernel derivative matrix

    Args:
        xi: Sequence of domain values corresponding to the points changing over rows.
        xj: Sequence of domain values corresponding to the points changing over columns.

    Returns:
        numpy.array corresponding to the second kernel derivative applied over the grid (xi, xj).
    
    """
    if type(xi) is list or type(xj) is list:
            xi = np.array(xi)
            xj = np.array(xj)

    gradient = np.vectorize(egrad(egrad(K, 1), 0)) # two derivatives are taken
    xi = xi.reshape(-1, 1) # xi must be a column vector
    xj = xj.reshape(1, -1) # xj must be a row vector
    return gradient(xi, xj)
gp_density.R
# Need to tidy

library(rstan)
library(ggplot2)
library(reticulate)

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

source_python("mat_worker_two.py") # Contains functions for calculating GP covariances

##### Initial Setup

f <- function(x) x^3 - 5*x + 2*cos(pi * x)   # the weird function from above
df <- function(x) 3*x^2 - 5 - 2*pi*sin(pi*x) # ... and its derivative

x <- seq(-3, 3, length.out = 100)            # note that the domain of x is bounded
y <- f(x)

##### Covariance Calculations

# Here, we calculate the covariance of a GP with lengthscale one
# such that the top half corresponds to the function and the bottom
# half corresponds to its derivative observations. The top half is 
# the CDF and the bottom, its density

S <- matrix(0, length(x) * 2, length(x) * 2)

# Structure of the matrix:
# 
#     | cov(f, f)    cov(f, f')  |
# S = |                          |
#     | cov(f', f)   cov(f', f') |
# 

n <- length(x)

S[      1:n    ,       1:n    ] <- K(x, x)   # cov(f, f)
S[      1:n    , (n + 1):(2*n)] <- dK(x, x)  # cov(f, f')
S[(n + 1):(2*n),       1:n    ] <- dKt(x, x) # cov(f', f)
S[(n + 1):(2*n), (n + 1):(2*n)] <- d2K(x, x) # cov(f', f')

S <- as.matrix(Matrix::nearPD(S)$mat) # nearest positive definite covar matrix

##### Inverse Function Calculations

# Here, we model the function as a set of monotonic segments using splines

model <- smooth.spline(x, y) 
derv <- as.data.frame(predict(model, x, deriv = 1)) # df of x and dg(x)/dx

ends <- cumsum(rle(derv$y >= 0)$lengths) # the index where each monotonic segment ends
starts <- c(1, 1 + ends[-length(ends)]) # the index where each monotonic segment starts

n_segments <- length(ends) # number of monotonic segments

mods_list <- list() # placeholder for the list of monotonic splines

for(i in 1:n_segments){
    s <- starts[i]
    e <- ends[i]

    # flip the x and y here to model the inverse (i.e. for an input y, you'd get
    # a corresponding x = g^(-1)(y) value).
    mods_list[paste0("m", i)] <- list(spline(y[s:e], x[s:e]))
}

inv_func <- function(y_prop){
# This function returns a list of x values and
# derivatives of g(x) where g(x) = y_prop

    rec_x <- numeric(n_segments)
    rec_d <- numeric(n_segments)

    for(i in 1:n_segments){
        # get inverse value of x in each segment
        rec_x[i] <- approxfun(mods_list[[i]])(y_prop)

        if(!is.na(rec_x[i])){
            # for each x in the correct segment, store the derivative at the location (x, y_prop)
            rec_d[i] <- predict(model, rec_x[i], deriv = 1)$y
        }else{
            # if the bounds on each segment were overstepped, approxfun returns NA
            rec_x[i] <- 0
            rec_d[i] <- Inf # we divide non-applicable densities by Inf, so they don't count
        }
    }

    return(list(x = round(rec_x, 3), d = rec_d))
}

##### Linear combination matrix

C <- matrix(0, length(x), length(x))

for(i in 1:length(x)){

    # for every x, we get the list of all other xs that could result in the same y
    temp_list <- inv_func(f(x[i]))

    for(j in 1:n_segments){
        index <- which.min(abs(x - temp_list$x[j])) # index locations of those xs
        C[i, index] <- 1/abs(temp_list$d[j]) # corresponding matrix values are set to 1/dg(x_j)
    }

}
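
# Optional sanity check (sketch): applying C to a known density of X should roughly
# reproduce the density of Y = g(X). For X ~ N(0, 1), C %*% dnorm(x) can be compared
# against a kernel density estimate of simulated g(X) values, e.g.:
#
#     y_sim <- f(rnorm(1e5))
#     plot(density(y_sim), main = "C %*% p_X vs simulated density of Y")
#     points(f(x), C %*% dnorm(x), pch = 20)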

model_string <- "
data {
    int N;
    vector[N] y_dens;   // density of y at g(vector x)
    vector[N] err_obs;  // difference between y_dens - linear_combs, set to zero
    matrix[N, N] C;     // linear combination matrix
    matrix[2*N, 2*N] S; // covariance matrix for GP
    matrix[N, N] err;   // diagonal matrix of near-zero values for the observed error covariances
}
parameters {
    vector<lower = 0, upper = 1>[N] p; // CDF
    vector<lower = 0>[N] f;            // density
}
transformed parameters {
    vector<lower = 0>[2*N] P; // concatenated (cdf, dens)
    vector<lower = 0>[2*N] mu; // GP mean: 0.5 for the CDF block, 0 for the density block

    for(i in 1:N){
        P[i] = p[i];
        P[i + N] = f[i];
        mu[i] = 0.5;
        mu[i + N] = 0;
    }
}
model {
    P ~ multi_normal(mu, S); // GP for the cdf & density

    // cdf anchored at 0 and 1
    p[1] ~ normal(0, 0.00001);
    p[N] ~ normal(1, 0.00001);

    err_obs ~ multi_normal(y_dens - (C * f), err);
}
"

data <- list(N = length(x), S = S, err_obs = rep(0, length(x)),
    y_dens = dunif(f(x), min(f(x)), max(f(x))),
    C = C, err = diag(length(x)) * 1e-3)

model <- stan(model_code = model_string, iter = 1000, data = data)

results <- extract(model, pars = c("p", "f"))

par(mfrow = c(2, 1))

qplot(x, colMeans(results$p), main = "(Pointwise) Mean CDF", color = "Cumulative Probability",
    geom = "line", ylab = "Cum. Prob. / Density") +
    geom_line(data = data.frame(x = x, dens = colMeans(results$f)),
        mapping = aes(x = x, y = dens, color = "Density"))

target <- function(u) {
    i <- floor(runif(1, 1, 2000)) # pick a random posterior draw of the density

    if(abs(u) <= 2.99) # stay within the domain the density was estimated on
        dens <- approxfun(x, results$f[i,])(u)
    else
        dens <- 0

    return(dens)
}

# simple metropolis hastings sampler
mcmc <- function(target){

    x <- numeric(100000); x[1] <- runif(1)

    for(i in 2:100000){
        proposal <- rnorm(1, x[i - 1], 2)

        a <- target(proposal)/(target(x[i - 1]) + 0.00001)
        x[i] <- ifelse(runif(1) <= a, proposal, x[i - 1])
    }

    return(x)

}

x_pred <- mcmc(target)

qplot(x_pred, fill = "x", xlab = "Sampled X Values", ylab = "Frequency",
    main = "Histogram of Sampled X Values")

qplot(f(x_pred), main = "Histogram of g(Sampled X)", fill = "blue",
    ylab = "Frequency", xlab = "g(Sampled X Values)", alpha = 0.7)

qplot(sort(f(x_pred)), sort(runif(100000, min(f(x)), max(f(x)))),
        main = "QQPlot of Predicted Vs Actual Y = g(X)", ylab = "Predicted",
        xlab = "Sampled through original method", color = 1) +
    geom_abline(intercept = 0, slope = 1)

We can use this method to set informative priors in situations where information is available about a complex transformation of the prior and not the prior itself. Perhaps this method can be used to obtain solutions for Bayesian inverse problems as well.


If you see any errors in the post or if you’d like to comment, please create an issue on GitHub.
