The initial aim here was to model speech samples as realizations of a Gaussian process with some appropriate covariance function, by conditioning on the spectrogram. I fit a spectral mixture kernel to segments of audio data and concatenated the segments to obtain the full waveform.

The power spectral density (PSD) of a signal \(X_t\) is the squared magnitude of its Fourier transform; a spectrogram is the same quantity computed over short time windows (the squared magnitude of the short-time Fourier transform).
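
In symbols, echoing the definition above,

\[S_X(f) = \left| \int X_t \, e^{-2\pi i f t} \, dt \right|^2\]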

There are approximate PSD inversion algorithms (e.g. Griffin-Lim) that recover waveforms from spectrograms. There are also methods based on non-negative least squares that approximately reconstruct waveforms conditioned on mel-spectrograms, as in librosa's mel_to_audio. An example is given below:

import numpy as np
from scipy.io import wavfile as wav
from librosa.feature import melspectrogram
from librosa.feature.inverse import mel_to_audio

sampling_rate, audio = wav.read("input.wav")  # placeholder path; assumes a mono file
audio = np.array(audio, dtype = "float64")    # librosa expects a float waveform
audio = mel_to_audio(melspectrogram(y = audio, sr = sampling_rate), sr = sampling_rate)
wav.write("audio.wav", sampling_rate, np.repeat(np.array(audio, dtype = "int16"), 2).reshape(-1, 2))

The previous iteration of the GP work is given below. Interestingly, the ideas I had about fitting GPs in the frequency domain rather than the time domain, and about representing speech as one long GP, aren't unheard of - they turned out to be rediscoveries - and the state space modeling, signal processing and psychoacoustics literature around them was very interesting to read.


The idea is to fit a Gaussian process to waveforms in the frequency domain rather than the time domain. This also showcases the power of SymPy: to get the analytical form of the spectral density from the covariance function, I use SymPy to compute the Fourier transform symbolically. This is original work - I drew from Richard Turner's thesis for the amplitude demodulation part, Andrew Gordon Wilson's papers for the spectral mixture kernel, and the LJSpeech dataset for the speech data. This was a learning exercise - it's not meant to be groundbreaking; rather, the motivation is to build a simple model of speech that gives reasonable results on modest hardware.

Synthesized audio sample:


Currently, this code takes in an audio file, fits piecewise stationary GPs to 50 ms blocks (segm_resolution = 0.05 s below) and synthesizes a waveform from these GPs. Earlier commits of this page may have interesting bits of code written and discarded during the initial write-up (e.g. tests).

Code

Imports and inits:

# Author: Aditya Ravuri
# conda activate gpsynth; ipython

import sympy as sy
import numpy as np
import pandas as pd
import tensorflow as tf
from tqdm import tqdm, trange
import matplotlib.pyplot as plt
from scipy.linalg import toeplitz
from scipy.io import wavfile as wav

sy.init_printing()
md = sy.functions.Abs
sess = tf.InteractiveSession()  # note: this script uses the TensorFlow 1.x API (tf.log, tf.train, etc.)
plt.style.use("ggplot"); plt.ion()

Now, I define a spectrum function (mainly to learn how it works). An estimate of the spectral density of a process (the periodogram) is the squared magnitude of the DFT of the process.

def spectrum(x, u = 1, return_freq = True):
	# periodogram estimate of the PSD; u is the duration of x in seconds,
	# so that the returned frequencies are in Hz
	n = len(x)
	intm = np.fft.fft(x)
	intm = (intm * intm.conjugate()).real/n
	intm = intm * u/n
	psd = intm[1:int(np.ceil(0.5 * (n + 1)))]  # keep positive frequencies only
	if return_freq:
		freq = range(1, len(psd) + 1)
		freq = np.array(freq, dtype = float)
		freq *= 0.5*n / (u * len(psd))
		return freq, psd
	else:
		return psd
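
As a quick sanity check (not part of the original script), the periodogram of a pure 100 Hz tone lasting one second should peak at 100 Hz:

sr = 1000                                     # assumed sampling rate in Hz
ts = np.arange(sr) / sr                       # one second of samples
freq, psd = spectrum(np.sin(2 * np.pi * 100 * ts), u = 1)
print(freq[np.argmax(psd)])                   # ~100.0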

Here, I define a function that takes a multivariate normal covariance matrix and returns the conditional covariance matrix of the process, given that the start and end of the process are exactly zero. This is mainly for convenience: later, I synthesize the speech process in stationary blocks, and concatenating independently drawn blocks would make the waveform discontinuous, hence this conditioning. With this, the synthesis can be parallelized, although it probably introduces artifacts in the spectrum of the full process.
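
This is the standard Gaussian conditioning identity: partitioning the points into the interior (1) and the two endpoints (2),

\[\text{Cov}(x_1 \mid x_2 = 0) = K_{11} - K_{12} K_{22}^{-1} K_{12}^\top\]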

def condition_at_zero(mat):
	n = mat.shape[0]
	# reorder so that the matrix is partitioned as [interior points, endpoints]
	indices = np.hstack([range(1, n - 1), 0, n - 1])
	mat = mat[indices, :]
	mat = mat[:, indices]

	K11 = mat[:(n - 2), :(n - 2)]
	K12 = mat[:(n - 2), (n - 2):]
	K22 = mat[(n - 2):, (n - 2):]
	return K11 - (K12 @ np.linalg.inv(K22) @ K12.T)

This function implements an autoregressive scheme: if a GP is too long to be synthesized in one go, it breaks the sequence into blocks and conditions each new block's distribution on the previous block.

def condition_on_prv(kern, seg = None):
	n = len(kern)//2
	K11 = K22 = toeplitz(kern)
	K12 = K11[:n, n:].T; K11 = K22 = K11[:n, :n]
	K22i = np.linalg.inv(K22 + np.identity(n)*1e-7)
	K1222i = K12 @ K22i
	C = np.linalg.cholesky(K11[:n, :n] - K1222i @ K12.T + np.identity(n)*1e-7)
	if seg is None:
		Cb = np.linalg.cholesky(K11[:n, :n] + np.identity(n)*1e-7)
		seg = Cb @ np.random.normal(size = n)
	return seg, (K1222i @ seg) + (C @ np.random.normal(size = n))
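
A minimal usage sketch (with a made-up exponential covariance sequence, not from the original script): each call returns the previous block and a new block conditioned on it, so a long waveform can be built up block by block.

lags = np.linspace(0, 1, 200)                   # even length, as the function assumes
kern = np.exp(-lags / 0.1)                      # exponential (OU) covariance sequence
prev, nxt = condition_on_prv(kern)              # first block is drawn unconditionally
blocks = [prev, nxt]
for _ in range(3):
	_, nxt = condition_on_prv(kern, seg = nxt)  # condition on the last synthesized block
	blocks.append(nxt)
waveform = np.hstack(blocks)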

The main program starts here. First, read the (LJSpeech) data.

if __name__ == "__main__":

	life_sr, life = wav.read("/Users/adityaravuri/data/LJSpeech-1.1/wavs/LJ001-0001.wav")

	life = np.array(life) if len(life.shape) == 1 else np.array(life[:, 0], dtype = "float64")
	life_scale = life.std()
	life = life/life_scale
	n_t = len(life)
	t = n_t/life_sr

Model the audio as the product of an envelope, as in Richard Turner's thesis, and a stationary part that contains the frequencies. This isn't strictly necessary, as the inference procedure can fit the standard deviation of the process in each 50 ms block.
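
In symbols, the waveform is treated as a modulated carrier,

\[y_t = \sigma_t x_t\]

where \(\sigma_t\) is a slowly varying positive envelope (estimated below with a rolling standard deviation) and \(x_t\) is the stationary process whose spectrum is fit.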

	envelope = np.array(pd.Series(life).rolling(2500, center = True).std().fillna(0.004))
	envelope_div = envelope.copy()
	envelope_div[envelope_div <= 0.2] = 0.2  # floor the envelope to avoid dividing by tiny values

	stn = life/envelope_div

Split the audio into 50 ms blocks.

	segm_resolution = 0.05
	segm_len = int(segm_resolution*(n_t - 2)/t); segm_n = n_t//segm_len
	segments = tuple([stn[(segm_len*i):(segm_len*(i + 1))] for i in range(segm_n)])

Calculate the Fourier transform of the theoretical covariance function symbolically (this is the symbolic spectral density). Also remember that the Fourier transform is a linear operator, so the FT of a sum of covariances is a sum of spectral densities. Here, we fit a 24 component spectral mixture kernel.

	r, w = sy.symbols("r, w", real = True)
	s, p, l = sy.symbols("s, p, l", real = True, positive = True)

	K = s**2 * sy.exp(-(r/l)**2) * sy.cos(2*p*sy.pi*r)
	S = sy.fourier_transform(K, r, w)
	S = sy.lambdify([w, l, p, s], S, "tensorflow")
	K = sy.lambdify([r, l, p, s], K)
	W = s**2
	W = sy.lambdify(s, W, "tensorflow")
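
For reference, with SymPy's default (ordinary-frequency) convention, this transform should come out as a pair of Gaussian bumps centered at \(\pm p\) - roughly (my recollection of the closed form, not copied from the script's output):

\[S(w) = \frac{\sqrt{\pi} s^2 l}{2} \left( e^{-\pi^2 l^2 (w - p)^2} + e^{-\pi^2 l^2 (w + p)^2} \right)\]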

Humans can’t hear all frequencies equally, and roughly, within certain bands of frequencies, we can’t tell two similar frequencies apart. Also, some frequencies (roughly 0.5-5 kHz) are far more important than others. Interestingly, in the presence of strong frequencies, nearby lower-power frequencies may not be heard at all - a consideration for the future. Here, we initialize the spectral mixture components within the Bark scale bands, one of the older ways the psychoacoustics community has defined such bands.

	bark_scale = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
				  1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400,
				  5300, 6400, 7700, 9500, 12000, 15500, np.infty]
	bark_scale = pd.DataFrame(dict(cuts_end = bark_scale))
	bark_scale['cuts_start'] = bark_scale.cuts_end.shift(1, fill_value = 0.0)
	bark_scale.reset_index(inplace = True)

	identifier = lambda x: bark_scale[bark_scale.cuts_end >= x].iloc[0, 0]

	v = lambda x: tf.exp(tf.Variable(x))

Initialize the variables. freq_obs is a vector of observed frequencies, spec_obs are the observed spectra corresponding to these frequencies. p_ps are the frequency (periodicity) parameters, l_ps are the lengthscale parameters and s_ps are the scale parameters of the spectral kernel:

\[K(r) = s^2 \exp(-r^2/l^2) \cos(2 \pi p r)\]
	freq_obs, _ = spectrum(segments[0], t * len(segments[0])/n_t)
	spec_obs = np.vstack([spectrum(seg, t * len(seg)/n_t, False) for seg in segments])
	freq_obs, spec_obs = [np.array(i, dtype = "float32") + 1e-30 for i in [freq_obs, spec_obs]]

	obs_spectrum = pd.DataFrame(dict(f = freq_obs))
	obs_spectrum['index'] = obs_spectrum.f.map(identifier)
	def maxer(spec):
		obs_spectrum['s'] = spec
		obs_spectrum['s_prime'] = obs_spectrum.groupby('index')['s'].transform('max')
		return np.array(obs_spectrum.s_prime, dtype = "float32")
	spec_obs_max = np.vstack([maxer(spec_obs[i, :]) for i in range(segm_n)])

	sigma = 1/(1 + v(3.))  # noise level, constrained to (0, 1)

	# initialize the component frequencies at the per-band spectral peaks
	p_ps = np.array(np.tile(freq_obs, segm_n).reshape(segm_n, len(freq_obs))[spec_obs_max == spec_obs], dtype = "float32").reshape(segm_n, -1, order = 'C')
	n_sel = ps_non_trainable = p_ps.shape[1]
	l_ps = 1/(1 + v(np.array(np.random.uniform(1, 3, size = p_ps.shape), dtype = "float32")))
	s_ps = 1/(1 + v(np.array(np.random.uniform(-0.2, 0.2, size = p_ps.shape), dtype = "float32")))

Vectorize the loss (this should really be replaced with a proper likelihood, e.g. the Whittle likelihood).
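
For reference, the Whittle log-likelihood would compare the periodogram \(I(f_k)\) with the model spectrum \(S_\theta(f_k)\) at the Fourier frequencies as, up to additive constants,

\[\ell(\theta) \approx -\sum_k \left( \log S_\theta(f_k) + \frac{I(f_k)}{S_\theta(f_k)} \right)\]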

	theo_spec = tf.reduce_sum(S(
			freq_obs.reshape(1, 1, -1),
			tf.reshape(l_ps, p_ps.shape + (1, )),
			p_ps.reshape(p_ps.shape + (1, )),
			tf.reshape(s_ps, p_ps.shape + (1, ))),
		axis = 1) + W(sigma) + 1e-30

	loss = tf.reduce_sum((tf.log(spec_obs) - tf.log(theo_spec))**2) # likelihood is too unstable
	# upweight the fit at the per-band spectral peaks
	loss += tf.reduce_sum(((tf.log(spec_obs) - tf.log(theo_spec))[spec_obs == spec_obs_max])**2)*250
	# encourage the spectrum to vary smoothly across consecutive blocks
	loss += tf.reduce_sum((tf.log(theo_spec[1:, :]) - tf.log(theo_spec[:-1, :]))**2)*0.1
	step = tf.train.AdamOptimizer(0.005).minimize(loss = loss)

	sess.run(tf.initialize_all_variables())

Optimize.

	looper = trange(40001, desc = 'ML')
	for i in looper:
		sess.run(step)
		if i % 500 == 0: looper.set_description('Loss: %g' % sess.run(loss))

	l_ps = sess.run(l_ps); s_ps = sess.run(s_ps); sigma = np.array(sess.run(sigma), dtype = 'float32')

	kern_grid = np.linspace(0, segm_resolution, segm_len)

	theo_kern = np.sum(K(kern_grid.reshape(1, 1, -1), l_ps.reshape(p_ps.shape + (1, )),
		p_ps.reshape(p_ps.shape + (1, )), s_ps.reshape(p_ps.shape + (1, ))), axis = 1) + sigma**2 + 1e-30

	def sim(row):
		# synthesize one block from its covariance sequence, pinned to zero at both ends
		C = toeplitz(row)
		C += np.identity(segm_len)*1e-10
		C = condition_at_zero(C)
		C += np.identity(segm_len - 2)*1e-10
		C = np.linalg.cholesky(C)
		z = np.random.normal(size = segm_len - 2)
		return np.hstack([0.0, (C @ z).reshape(-1), 0.0])

Synthesize.

	audio = np.hstack([sim(theo_kern[i, :]) for i in tqdm(range(segm_n))])
	audio_to_write = np.array(life_scale * audio * envelope[:len(audio)], dtype = "int16")
	wav.write("life_synth.wav", life_sr, np.repeat(audio_to_write, 2).reshape(-1, 2))
