The main objective of this post was just to write about my typical workflow and views rather than come up with a great model. The structure of this data is also outside my immediate domain so I thought it’d be fun to write up a small diary on making a model with it.

Objective: Predict the probability of skipping a track given a song and the user journey.

Before we dive in, here are a couple of interesting reads from the Spotify blog. This article is a good read on deep learning and collaborative filtering at Spotify. This article is an interesting one about the ML pipeline that’s geared towards the problems Spotify faces. Finally, this article shows one of Spotify’s data discovery systems.

Data Management & Platform

The data that I’m using is the Spotify Streaming Sessions Dataset.

Before I got into the task and modelling, I needed to figure out how to manage the data.

I’m running an Ubuntu Server (18.04) on a PC that I built two years ago. I’ve got 12gigs of RAM and around 30gigs of space that I can dedicate to this project on an SSD. At the moment, the file systems I’m using are ext4 and exfat (I’ve heard about using OpenZFS for postgres and the Hadoop HDFS but I’m not going to try these out here because I don’t know much about them). If this was for production, I’d think a lot more about data redundancy, etc.

The data is around 50GB and sits as a collection of CSV files in a tar.gz. Initially, I considered using a postgres database and I copied over two CSVs (using R’s DBI::dbWriteTable), but the table size (as seen in /var/lib/postgresql/12/main/base/) was larger than the CSV, so I’d go way over budget in terms of storage if I were to use postgres as is.

Instead, I opted for Spark. Spark/Sparklyr (an R API for Spark) supports lazy-reading of compressed CSVs and commands are optimized even when working on a single node. If I use tensorflow/keras for this project, I’ll be able to write a generator using pandas for batch processing, so this setup should be sufficient for exploration and modeling (Spark also supports GLMs and some other stats/ML models).

Memory issues shouldn’t be a problem as long as queries are written in a smart way - that said I’ll still use a sample of the sessions data (around 2% of it) for exploration because it’s just easier to work with. I also had to get the right version of the Java SDK working before Spark worked out. I’ll use Spark and the full data later after exploration.

(Click to Expand) R Code for Data Check

config = spark_config()
config$`` <- '15G'
config$spark.executor.memory <- '15G'
config$spark.memory.fraction <- 0.9

data_path = '/media/training_set/zipped/'
java_path = '/usr/lib/jvm/java-8-openjdk-amd64/'
spark_path = '/home/aditya/spark/spark-2.4.3-bin-hadoop2.7/'

Sys.setenv(JAVA_HOME = java_path)

sc = spark_connect(master = 'local', spark_home = spark_path)
data = spark_read_csv(sc, path = data_path, memory = F)

skips_by_song = data %>%
	group_by(track_id_clean, not_skipped) %>%
	summarize(count = n()) %>%

I’ll also detail some bash commands that I normally use. Commands that I used in this section:

  • ssh -Y aditya@local_ip: SSH with X11 forwarding
  • psql: PostgreSQL Terminal
  • sudo (-u): To establish dominance (as another user)
  • sudo mount /dev/sda7 /media: Mount external SSD to /media
  • sudo fdisk -l: Disk information
  • ls -lha: List all, in a human friendly way
  • cd -: Go back to previous directory
  • apropos: Help with command, list of similar commands
  • sudo chmod -R 777 media/: CAUTION advised, make all files read/write-able by everyone
  • tar -xvf and tar -tvf file.tar.gz > file_list: Extract file, list files in tar archive
  • find ~ -maxdepth 5 -name x*: Find file containing ‘x’

Initial Exploration

The tracks dataframe contains an id, duration, release year, popularity estimate, auditory features (e.g. acousticness, danceability) and some kind of latent representation (acoustic_vector). All these are numeric.

The sessions dataframe is described in the dataset paper so I won’t repeat it here, but it’s basically a bunch of sessions with multiple tracks with information about whether or not they were skipped.

Read data

tracks <- dir('track_features', full.names = T) %>%
	lapply(fread) %>% rbindlist

sessions <- dir('training_set', full.names = T) %>%
	lapply(fread) %>% rbindlist

setnames(sessions, 'track_id_clean', 'track_id')
sessions[, date := as.Date(date)]
sessions <- merge(sessions, tracks, all.x = T, by = 'track_id')

Below, I’ve rounded track duration and plotted average acousticness scores by track duration.

Example scatterplot code
plot_frame <- tracks[, .(mean_acousticness = mean(acousticness)),
					 by = .(duration = round(duration * 2)/2)]

p <- ggplot(plot_frame)
p <- p + aes(x = duration, y = mean_acousticness)
p <- p + geom_point(alpha = 0.1)
p <- p + xlim(NA, 1000)
p <- p + ylim(0.25, 0.75)
p <- p + labs(x = 'Duration', y = 'Mean Acousticness')

Most plots I normally look at are scatterplots, histograms, autocorrelation estimates and spectra. Filtering, group-bys and aggregating usually gets you a long way to understanding distributions, conditional distributions and first and second order effects (mean dependences and correlations).


A note about scatterplots because I love them so much. Typically, the kind of effect I’m looking for on a scatterplot is a graph dependence that looks like (2D) or (3D).

What this means is that, I’m looking for the effect of one variable (like time, age - something that’s usually given, which is not as such stochastic) on another that is stochastic. As an example, this can leads us to see what the function may possibly look like.

A graph based on data is like an estimator - there’s an underlying function that you want to discover and you see a lot of noise on top of this. In the example above - we see mean acoustic scores by duration. For longer durations, there’s fewer data points, so we see heteroskedasticity. The underlying function though, it’s makes sense for it to be continuous (maybe even smooth).

Be very careful when plotting stochastic variables on each other - I personally feel very uncomfortable seeing plots of , where these are time series. This is because, we usually only see one observation of the time series - and with autocorrelation and non-stationarity, one can see spurious stuff. Try plotting two stationary (and long-lengthscale) GPs against each other; half the time you’d see positive correlations and half the time negative. Irl, we only see one observation of the processes so we can’t tell if the correlation we see would be there after a million alternate samples. If the processes are not stationary, the plot becomes hopeless.

In the plot above, we’re dealing with an undirected graph that looks (I’m guessing). Moreover, we’ve got many independent observations for both variables. Mean acousticness as a function of duration seems more prominent though.

There are times where plots of stochastic variables on each other are super helpful. Below is an example of a plot I made (using plotly, colored by time) with the LANL data (I can’t remember what the rv.s were, but it was something like ). You can clearly see a deterministic relationship.

Below, is a list of histograms corresponding to variables in the tracks dataframe.

An interesting thing about this data is that we’ve got four million independent observations of a bunch of acoustic features of different songs. Given the independence, we can try to figure out the graph (CPDAG) that might be behind the variables.

I used an R implementation of the PC algorithm to work out a possible graph that might’ve generated this data. There is an assumption here about multivariate normality, so I log-transformed some of the fatter-tailed variables before working out the correlation matrix. I used a networkD3 script from here to make a nice viz of the CPDAG although it really needs to be refined.

PC alg code for CPDAG

track_samples = copy(tracks)
track_samples[, track_id := NULL]
track_samples[, mode := NULL]
track_samples[, duration := log(duration + 1e-10)]
track_samples[, release_year := log(release_year + 1e-10)]
track_samples[, acousticness := log(acousticness + 1e-10)]
track_samples[, dyn_range_mean := log(dyn_range_mean + 1e-10)]
track_samples[, flatness := log(flatness + 1e-10)]
track_samples[, instrumentalness := log(instrumentalness + 1e-10)]
track_samples[, liveness := log(liveness + 1e-10)]
track_samples[, speechiness := log(speechiness + 1e-10)]
node_names = colnames(track_samples)
correl_mat = cor(track_samples)

graph = pc(suffStat = list(C = correl_mat, n = nrow(track_samples)),
		   indepTest = gaussCItest, alpha=1e-10, labels = node_names)

adjm = wgtMatrix(getGraph(graph), transpose = FALSE)
gD = igraph:::graph.adjacency(adjm)

There are ton of interesting insights here, I won’t go into them, but as an example, acousticness is a parent to dyn_range_mean, energy, organism, acoustic_vector_1, acoustic_vector_3 and acoustic_vector_7, but depends on nothing. It’s quite weird that duration is a parent of release_year though!

There are a bunch of nonlinearities in the relationships between these variables and not everything is gaussian (most variables aren’t) so this graph is approximate, but is interesting nonetheless.

Now, we move onto the sessions data.

Sessions exploration code
unique_sessions <- sessions[, unique(session_id)]
popular_songs <- sessions[, .N, by = track_id][order(-N)]
popular_songs <- popular_songs[1:10000, track_id]

# acoustic_songs <- tracks[track_id %in% popular_songs][order(-acousticness), track_id]

# Transitions between songs

# had memory issues here, garbage collection - gc() was used often
trans_mat <- copy(sessions[track_id %in% popular_songs,
				.(session_id, session_position, track_id)])
setorder(trans_mat, 'session_id', 'session_position')
trans_mat[, pos_next := shift(session_position, type = 'lead'), by = session_id]
trans_mat[, track_next := shift(track_id, type = 'lead'), by = session_id]
trans_mat <- trans_mat[pos_next - session_position == 1,
	.(track_id = factor(track_id, levels = popular_songs, ordered = T),
	  track_next = factor(track_next, levels = popular_songs, ordered = T)
trans_mat <- as.matrix(trans_mat[, table(track_id, track_next)])
dimnames(trans_mat) <- NULL

num_songs <- 50
image(x = 1:num_songs, y = 1:num_songs, log(trans_mat[1:num_songs, 1:num_songs] + 1),
	xlab = 'Popularity Rank of Current Song', ylab = 'Popularity Rank of Next Song',
	main = 'Log Transitions between Top 400 Songs')

# Seasonality


plot_frame <- function(variable) {
	var <- substitute(variable)
	return(sessions[, .(p = mean(not_skipped)),
		by = eval(var)][order(var)])

plots <- list()

plots[[1]] <- ggplot(plot_frame(wday(date, T))) +
	geom_point(aes(x = var, y = p)) +
	labs(x = 'Weekday', y = 'Prob of Skip')

plots[[2]] <- ggplot(plot_frame(mday(date))) +
	geom_line(aes(x = var, y = p), linetype = 2) +
	labs(x = 'Day of Month', y = 'Prob of Skip')

plots[[3]] <- ggplot(plot_frame(hour_of_day)) +
	geom_line(aes(x = var, y = p), linetype = 2) +
	labs(x = 'Hour of Day', y = 'Prob of Skip')

plots[[4]] <- ggplot(plot_frame(session_position)) +
	geom_line(aes(x = var, y = p), linetype = 2) +
	labs(x = 'Session Position', y = 'Prob of Skip')

gridExtra::grid.arrange(grobs = plots, nrow = 2)

# Memory in Chains

trans_df <- copy(sessions[, .(session_id, session_position, danceability, not_skipped)])
setorder(trans_df, 'session_id', 'session_position')
trans_df[, session_position := NULL]
trans_df[, dance_prv := shift(danceability, type = 'lag'), by = session_id]
trans_df <- trans_df[!]
trans_df <- trans_df[, .(p = mean(not_skipped), .N),
	by = .(dance_prv = round(dance_prv, 2), dance_now = round(danceability, 2))]

plotly::plot_ly(trans_df[N > 100],
	x = ~dance_prv, y = ~dance_now, z = ~log(prob_not_skip),
	marker = list(size = 1, color = 'purple'))

p <- ggplot(trans_df[N > 100, .(dance_prv, dance_now,
	prob_not_skip = as.factor(round(prob_not_skip, 1)))])
p <- p + aes(x = dance_prv, y = dance_now)
p <- p + geom_tile(aes(fill = prob_not_skip))
p <- p + scale_fill_brewer(palette = 'Spectral')

We see that there are cut offs at round popularity figures, so perhaps a lot of transitions are happening due to Top 50 or similar lists. More popular songs seem to transition between themselves. There are also some interesting checkerboard-type patterns, where perhaps some songs feed between themselves a lot, maybe due to being similar in some way. There’s also the obvious mass along the diagonal - lots of songs play repeatedly. Making the same plot by acousticness doesn’t reveal anything very interesting.

Predictably, there’s seasonality in the skip series, and the acousticness of tracks played, and other variables. Some dates are weird and obviously wrong, I’ll ignore these in the future. The autocorrelation in the probability of skips by day of month series might be an apparent effect due to weekday and time of day effects.

Click here for a 3D plot demonstrating memory in the listening chains. This clearly shows a positive association between danceability of the current song and the preferred danceability of the next song. Furthermore, if the danceability of the current song is low, a low danceability is highly preferred in the next song, which makes sense. If the plot is too big, here’s a heatmap that achieves the same effect:

More soon



I used to do a fair bit of astrophotography in university - it’s harder to find good skies now living in the city. Here are some of my old pictures. I’ve kept making rookie mistakes (too much ISO, not much exposure time, using a slow lens, bad stacking, …), for that I apologize!

1 min read

Probabilistic PCA

I’ve been reading about PPCA, and this post summarizes my understanding of it. I took a lot of this from Pattern Recognition and Machine Learning by Bishop.

1 min read

Modelling with Spotify Data

The main objective of this post was just to write about my typical workflow and views rather than come up with a great model. The structure of this data is also outside my immediate domain so I thought it’d be fun to write up a small diary on making a model with it.

6 min read

Morphing with GPs

The main aim here was to morph space inside a square but such that the transformation preserves some kind of ordering of the points. I wanted to use it to generate some random graphs on a flat surface and introduce spatial deformation to make the graphs more interesting.

2 min read

SEIR Models

I had a go at a few SEIR models, this is a rough diary of the process.

4 min read

Speech Synthesis

The initial aim here was to model speech samples as realizations of a Gaussian process with some appropriate covariance function, by conditioning on the spectrogram. I fit a spectral mixture kernel to segments of audio data and concatenated the segments to obtain the full waveform. Partway into writing efficient sampling code (generating waveforms using the Gaussian process state space representation), I realized that it’s actually quite easy to obtain waveforms if you’ve already got a spectrogram.

5 min read

Efficient Gaussian Process Computation

I’ll try to give examples of efficient gaussian process computation here, like the vec trick (Kronecker product trick), efficient toeliptz and circulant matrix computations, RTS smoothing and Kalman filtering using state space representations, and so on.

4 min read
Back to Top ↑


Gaussian Process Middle C

First of my experiments on audio modeling using Gaussian processes. Here, I construct a GP that, when sampled, plays middle c the way a grand piano would.

1 min read
Back to Top ↑


Back to Top ↑