In financial applications one frequently needs to draw samples according to an assumed distribution. This could be because one wants to simulate stock prices for a Monte Carlo simulation, to price an option payoff, or to generate risk scenarios. Applications also arise in the context of machine learning and particle filters.

To illustrate this, let’s look at a simple example. Say we assume a 30% drop with a chance of 20%, a 20% gain with a chance of 30%, and no change with a chance of 50%.

Programming this in ‘R’, the scenario can be expressed as two vectors:

returns <- c(-0.3, 0.0, 0.2)
probabilities <- c(0.2, 0.5, 0.3)

Next we normalize the probabilities so they sum to 1 and work out the cumulative distribution, which in this example is the vector cw = [0.2, 0.7, 1.0].

probabilities <- probabilities / sum(probabilities)
cw <- cumsum(probabilities)  # cumulative distribution

Finally, to generate Nsimulations = 100,000 samples, we draw a uniform random number between 0 and 1 using the function ‘runif’, and use ‘which’ to look up the return at the first index where the cumulative distribution cw exceeds the drawn number.

Nsimulations <- 100000
simulated_returns <- sapply(runif(Nsimulations, 0, 1),
                            function(x) returns[min(which(cw - x > 0))])

We can check the output by plotting a histogram to verify that the frequencies of the simulated returns match our targeted probabilities, and by comparing the mean and variance of the simulation to their theoretical expectations.

hist(simulated_returns)
theoretical_mean <- sum(returns * probabilities)
theoretical_var <- sum(returns * returns * probabilities) - theoretical_mean^2
simulated_mean <- mean(simulated_returns)
simulated_var <- var(simulated_returns)
sprintf("simulated vs. theoretical")
sprintf("mean: %2.5f %2.5f", simulated_mean, theoretical_mean)
sprintf("var: %2.5f %2.5f", simulated_var, theoretical_var)

simulated vs. theoretical
mean: 0.00022 0.00000
var: 0.02983 0.03000
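Beyond eyeballing the histogram, we can also tabulate the empirical frequencies directly and set them against the targeted probabilities — a minimal sketch using base R’s ‘table’, repeating the setup from above so the snippet runs on its own:

```r
returns <- c(-0.3, 0.0, 0.2)
probabilities <- c(0.2, 0.5, 0.3)
cw <- cumsum(probabilities)
Nsimulations <- 100000
simulated_returns <- sapply(runif(Nsimulations, 0, 1),
                            function(x) returns[min(which(cw - x > 0))])

# relative frequency of each simulated return vs. its target probability
observed <- table(simulated_returns) / Nsimulations
print(rbind(observed = as.numeric(observed), target = probabilities))
```

With 100,000 draws each observed frequency should sit within a few tenths of a percent of its target.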

An interesting alternative is the implementation below, which follows the resampling wheel from Sebastian Thrun‘s stochastic particle filter. It also effectively samples off the cumulative distribution, but differently: the first algorithm draws uniform numbers over the full range between 0 and 1 and looks up the corresponding return anywhere in the distribution function. The algorithm below instead picks a random starting bucket and then advances by a random step of at most twice the largest probability bucket to pick the next sample from that neighborhood. For very large data sets the advantage is that less time is spent seeking over the full distribution function to pick each sample.

Nsimulations <- 100000
returns <- c(-0.3, 0.0, 0.2)
probabilities <- c(0.2, 0.5, 0.3)
probabilities <- probabilities / sum(probabilities)
simulated_returns <- numeric(Nsimulations)
max_probability <- max(probabilities)
L <- length(probabilities)
i <- floor(runif(1, 1, L + 1))  # random starting bucket
b <- 0
for (j in 1:Nsimulations) {
  b <- b + runif(1, 0, 2 * max_probability)  # step by at most twice the largest bucket
  while (probabilities[i] < b) {
    b <- b - probabilities[i]
    i <- i %% L + 1  # advance to the next bucket, wrapping around
  }
  simulated_returns[j] <- returns[i]
}
hist(simulated_returns)

Using this algorithm we get similar results:

simulated vs. theoretical
mean: -0.00006 0.00000
var: 0.03006 0.03000
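For reference, base R’s built-in ‘sample’ function performs this kind of categorical draw directly via its prob argument, which makes for a handy cross-check of both implementations above — a sketch, not a replacement for the algorithms discussed:

```r
returns <- c(-0.3, 0.0, 0.2)
probabilities <- c(0.2, 0.5, 0.3)
Nsimulations <- 100000

# one-line equivalent of the inverse-CDF lookup
simulated_returns <- sample(returns, Nsimulations,
                            replace = TRUE, prob = probabilities)
print(c(mean = mean(simulated_returns), var = var(simulated_returns)))
```

The reported mean and variance should again land close to the theoretical 0.0 and 0.03.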