Celebrating PI Day with String Art and D3

This week local artisians have been busy with string art projects to celebrate 3/14/2018, which was an inspiration to create a string art maker application that would help to explore shapes and allow to generate design templates.

String art is closely related to Bezier curves that are widely used in computer graphics and animation. Curves and shapes emerge out of a construction process that involves connecting a set of pairwise nodes by lines. In the example here we consider nodes to be placed along the perimeter of a circle and numbered from 0 to N-1. Following a simple construction rule, such as connecting the i-th node with the node given by 21*i interesting patterns emerge.


The application depict below allows for easy exploration of construction expressions and patterns. Push the blue CALC button in the app below to create the graphics. To open the app in the browser use the https://string-circle.firebaseapp.com link.


The code below demonstrates the use of D3 to create the string art graphics. It creates N nodes 0≤i<N that are paired with a node j by a function fj(i) = 21*i+N/2, where j is mapped back to the range 0≤j<N. The graph makes use of the d3.transition() function to create an animation effect in order to better visualize the construction process.

Unfortunately the wordpress blog does not seem to allow active javascript postings, for a demonstration of the javascript code in action see the dublicated posting as an active   observablehq notebook.

let N=300
let T=3000
let height = 500
let width = 500

let fj = (i)=>{ return (((21*i+N/2)%N+N)%N); }

let fx = (i) => { return (Math.cos(2*Math.PI*i/N) * (width/2)+width/2); }
let fy = (i) => { return (-Math.sin(2*Math.PI*i/N) * (height/2)+height/2); }
let svg = DOM.svg(width,height);
let nodes = Array(N).fill().map((_,i)=>i)
let colors = [“cyan”,”pink”,”magenta”,”blue”]
let icolor = 0;
function selectColor(i){
icolor = (icolor+1)%colors.length
return colors[icolor]

let g = d3.select(svg).append(“g”)
.attr(“x1”,(d)=> fx(d) )
.attr(“y1”,(d)=> fy(d) )
.attr(“x2”,(d)=> fx(fj(d)) )
.attr(“y2”,(d)=> fy(fj(d)) )
.attr(“stroke-width”, 1)
.on(“start”, function repeat() {
.attr(“stroke”, (i)=>{return selectColor(i);})
.duration((i)=> {return(50+(T-50)*(N-i)/N)})
.on(“start”, repeat)

Posted in Uncategorized | Leave a comment

Fun with R and graphs on the dawn of 2014

There is some secret message hidden in this graph


Let’s decode it


# Can you traverse the graph of letters to decode the secret message that decrypt's as
# hashcode(message) = 2055263931398168510404

# Function to convert a string to a hash number 
hashcode <- function(s){ 
  sapply(s, function(si){sum(as.numeric(charToRaw(si))*31^( (1:nchar(si))-1)) } )}

# Define a list of vertices and neighbors
# For example from P one can reach P and Y
edges <- list("W"=c(" "),
              "Y"=c(" ","E"),
              "R"=c(" "),
              " "=c("N","H","Y"),

# reformat edgelist into a matrix of paired vertices that is expected by the igraph graphing package
el <-        (do.call(rbind, 
              (Reduce(append,lapply(names(edges), function(e) 
              { lapply(edges[[e]], function(ei){ matrix(c(e,ei),nrow=1)})})) ) ))

# create graph, for plotting only
g <- graph.edgelist(el)
# color edges
E(g)$color <- "grey"

# Paths are represented by strings, for example going from "Y" to "F" and "W" would be "YFW"
# Let's start traversing the path from three starting nodes H, N and Y
paths <- c("H","N","Y")

# Function that looks up the current node given a path
# for example currnode("YFW") returns "W"
currnode <- function(s){substring(s,nchar(s),nchar(s))}

# Starting from the initial points, traverse graph by appending the next vertices 
# according to the edge list until a path length of 13 steps
# For example currnode("PYF") returns "F"  
# edges[["F"]] returns "A" and "W"
# which creates paths "PYFA" and "PYFW"

# Run for 13 steps
while(nchar(paths[[1]])<=13) {
  paths <- Reduce(append,lapply(paths,function(p){ lapply(
    edges[[currnode(p)]] , function(nxt){paste0(p,nxt)} ) } ))

# Check if we found the 'secret message' ?
HNY <- which(hashcode(paths)==2055263931398168510404)
# Print the decoded message
msg <- paths[[HNY]]
#print all paths

# color the solution path 
# (use <<- to assign the color within mapply)
path <- unlist(strsplit(msg,""))
  E(g, path=c(x,y))$color <<- col}, 
                 path[-1],heat.colors(length(path)-1) ))

Resulting in


Wishing everyone a Happy New Year 2014, cheers!

Posted in Uncategorized | Tagged , , | 1 Comment

Portfolio Trading

In finance and investing the term portfolio refers to the collection of assets one owns. Compared to just holding a single asset at a time a portfolio has a number of potential benefits. A universe of asset holdings within the portfolio gives a greater access to potentially favorable trends across markets. At the same time the expected risk and return of the overall holding is subject to its specific composition.

A previous post on this blog illustrated a portfolio selection methodology that screens a universe of possible assets. The object there was to find a particular combination that reduces the expected risk-return ratio (sharp-ratio). The assumption there was an underlying buy-and-hold strategy, that once a particular portfolio is selected its composition is hold static until the final position unwound.

This post concerns with dynamic portfolio trading strategies where the portfolio is periodically rebalanced. At prescribed intervals the composition of the portfolio is changed by selling out existing holdings and buying new assets. Typically it is assumed that reallocation does not change the overall portfolio value, with the possible exception of losses due to transaction costs and fees. Another classification concerning this type of strategies is if borrowings are allowed (long-short portfolio) or if all relative portfolio weights are restricted to be positive (long-only portfolio).

Following closely recent publications by Steven Hoi and Bin Li we first set the stage by illustrating fundamental portfolio mechanics and then detailing their passive aggressive mean reversion strategy. We test drive the model in R and test it with 1/2011-11/2012 NYSE and NASDAQ daily stock data.

Portfolio Trading Terminology

Let’s say we want to invest a notional amount of say \bf P(t=0) = \$ 1,000 into a portfolio and record each day the total portfolio value \bf P(t). The portfolio is composed of members with a per share dollar price of \bf S_i(t). Assuming we hold \bf \Delta_i(t) shares of each component, then the relative portfolio weight for each member is

\bf b_i(t) = \frac{\Delta^{\$}_i(t)}{P(t)} = \frac{\Delta_i(t) S_i(t)}{P(t)}.

The dollar delta \bf \Delta^{\$}_i(t) = \Delta_i(t) S_i(t) expresses the value of each stock holding. Since the aggregated portfolio value is the sum over its components \bf P(t) = \sum_i \Delta_i(t) S_i(t) , the relative portfolio weights are always normalized to \bf \sum_i b_i(t) = 1

Buy and Hold

An instructive example is to look at ideal perfectly antithetic pair of artificial price paths. Both assumes assets alternate between a 100% and 50% return. A static portfolio would not experience any growth after a full round trip is completed. The column \bf x shows the individual relative open to close or close to next day’s close returns.

\bf Day \bf S_1 \atop S_2 \bf x_1 \atop x_2 \bf b_1 \atop b_2 \bf \Delta_1 \atop \Delta_2 \bf \Delta^{\$}_1 \atop \Delta^{\$}_2 \bf P
\bf 1-open \bf \$100 \atop \$100 \bf  \atop \bf 0.5 \atop 0.5 \bf 5 \atop 5 \bf \$ 500 \atop \$ 500 \bf \$ 1,000
\bf 1-close \bf \$ 200 \atop \$ 50 \bf 2 \atop 0.5 \bf 0.8 \atop 0.2 \bf 5 \atop 5 \bf \$ 1,000 \atop \$ 250 \bf \$ 1,250
\bf 2-open \bf \$ 200 \atop \$ 50 \bf  \atop  \bf 0.8 \atop 0.2 \bf 5 \atop 5 \bf \$ 1,000 \atop \$ 250 \bf \$ 1,250
\bf 2-close \bf \$ 100 \atop \$ 100 \bf 0.5 \atop 2 \bf 0.5 \atop 0.5 \bf 5 \atop 5 \bf \$ 500 \atop \$ 500 \bf \$ 1,000

Dollar Cost Averaging

Dollar cost averaging is a classic strategy that takes advantage of periodic mean reversion. After each close rebalance the portfolio by distributing equal dollar amounts between each component. After the rebalance trades each delta dollar position \bf \Delta^{\$}(t) in the portfolio is the same at the open for each component. Rebalancing itself does not change the total portfolio value. Expressed in terms of the relative portfolio weights \bf b dollar cost averaging is the allocation strategy to rebalance the weight back to the uniform value of the inverse number of components \bf N inside the portfolio.

Dollar cost averaging:
\bf b_i = 1/N

\bf Day \bf S_1 \atop S_2 \bf x_1 \atop x_2 \bf b_1 \atop b_2 \bf \Delta_1 \atop \Delta_2 \bf \Delta^{\$}_1 \atop \Delta^{\$}_2 \bf P
\bf 1-open \bf \$100 \atop \$100 \bf  \atop  \bf 0.5 \atop 0.5 \bf 5 \atop 5 \bf \$ 500 \atop \$ 500 \bf \$ 1,000
\bf 1-close \bf \$ 200 \atop \$ 50 \bf 2 \atop 0.5 \bf 0.8 \atop 0.2 \bf 5 \atop 5 \bf \$ 1,000 \atop \$ 250 \bf \$ 1,250
\bf 2-open \bf \$ 200 \atop \$ 50 \bf \atop \bf 0.5 \atop 0.5 \bf 3.125 \atop 12.5 \bf \$ 650 \atop \$ 650 \bf \$ 1,250
\bf 2-close \bf \$ 100 \atop \$ 100 \bf 0.5 \atop 2 \bf 0.2 \atop 0.8 \bf 3.125 \atop 12.5 \bf \$ 312.5 \atop \$ 1,250 \bf \$ 1562.5

Perfect Prediction

With the benefit of perfect foresight its is possible to allocate all weights into the single asset with highest predicted gain.

\bf Day \bf S_1 \atop S_2 \bf x_1 \atop x_2 \bf b_1 \atop b_2 \bf \Delta_1 \atop \Delta_2 \bf \Delta^{\$}_1 \atop \Delta^{\$}_2 \bf P
\bf 1-open \bf \$100 \atop \$100 \bf  \atop  \bf 1  \atop 0 \bf 10 \atop 0 \bf \$ 1,000 \atop \$ 0 \bf \$ 1,000
\bf 1-close \bf \$ 200 \atop \$ 50 \bf 2 \atop 0.5 \bf 1 \atop 0 \bf 10 \atop 0 \bf \$ 2,000 \atop \$ 0 \bf \$ 2,000
\bf 2-open \bf \$ 200 \atop \$ 50 \bf \atop \bf 0 \atop 1 \bf 0 \atop 40 \bf \$ 0 \atop \$ 2,000 \bf \$ 2,000
\bf 2-close \bf \$ 100 \atop \$ 100 \bf 0.5 \atop 2 \bf 0 \atop 1 \bf 0 \atop 40 \bf \$ 0 \atop \$ 4,000 \bf \$ 4,000

Mean Reversion Strategy

The one period portfolio return is given by the product of relative weights and component relative returns as

\bf L(b,x) = \sum b_i x_i

The task of the trading strategy is to determine the rebalancing weights \bf b^{open}(t) for the opening position of day \bf t given yesterdays relative returns at the close \bf x^{close}(t-1) .
A forecast of a positive trend would be that today’s returns are a repeat of yesterdays \bf x^{close}(t) = x^{close}(t-1). In contrast mean reversion would assume that today’s returns are the inverse of yesterdays \bf x_i^{close}(t) = 1 / x_i^{close}(t-1).

The following strategy consists of determining the weights \bf b such that a positive trend assumption would lead to a loss \bf L(b^{open}(t),x^{close}(t-1)) = L_0 with threshold parameter \bf L_0 \leq 1.

Considering returns \bf \Delta x = x-L_0 with

\bf L(b,x) = \sum b_i x_i = L_0
\bf = L_0 \sum_i b_i = \sum b_i (x_i - L_0)
\bf = \sum b_i \Delta x_i = 0

Typically the relative performance vector \bf \Delta x consists of a mix of growing and retracting assets. Then we have a balanced number of positive and negative components within the \bf \Delta x vector. The condition that the product \bf \sum b_i \Delta x_i = 0 chooses an weight vector that is perpendicular to \bf \Delta x and lies on the unit simplex \bf \sum b_i =1 .

Now assume a mean reverting process that consists of a rotation of performances within the asset vector. Graphically the corresponds to a reflection at a \bf 45 degree axis. After rotation the angle between the the weight vector and the new performance vectors reduces. This results in an increase of the portfolio return value.

Thus setting for example \bf L_0 = 1 generates a flat portfolio in case the performance vector does not change while it increases the portfolio value in the case of an actual mean reversion.

The strategy implicitly assumes a price process that rotates between outperforming and lagging components within the portfolio. This does not include the situation where all stocks are out or underperforming at the same time. The long-only portfolio strategy relies on the ability to be able to reallocate funds between winners and losers.

For the two share example we have

\bf b_1 = \frac{L_0-x_2}{x_1-x_2}
\bf b_2 = \frac{L_0-x_1}{x_2-x_1}

using \bf b_1+b_2=1 and \bf b_1 x_1 + b_2 x_2 = L_0

The table below shows the result of computing the portfolio weights at the open of day 2 according to above formulas.

\bf Day \bf S_1 \atop S_2 \bf x_1 \atop x_2 \bf b_1 \atop b_2 \bf \Delta_1 \atop \Delta_2 \bf \Delta^{\$}_1 \atop \Delta^{\$}_2 \bf P
\bf 1-open \bf \$100 \atop \$100 \bf  \atop  \bf 0.5 \atop 0.5 \bf 5 \atop 5 \bf \$ 500 \atop \$ 500 \bf \$ 1,000
\bf 1-close \bf \$ 200 \atop \$ 50 \bf 2 \atop 0.5 \bf 0.8 \atop 0.2 \bf 5 \atop 5 \bf \$ 1,000 \atop \$ 250 \bf \$ 1,250
\bf 2-open \bf \$ 200 \atop \$ 50 \bf \atop \bf 0.33 \atop 0.66 \bf 2.08 \atop 16.67 \bf \$ 416.67 \atop \$ 833.33 \bf \$ 1,250
\bf 2-close \bf \$ 100 \atop \$ 100 \bf 0.5 \atop 2 \bf 0.11 \atop 0.89 \bf 2.08 \atop 16.67 \bf \$ 208.33 \atop \$ 1,666,67 \bf \$ 1,875

With more than two assets the loss condition together with the normalization is not sufficient to uniquely determine the portfolio weights. The additional degrees of freedoms can be used to allow the posing of additional criteria one wishes the strategy to observe. A good choice seems to be requiring that the new portfolio weights are close to the previous selection, as this should minimize rebalance transaction costs.

Minimizing the squared distance between the weights under the normalization and loss constraint leads to the following expression for the new weight

\bf b(t) = b(t-1) - \tau(t-1) ( x(t-1) - {\bar x}(t-1) )

as a function of the previous portfolio weights \bf b(t-1) , the return vector \bf x(t-1) and the average return across the portfolio components \bf {\bar x}(t-1) = \frac{1}{n}\sum_i x_i(t-1) .

The control gain \tau is taken as the ratio of the access loss over the return variance

\bf \tau(t-1) = \frac{ max(0,b(t-1)x(t-1)-L_0 }{ var(x(t-1)) }


This post contains the R code to test the mean reversion strategy. Before running this the function block in the appendix need to be initialized. The following code runs the artificial pair of stocks as a demonstration of the fundamental strategy

# create the antithetic pair of stocks and publish it into the environment
# generate sequence of dates
times <- seq(as.POSIXct(strptime('2011-01-1 16:00:00', '%Y-%m-%d %H:%M:%S')),
                      as.POSIXct(strptime('2011-12-1 16:00:00', '%Y-%m-%d %H:%M:%S')),
                      by="1 months")

# generate and store dummy price series = 200,100,200,100,200 ...
prices.xts <- xts(rep(c(200,100),length(times)/2), order.by=as.POSIXct(times))
stk <- 'STK1'
colnames(prices.xts) <- paste(stk,'.Adjusted',sep="")
assign(x=stk, value=prices.xts, envir=global.env )

# generate and store dummy price series = 100,200,100,200,100 ...
prices.xts <- xts(rep(c(100,200),length(times)/2), order.by=as.POSIXct(times))
stk <- 'STK2'
colnames(prices.xts) <- paste(stk,'.Adjusted',sep="")
assign(x=stk, value=prices.xts, envir=global.env )

# run the strategy and create a plot
strategy(portfolio= c("STK1","STK2"),threshold=.9,doplot=TRUE,title="Artificial Portfolio Example")

As a benchmark I like always to testing against a portfolio of DJI member stocks using recent 2011 history. The strategy does not do very well for this selection of stocks and recent time frame.

As an enhancement screening stocks by ranking according to their most negative single one-day auto-correlation. The 5 best ranked candidates out of the DJI picked this way are “TRV”,XOM”,”CVX”,PG” and “JPM” .
Rerunning the strategy on the sub portfolio of these five stocks gives a somewhat better performance.

Again, the following code is use to load DJI data from Yahoo and run the strategy

portfolio.DJI <- getIndexComposition('^DJI')
strategy(portfolio.DJI,.9,TRUE,"2011-01-01::2012-11-01",title="Portfolio DJI, Jan-2011 - Nov 2012")
strategy(c("TRV",XOM","CVX",PG","JPM"),.9,TRUE,"2011-01-01::2012-11-01",title="Portfolio 5 out of DJI, Jan-2011 - Nov 2012")

Next a screening a universe of about 5000 NYSE and NASDAQ stocks for those with the best individual auto-correlation within the 2011 time frame. Then running the strategy for 2011 and out of sample in Jan-Nov 2012, with either the 5 or the 100 best names. The constituent portfolio consists of “UBFO”,”TVE”,”GLDC”,”ARI”,”CTBI”.


We test drove the passive aggressive trading strategy on recent daily price data. While not doing very well on a mainstream ticker like DJI, pre-screening the stock universe according to a auto-correlation did improve results, even when running the strategy out of sample for the purpose of cross validation. Other extensions, like including a cash asset or allowing for short positions require further investigation.

The strategy has interesting elements, such as minimizing the chance in the portfolio weights and the use of a loss function. It does only directly take the immediate one-period lag into account. For use in other time domains, like high frequency, one could consider having a collection of competing trading agents with a range of time lags that filter out the dominate mean reverting mode.

There are a verity of approaches in the literature that I like to demonstrate as well in the framework of this blog, stay posted.

Appendix: R functions


# global market data environment
global.env <- new.env()

# search anti correlated stocks
# also return daily variance and maximum one day variance
portfolio.screen <- function(portfolio, daterange="")
   ac <- matrix()
   for(stk in portfolio)
        p.xts <- Ad(get(stk,global.env)[daterange])
        dr.xts <- dailyReturn(p.xts)
        dr.maxvar <- max( (dr.xts-mean(dr.xts) )^2)
        dr.var <- coredata(var(dr.xts))
        dr.var.ratio <- dr.maxvar/(dr.var)
        dr.ac <-  coredata( acf(dr.xts)$acf[1:5] )
        dr.ac.res <- c(stk, dr.var.ratio, dr.maxvar, dr.var, dr.ac)

          ac <- dr.ac.res
          ac <- rbind(ac, dr.ac.res)
# sort by decreasing auto-correlations
ac_sort <- (ac[order(as.numeric(ac[,6]),decreasing='FALSE'),])
# return sorted matrix of stocks and correlations

# portfolio weight vector b
# price relative return vector x
# threshold parameter e, usually e <= 1 for

# returns intrinsic value of call on portfolio return with strike e
strategy.loss <- function(b, x, e)

strategy.gearing <- function(b, x, e)
    loss <- strategy.loss(b,x,e)
    varx <- var(x)
    if(varx>0) { -loss/varx }
    else { 0}

# return a vector w so that w is close to v
# by minimizing the distance (w-v)^2
# with the condition that w is on the simplex
# sum(w)=1
strategy.simplex <- function(v,z)
    # initialize projection to zero
    w <- rep(0, length(v) )
    # Sort v into u in descending order

    idx <- order(v,decreasing=TRUE)
    u <- v[idx]

    #index of number of non zero elements in w
    rho <-max(index(u)[index(u)*u  >= cumsum(u) -z])
       theta <-(sum(u[1:rho])-z)/rho
       w[idx[1:rho]] <- v[idx[1:rho]] - theta;

strategy.rebalance <- function(b,x,e)
    dbdx <- strategy.gearing <- strategy.gearing(b, x, e)
    b_new <- b + dbdx*(x-mean(x))
    b_new <- strategy.simplex(b_new,1)

# Simulate the portfolio trading strategy
# Arguments:
# portfolio : portfolio of stocks f
# threshold: mean reversion loss parameter ( should be in the order but smaller than 1)
# doplot: create an optional plot
# daterange: range of dates to test
# returns the accumulated gain of the strategy
strategy <- function(portfolio, threshold=1, doplot=FALSE,daterange="",title="" )
   portfolio.prices.xts <- xts()
   # raw portfolio components price time series
   for( stk in portfolio)
       portfolio.prices.xts <- cbind(portfolio.prices.xts, Ad(get(stk,envir = global.env)[daterange]))
   # downfill missing closing prices (last observation carried forward)

   portfolio.prices.xts <- na.locf(portfolio.prices.xts)
    # not backfill missing history backwards, for example for IPO assets that did
    # not exist, we set these to the constant initial 'IPO' price , resembling a cash asset
    portfolio.prices.xts <- na.locf(portfolio.prices.xts,fromLast=TRUE)
    times <- index(portfolio.prices.xts) 
    nprices <- NROW(portfolio.prices.xts)

     # relative prices S(t)/S(t-1)
     portfolio.price.relatives.xts <- (portfolio.prices.xts/lag(portfolio.prices.xts,1))[2:nprices,]

     # initialize portfolio weights time series
     portfolio.weights.xts <- xts(matrix(0,ncol=length(portfolio),nrow=nprices-1),

     colnames(portfolio.weights.xts) = portfolio

     # initialize strategy with equal weights
     portfolio.weights.xts[1,] = rep(1/length(portfolio),length(portfolio))

     # run strategy:
     for( i in 1:(nprices-2))
        #portfolio weights at opening of the day
        b_open <- portfolio.weights.xts[i,]

        #price relative return at close x = S(t)/S(t-1)
        x_close <- portfolio.price.relatives.xts[i,]

       # compute end of day rebalancing portfolio with strategy
       b_new <-  strategy.rebalance(as.vector(b_open),as.vector(x_close),threshold)

       # assign portfolio composition for next day opening
       portfolio.weights.xts[i+1,] <- b_new

   #aggregate portfolio returns

    portfolio.total.relatives.xts <- xts( rowSums(portfolio.weights.xts * portfolio.price.relatives.xts), order.by=as.POSIXct(times[2:nprices]))
     portfolio.cum.return.xts <- cumprod( portfolio.total.relatives.xts)

    # cummulative wealth gained by portfolio

      Times <- times

      Performance=c(1,coredata(portfolio.cum.return.xts)), pane='Portfolio')
      myplot <- ggplot()+theme_bw()+
      facet_grid (pane ~ ., scale='free_y')+
      labs(title=title) +
      ylab("Cumulative Performance ") +
      geom_line(data = df,  aes(x = Date, y = Performance), color="blue")

      for(i in 1:length(portfolio))
        df1<-data.frame(Date=Times, Performance=c(1,coredata(cumprod(portfolio.price.relatives.xts[,i] ) ) ),pane='Components')
        names(df1)[2] <- "Performance"
        myplot <- myplot +
        geom_line(data = df1, aes_string(x = "Date", y = "Performance"),color=i)

        df3<-data.frame(Date=Times, Performance=c(1,coredata(cumprod(portfolio.price.relatives.xts[,i] ) ) ), pane='Components')
         names(df1)[2] <- "Performance"
         myplot <- myplot +
         geom_line(data = df1, aes_string(x = "Date", y = "Performance"),color=i)

          Weights=coredata(portfolio.weights.xts[,i]) , pane='Weights')

          names(df2)[2] <- "Weights"

           myplot <- myplot + geom_line(data = df2, aes_string(x = "Date", y = "Weights"),color=i)
} # end of plotting

# return performance
Posted in Uncategorized | Tagged , , , , , , , | 6 Comments

Portfolio Selection and Optimization

Let’s look at the task of selecting a portfolio of stocks that optimize a particular measure of performance. In the structured product setting one might want to compose a portfolio to be used as a reference index for a derivative, with the objective that the index needs a specifically high or low correlation structure, since the overall correlation affects the volatility of the final product. Similarly an individual investor might like to select a number of stocks in order to compose a portfolio that historically had a low volatility and a high growth rate. Yet another objective could be to find portfolio with a specific correlation behavior with respect to a particular target.

To show a particular example, let’s try to select a portfolio composed out of four assets out of a larger universe of around 5,000 NYSE and NASDAQ names that would have the highest possible sharp ratio in the year of 2011. The sharp ratio measures the ratio of average daily returns over its standard derivations and is a widely used performance measure.

The attached R code runs the problem against a small universe of DJI stocks, in order not to require uploading of a large number of market data. For this exercise I don’t correct the stock growth for the risk-free rate and consider long only stock combinations.

For this task one must first be able to calculate the relative weights for a given set of four stocks that optimize the sharp ratio. In the example code the optimizer solnp from the R package Rsolnp is used, instead of a specialized function from a financial package, in order to retain the flexibility of optimizing other non standard measures.

There are a number of approaches that could be used for stock selection. The brute force method would be to iterate over all possible combinations, which quickly becomes a very large problem to solve with limited computational resources. For example to check all four name portfolios out of a universe of N stocks would require N*(N-1)*(N-2)*(N-3)/24 combinations.

On the other end on can use a bootstrapping method, by first selecting the stock with the single best sharp ratio. Finding this stock requires N computations. Next checking all two stock portfolios that are composed out of the best stock and all possible selections from the universe adds N-1 calculations. Repeating this sequence of adding just single names at a time leads first to a three name and finally to a four name portfolio after a total of 4*N-6 steps. Having arrived at an already half-way optimized four name portfolio one can then try to improve it further by rotating out individual names against the universe of stocks.

A systematic variation to the try and error bootstrapping leads to the tree based search method illustrated in the following.

The picture below for the universe of 30 DJI stocks starts off at the bottom node with the portfolio the stocks “JPM”, “HPQ”, “AA”, BAC” with a low performance sharp ratio of -0.42.  From this the tree search generates four sub portfolio combinations, (“HPQ”, “AA”, BAC”), (“JPM”,”AA”, BAC”),(“JPM”, “HPQ”, BAC”) and (“JPM”, “HPQ”, “AA”) by removing one member each. Then it linear-searches among the universe of stocks for the best name to add to each sub portfolio in order to arrive at a new set of four-name portfolios.  The one with the best sharp ratio is then selected. In the graph below this was the combination (“AA”, “BAC”,”HPQ”,”MCD”). It becomes the root node of the new search tree branch. Eventually the tree search ends when no new combinations with higher sharp ratio can be found along the deepest search path.


The following shows a similar search tree, now using the full 5000 NYSE and NASDAQ stocks for that we got sufficient historical data without missing dates.


The graphs below depict the historical cumulative performance of the optimized portfolio against the performance of the initial portfolio. Note that the search among the 5000 stock universe yielded a growth portfolio with much smaller variance than just selecting among the 30 DJI stocks.

Further optimization approaches are Monte Carlo like tactics of selecting random starting points for multiple searches. Other extensions are genetic algorithms that create new trial portfolios out the genetic crossings of two parent portfolios. Overall the tree approach here seemed to be robust against varying the selection of the starting root portfolio and converges after few search nodes.

This post is mainly to demonstrate the optimization technique rather than to advertise a particular portfolio selection. In reality a portfolio that performed particularly well in one year is not guaranteed to have similar characteristics in the following year. A more complete study would include back-testing and cross-validation of the selected portfolios against data that was not used during optimization.

To run the tree search first create the following functions in R by copying and pasting

# load list of symbols via getSymbols()
# enclose fetching symbols within try clause in case of faulty/non-existing tickers
# disregard symbols without a complete history
# return list of successful loaded names
symbol.exists <- function(stk)
     tryCatch( typeof(get(stk,env=global.env))=="double", error = function(e) FALSE )
symbols.load <- function(indx.master)
  indx <- c() # list of loaded tickers
  for(stk in indx.master)
     cc <- 0
     while( symbol.exists(stk) == FALSE & cc < 2 )
       cc <- cc +1   
       tryCatch( getSymbols(stk, from="2011-01-03", to="2011-12-30",env=global.env), error = function(e) Sys.sleep(1) )
         indx<-c(indx,stk)  }}

# check master symbol list
# return clean list of symbols that are successfully loaded
symbols.clean <- function(indx.master)
  indx <- c()
  for(stk in indx.master){ if(symbol.exists(stk)==TRUE) {
            indx<-c(indx,stk) }}}

# compute list of sharp ratio for each name in the indx list
sharp.ratios <-function(indx)
 stks.dailyret.ts <- xts()
 cc <-1
 for(stk in indx){
   stk.ts <- Ad(get(stk,env=global.env)['2011-01-03::2011-12-30'])
   stk.cumret.ts <- stk.ts/coredata(stk.ts[1])[1]
   stk.dailyret.ts <- tail(stk.cumret.ts/lag(stk.cumret.ts,k=1)-1,-1)
   stks.dailyret.ts <- merge(stks.dailyret.ts, stk.dailyret.ts)
   stk.mean <- mean(stk.dailyret.ts)
   stk.var <- var(stk.dailyret.ts)
   stk.sharp <- sqrt(252)*stk.mean/sqrt(stk.var)
   results[cc,] <-  c(stk, stk.mean, sqrt(stk.var), stk.sharp)
  cc <- cc+1

# compute sharp ratio and optimum combination for names in stks
sharp.ratio <- function(stks)
  stks.dailyret.ts <- xts()
  for(stk in stks ){
   stk.ts <- Ad(get(stk,global.env)['2011-01-03::2011-12-30'])
   stk.cumret.ts <- stk.ts/coredata(stk.ts[1])[1]
   stk.dailyret.ts <- tail(stk.cumret.ts/lag(stk.cumret.ts,k=1)-1,-1)
   stks.dailyret.ts <- merge(stks.dailyret.ts, stk.dailyret.ts)
 dcount <- sqrt(252)
 V <- cov(stks.dailyret.ts)
  res <- solnp(
  rep(1/length(mu), length(mu)),
  function(w) - dcount*t(w) %*% mu / sqrt( t(w) %*% V %*% w ),
  eqfun = function(w) sum(w),
  eqB   = 1,
  LB = rep(0, length(mu))
  res$mu <-  t(res$pars) %*% mu
  res$std <-  sqrt( t(res$pars) %*% V %*% res$pars )
  res$sharp <- -tail( res$values,1)

sharp.simple.ratio <- function(stks)
 stks.cumret.ts <- xts()
 for(stk in stks ){
  stk.ts <- Ad(get(stk,global.env)['2011-01-03::2011-12-30'])
  stk.cumret.ts <- stk.ts/coredata(stk.ts[1])[1]
  stks.cumret.ts <- merge(stks.cumret.ts, stk.cumret.ts)
dcount <- sqrt(252)
res <- solnp(
  rep(1/length(stks), length(stks)),
  function(w) {
   stks.dailyret.ts <-  stks.cumret.ts %*% w
   n <- length(stks.dailyret.ts)
   stks.dailyret.ts <- stks.dailyret.ts[2:n]/stks.dailyret.ts[1:n-1]-1
   - dcount*  mean(stks.dailyret.ts)/  sqrt(var(stks.dailyret.ts))
  eqfun = function(w) sum(w),
  eqB   = 1,
   LB = rep(0, length(stks))
stks.dailyret.ts <-  stks.cumret.ts %*% res$par
n <- length(stks.dailyret.ts)
stks.dailyret.ts <- stks.dailyret.ts[2:n]/stks.dailyret.ts[1:n-1]-1
res$mu <- mean(stks.dailyret.ts)
res$std <-  sqrt(var(stks.dailyret.ts))
res$sharp <- -tail( res$values,1)

# scan universe for best stk to add to portfolio
portfolio.add <- function(portfolio, universe)
  # check home many of the portfolio stocks are also part of the universe
  noverlap <- sum(sapply(1:length(portfolio), function(i) any(portfolio[i]==universe)))
  # allocate result matrix
  cc <- 1
  for (i in 1:n)   if(any(portfolio==universe[i])==FALSE)
   res <- c(sprintf("%s",stks), sr$mu, sr$std, sr$sharp )
   results[cc,] = res
   cc <- cc+1
results.ordered <- results[order(as.numeric(results[,ncol(results)]),decreasing=TRUE),]
node.id <- function(portfolio)
  ratio <- sprintf("%2.2f",as.numeric(sharp.ratio(portfolio)$sharp))

portfolio.visited <- function(portfolio, visited)
  any(portfolio[1]==visited[,1] & portfolio[2]==visited[,2] & portfolio[3]==visited[,3] )
portfolio.visited4 <- function(portfolio, visited)
 any(portfolio[1]==visited[,1] & portfolio[2]==visited[,2] & portfolio[3]==visited[,3], portfolio[4]==visited[,4] )

and then execute the following

global.env <- new.env()
indx.master <- getIndexComposition('^DJI')
universe <- symbols.clean(indx.master)
portfolio <- c("JPM", "HPQ", "AA", "BAC")

mytree <- (read.tree(text = paste("(:1,:1,:1,",node.id(portfolio),":1):1;")))
mytree$node.label <- "ROOT"

visited <- c()
sharp.max <- -999999
search.results <- c()
cont <- TRUE
while(cont == TRUE)
    # create combinations with of subportfolios with 3 names
    portfolios <- list( c(portfolio[2],portfolio[3],portfolio[4]),
                         c(portfolio[1],portfolio[2],portfolio[3]) )
    # order each subportfolio alphabetical
    portfolios <- lapply(1:4, function(i) {
           portfolios[[i]] [ order( portfolios[[i]]     ) ]  }  )

i.portfolio.best <- 0
my.portfolio.best <- c()
mytree.txt <- "("
visited4 <- c()
for(i in 1:4)
         visited <-  rbind(visited,portfolios[[i]])
         my.results  <- portfolio.add(portfolios[[i]], universe)    
         my.results.best <- my.results[1,1:7]
         my.sharp <- my.results[1,7]
         search.results <- rbind( search.results , my.results.best)
         # check if we got this node already
         my.results.best.ordered <-  my.results.best[1:4] [ order( my.results.best[1:4]  ) ]
           visited4 <-  rbind(visited4, my.results.best.ordered)
           if(i>1) { mytree.txt <- paste(mytree.txt,",") }
           mytree.txt <- paste(mytree.txt, node.id(my.results.best[1:4]),":1")

           if(my.sharp > sharp.max){
              sharp.max <- my.sharp
              i.portfolio.best <- i
              my.portfolio.best <- my.results[1,1:4]}

mytree.txt <- paste(mytree.txt,"):1;")
if( i.portfolio.best > 0 )
    mytree.sub <-  read.tree(text = mytree.txt)
    mytree.sub$node.label <- node.id( portfolio)
    mytree <- bind.tree(mytree, mytree.sub,
    where=which(mytree$tip.label == node.id(portfolio) ), position=0)


    portfolio <- my.portfolio.best

if(i.portfolio.best == 0 )
    cont <- FALSE

mytree.drop <- drop.tip(mytree,1:2)
Posted in Uncategorized | Tagged , , , | Leave a comment

Picturing Trees



In this post I like to illustrate the R package “ape” for phylogenetic trees for the purpose of assembling trees. The function read.tree creates a tree from a text description. For example the following code creates and displays two elementary trees:

tree.top <- (read.tree(text = "(TIP1:1,TIP2:2,TIP3:3):1;"))
tree.top$node.label <- "ROOT"

tree.child <- (read.tree(text = "(TIP4:1,TIP5:2):1;"))
tree.child$node.label <- "CHILD"



The required syntax for the read.tree text input is ( TIP_NAME : TIP_BRANCH_LENGTH , … ) : ROOT_BRANCH_LENGTH;

Note that for better illustration in the example above the length are chosen differently for the branches that lead to each leaf or tip.

Next the function bind.tree is used to attach the child tree to the root tree. The argument where  determines the position within the parent root tree where the child branch should be attached. In the first example below the child branch is attached to the branch of the ‘TIP1″ leaf. In this process the receiving leaf label is replaced by the new branch. In order to retain the receiving leaf label, I set the node.label attribute of the incoming child tree to its name.


tree.child$node.label <- "TIP1"
tree.combined <- bind.tree(tree.top, tree.child, where=which(tree.top$tip.label == tree.child$node.label ), position=0)

tree.child$node.label <- "TIP2"
tree.combined <- bind.tree(tree.top, tree.child, where=which(tree.top$tip.label == tree.child$node.label ), position=0)

tree.child$node.label <- "TIP3"
tree.combined <- bind.tree(tree.top, tree.child, where=which(tree.top$tip.label == tree.child$node.label ),position=0)


This procedure of adding branches to leafs can be repeated to build more and more complex trees. Happy gardening !


Posted in Uncategorized | Tagged , , | Leave a comment

E-Learning to Machine Learn

A short post about online educational resources on machine learning. Perhaps it is a sign of increasing popularity of the field that there are now several courses on machine learning accessible online and for free. 

Out of Standford University comes Andrew Ng’s almost ‘legendary’ machine learning lecture presented on the Coursera platform. The class consists of video lectures, multiple choice style review quizzes and programming projects in Matlab. A discussion forum supplies student interaction that otherwise only a traditional university program offers. The class covers a good verity of supervised and unsupervised learning concepts, like neural nets, support vector machines, k-means and recommender systems.

Another online course is Berkeley’s CS188.1x Artificial Intelligence out of the EdX platform by Harvard and MIT. This course offers video lectures on topic like decision theory and reenforcement learning with programming assignments in Python.

A third offering in this direction is Sebasitan Thrun’s course on Artificial Intelligence presented at Udacity. Here the lectures are themed around applications to robotics on the example of Google’s driverless car . Udacity’s takes an interesting educational approach. It presents lectures in short sequences that are frequently interrupted by quick multiple choice style review questions. The purpose is to keep the students attention focused on the lecture and provide the feel of real teacher interaction. Similar to the EdX course programming projects are done in Python. Here the student can paste his code directly into the Udacity browser window for instant feedback, without actually having to run Python on the local machine.

Another platform to watch is Kahn Academy that originally started at the elementary to high school education but appears to be destined to expand into the university and professional level.

With the cost of learning at traditional universities exploding, at least in some parts of the world, it is fascinating to observe the formation of free or low cost online educational platforms. Possible revenue models for these platforms could be that lectures are going to be offered for free but a payment is requested for tests and certifications. There are already professional designations like the chartered financial analyst (CFA) or the FRM for risk management that are based on certifications outside the traditional university education setting. Interestingly a lot of development in the e-learning area comes out of the traditional schools, as a successful online program is certainly a great advertisement and recruitment tool by itself.

Posted in Uncategorized | Tagged , , , , , , | 1 Comment

Learning Kernels SVM

Machine Learning and Kernels

A common application of machine learning (ML) is the learning and classification of a set of raw data features by a ML algorithm or technique. In this context a ML kernel acts to the ML algorithm like sunshades, a telescope or a magnifying glass to the observing eye of a student learner. A good kernel filters the raw data and presents its features to the machine in a way that makes the learning task as simple as possible.

Historically a lot of progress in machine learning has been made in the development sophisticated learning algorithms, however selecting appropriate kernels remains a largely manual and time consuming task.

This post is inspired by a presentation by Prof. Mehryar Mohri about learning kernels. It reflects on the importance of kernels in support vector machines (SVM).

A total of three examples are presented. A linear kernel is shown to solve the first example but fails for the second task. There a square kernel is successful. Then a third example is presented for that both linear and square kernels are not sufficient. There a successful kernel can be generated out of a mixture of both base kernels. This illustrates that kernels can be generated out of bases, resulting in products that are more powerful in solving the task at hand than each individual components.

Support Vector Machines

Consider a support vector machine (SVM) for a classification task. Given a set of pairs of feature data-point vectors x and classifier labels y={-1,1}, the task of the SVM algorithm is to learn to group features x by classifiers. After training on a known data set the SVM machine is intended to correctly predict the class y of an previously unseen feature vector x.

Applications in quantitative finance of support vector machines include for example predictive tasks, where x consists of features derived from a historical stock indicator time series and y is a sell or buy signal. Another example could be that x consist of counts of key-words within a text such as an news announcements and y categorizes it again according on its impact to market movements. Outside of finance a text based SVM could be used to filter e-mail to be forwarded to either the inbox or the spam folder.

Linear Kernel

As indicated above the SVM works by grouping feature points according to its classifiers.
For illustration in the toy example below two dimensional feature vectors x={x1,x2} are generated in such a way that the class y=-1 points (triangles) are nicely separated from the class y=1 (circles).

The SVM algorithm finds the largest possible linear margin that separates these two regions. The marginal  separators rest on the outpost points that are right on the front line of their respective regions. These points, marked as two bold triangles and one bold circle in the picture below, are named the ‘support vectors’ as they are supporting the separation boundary lines. In fact the SVM learning task fully consists of determining these support vector points and the margin distance that separates the regions. After training all other non-support points are not used for prediction.


In linear feature space the support vectors add to an overall hypothesis vector h, h = \sum c_i x_i, such that the classification frontiers are given by the lines h x + b = 1 and h x + b = -1 centered around h x + b = 0.

The code below utilizes the ksvm implementation in the R package ‘kernlab’, making use of “Jean-Philippe Vert’s” tutorials for graphing the classification separation lines.


kfunction <- function(linear =0, quadratic=0)
  k <- function (x,y)
     linear*sum((x)*(y)) + quadratic*sum((x^2)*(y^2))
  class(k) <- "kernel"

n = 25
a1 = rnorm(n)
a2 = 1 - a1 + 2* runif(n)
b1 = rnorm(n)
b2 = -1 - b1 - 2*runif(n)
x = rbind(matrix(cbind(a1,a2),,2),matrix(cbind(b1,b2),,2))
y <- matrix(c(rep(1,n),rep(-1,n)))
svp <- ksvm(x,y,type="C-svc",C = 100, kernel=kfunction(1,0),scaled=c())
plot(c(min(x[,1]), max(x[,1])),c(min(x[,2]), max(x[,2])),type='n',xlab='x1',ylab='x2')
title(main='Linear Separable Features')
ymat <- ymatrix(svp)
points(x[-SVindex(svp),1], x[-SVindex(svp),2], pch = ifelse(ymat[-SVindex(svp)] < 0, 2, 1))
points(x[SVindex(svp),1], x[SVindex(svp),2], pch = ifelse(ymat[SVindex(svp)] < 0, 17, 16))
# Extract w and b from the model   
w <- colSums(coef(svp)[[1]] * x[SVindex(svp),])
b <- b(svp)
# Draw the lines

Quadratic Kernel

The following example illustrates a case where the feature points are non-linear separated. Points of the class y=1 (circles below) are placed in a inner region surrounded from all sides by points of class y=-1, again depicted as triangles. In this example there is no single straight (linear) line that can separate both regions. However here it is still possible to find such a separator by transforming the points x=\{x_1, x_2\}  from feature space to a quadratic kernel space with points given by the corresponding square coordinates \{x_1^2, x_2^2\}.

The technique of transforming from feature space into a measure that allows for a linear separation can be formalized in terms of kernels. Assuming  \Phi() is a vector coordinate transformation function. For example a squared coordinate space would be \{ \Phi(x)_1 , \Phi(x)_2 \} = \{ x_1^2 , x_2^2 \}. The SVM separation task is now acting on in the transformed space to find the support vectors that generate

h \Phi(x) + b = \pm 1

for the hypothesis vector h = \sum c_i \Phi(x_i) given by the sum over support vector points x_i. Putting both expressions together we get

\sum c_i K(x_i,x) + b = \pm 1

with the scalar kernel function K(x_i,x) = \Phi(x_i) \Phi(x). The kernel is composed out of the scalar product between a support vector x_i and another feature vector point x in the transformed space.

In practice the SVM algorithm can be fully expressed in terms of kernels without having to actually specify the feature space transformation. Popular kernels are for example higher powers of the linear scalar product (polynomial kernel). Another example is a probability weighed distance between two points (Gaussian kernel).

Implementing a two dimensional quadratic kernel function allows the SVM algorithm to find support vectors and correctly separate the regions. Below the graph below illustrates that the non-linear regions are linearly separated after transforming to the squared kernel space.


The “R” implementation makes use of ksvm’s flexibility to allow for custom kernel functions. The function ‘kfunction’ returns a linear scalar product kernel for parameters (1,0) and a quadratic kernel function for parameters (0,1).

kfunction <- function(linear =0, quadratic=0)
  k <- function (x,y)
     linear*sum((x)*(y)) + quadratic*sum((x^2)*(y^2))
  class(k) <- "kernel"

n = 20
r = runif(n)
a = 2*pi*runif(n)
a1 = r*sin(a)
a2 = r*cos(a)
r = 2+runif(n)
a = 2*pi*runif(n)
b1 = r*sin(a)
b2 = r*cos(a)
x = rbind(matrix(cbind(a1,a2),,2),matrix(cbind(b1,b2),,2))
y <- matrix(c(rep(1,n),rep(-1,n)))
svp <- ksvm(x,y,type="C-svc",C = 100, kernel=kfunction(0,1),scaled=c())
plot(c(min(x[,1]), max(x[,1])),c(min(x[,2]), max(x[,2])),type='n',xlab='x1',ylab='x2')
title(main='Feature Space')
ymat <- ymatrix(svp)
points(x[-SVindex(svp),1], x[-SVindex(svp),2], pch = ifelse(ymat[-SVindex(svp)] < 0, 2, 1))
points(x[SVindex(svp),1], x[SVindex(svp),2], pch = ifelse(ymat[SVindex(svp)] < 0, 17, 16))
# Extract w and b from the model   
w2 <- colSums(coef(svp)[[1]] * x[SVindex(svp),]^2)
b <- b(svp)

x1 = seq(min(x[,1]),max(x[,1]),0.01)
x2 = seq(min(x[,2]),max(x[,2]),0.01)

points(-sqrt((b-w2[1]*x2^2)/w2[2]), x2, pch = 16 , cex = .1 )
points(sqrt((b-w2[1]*x2^2)/w2[2]), x2, pch = 16 , cex = .1 )
points(x1, sqrt((b-w2[2]*x1^2)/w2[1]), pch = 16 , cex = .1 )
points(x1,  -sqrt((b-w2[2]*x1^2)/w2[1]), pch = 16, cex = .1 )

points(-sqrt((1+ b-w2[1]*x2^2)/w2[2]) , x2, pch = 16 , cex = .1 )
points( sqrt((1 + b-w2[1]*x2^2)/w2[2]) , x2,  pch = 16 , cex = .1 )
points( x1 , sqrt(( 1 + b -w2[2]*x1^2)/w2[1]), pch = 16 , cex = .1 )
points( x1 , -sqrt(( 1 + b -w2[2]*x1^2)/w2[1]), pch = 16, cex = .1 )

points(-sqrt((-1+ b-w2[1]*x2^2)/w2[2]) , x2, pch = 16 , cex = .1 )
points( sqrt((-1 + b-w2[1]*x2^2)/w2[2]) , x2,  pch = 16 , cex = .1 )
points( x1 , sqrt(( -1 + b -w2[2]*x1^2)/w2[1]), pch = 16 , cex = .1 )
points( x1 , -sqrt(( -1 + b -w2[2]*x1^2)/w2[1]), pch = 16, cex = .1 )

xsq <- x^2
svp <- ksvm(xsq,y,type="C-svc",C = 100, kernel=kfunction(1,0),scaled=c())

plot(c(min(xsq[,1]), max(xsq[,1])),c(min(xsq[,2]), max(xsq[,2])),type='n',xlab='x1^2',ylab='x2^2')
title(main='Quadratic Kernel Space')
ymat <- ymatrix(svp)
points(xsq[-SVindex(svp),1], xsq[-SVindex(svp),2], pch = ifelse(ymat[-SVindex(svp)] < 0, 2, 1))
points(xsq[SVindex(svp),1], xsq[SVindex(svp),2], pch = ifelse(ymat[SVindex(svp)] < 0, 17, 16))
# Extract w and b from the model   
w <- colSums(coef(svp)[[1]] * xsq[SVindex(svp),])
b <- b(svp)
# Draw the lines

Alignment and Kernel Mixture

The final exploratory  feature data set consists again of two classes of points within two dimensional space. This time two distinct regions of points are separated by a parabolic boundary, where vector points of class y=1 (circles) are below and y=-1 (triangles) are above the separating curve. The example is selected for its property that neither the linear nor the quadratic kernels alone are able to resolve the SVM classification problem.

The second graph below illustrates that feature points of both classes are scattered onto overlapping regions in the quadratic kernel space. It indicates that for this case the sole utilization of the quadratic kernel is not enough to resolve the classification problem.

In “Algorithms for Learning Kernels Based on Centered Alignment” Corinna Cortes, Mehryar Mohri and Afshin Rostamizadeh compose mixed kernels out of base kernel functions. This is perhaps similar to how a vector can be composed out of its coordinate base vectors or a function can be assembled in functional Hilbert space.

In our example we form a new mixed kernel out of the linear and the quadratic kernels K_1 and K_2

K_M(x,z) = \rho_1 K_1(x,z) + \rho_2 K_2(x,z)

The graph below demonstrate that the mixed kernel successfully solves the classification problem even thought each individual base kernels are insufficient on its own. In experimentation the actual values of the mixture weights \rho_1, \rho_2 are not critical. Following the study above we choose the weights according to how much each individual base kernel on its individual aligned with the raw classification matrix Y = y^T y composed out of the classifier vector y.

Alignment is based on the observation that a perfectly selected kernel matrix K_{i,j} = K(x_i,x_j) would trivially solve the classification problem if its elements are equal to the Y matrix. In that case choosing an arbitrary vector x_1 as support the the kernel K(x_1, x_i) would give 1 if x_1 and x_i are in the same category and -1 otherwise.

Similar to the concept of correlation, kernel alignment between two kernels is defined by the expectation over the feature space, resulting in matrix products

\rho ( K^c_1 , K^c_2 ) =  \frac{ < K^c_1 , K^c_2  > }{ \sqrt{ < K^c_1 , K^c_1  > < K^c_2 , K^c_2  > } }

with the product $latex = Tr[A^T B]$. The expectation is take assuming centered n \times n kernel matrices according to K^c = (I-(11^T)/n)K(I-(11^T)/n). (For details see reference above.)


Kernel matrix centering and alignment functions are implemented in the ‘R’ as following


n = 100
a1 = 4*runif(n)-2
a2 = 1 - a1^2 - 0.5 - 2*runif(n)
b1 = 4*runif(n)-2
b2 = 1 - b1^2 + 0.5 + 2*runif(n)

x = rbind(matrix(cbind(a1,a2),,2),matrix(cbind(b1,b2),,2))
y <- matrix(c(rep(1,n),rep(-1,n)))

kfunction <- function(linear, quadratic)
  k <- function (x,y)
     linear*sum((x)*(y)) + quadratic*sum((x^2)*(y^2))
  class(k) <- "kernel"

center_kernel <- function(kernel)
m <- length(kernel[,1])
ones <- matrix(rep(1,m),m)
(diag(m) - ones %*% t(ones) / m)*kernel*(diag(m) - ones %*% t(ones) / m)
f_product<- function(x,y)
f_norm<- function(x)
kernel_alignment <- function(x,y)
x_kernel1 <- kernelMatrix(kfunction(1,0),x)
y_kernel   <- y %*% t(y)
x_kernel2 <- kernelMatrix(kfunction(0,1),x)

x_kernel1_c <- center_kernel(x_kernel1)
x_kernel2_c <- center_kernel(x_kernel2)

y_kernel_c <- center_kernel(y_kernel)

alignment1 <- kernel_alignment(x_kernel1_c,y_kernel_c)
alignment2 <- kernel_alignment(x_kernel2_c,y_kernel_c)

x_kernel3 <- kernelMatrix(kfunction(alignment1, alignment2),x)
x_kernel3_c <- center_kernel(x_kernel3)
alignment3 <- kernel_alignment(x_kernel3_c,y_kernel_c)

svp <- ksvm(x,y,type="C-svc",C = 100, kernel=kfunction(alignment1,alignment2), scaled=c())
plot(c(min(x[]), max(x[])), c(min(x[]), max(x[])),type='n',xlab='x1',ylab='x2')
title(main='Parabolic Featureset (Mixed Kernel SVM)')
ymat <- ymatrix(svp)
points(x[-SVindex(svp),1], x[-SVindex(svp),2], pch = ifelse(ymat[-SVindex(svp)] < 0, 2, 1))
points(x[SVindex(svp),1], x[SVindex(svp),2], pch = ifelse(ymat[SVindex(svp)] < 0, 17, 16))
# Extract w and b from the model   
w <- colSums(coef(svp)[[1]] * (alignment1*x[SVindex(svp),]))
v <- colSums(coef(svp)[[1]] * (alignment2*x[SVindex(svp),]^2))
b <- b(svp)

x1 = seq(min(x[]),max(x[]),0.01)
x2 = seq(min(x[]),max(x[]),0.01)

points( sqrt( (b-w[2]*x2-v[2]*x2^2)/v[1] + (w[1]/(2*v[1]))^2 ) - w[1]/(2*v[1]) , x2, pch = 16, cex = .1 )
points( -sqrt( (b-w[2]*x2-v[2]*x2^2)/v[1] + (w[1]/(2*v[1]))^2 ) - w[1]/(2*v[1]) , x2, pch = 16, cex = .1 )
points( x1, sqrt( (b-w[1]*x1-v[1]*x1^2)/v[2] + (w[2]/(2*v[2]))^2 ) - w[2]/(2*v[2]) , pch = 16, cex = .1 )
points( x1, -sqrt( (b-w[1]*x1-v[1]*x1^2)/v[2] + (w[2]/(2*v[2]))^2 ) - w[2]/(2*v[2]) , pch = 16, cex = .1 )

points( sqrt( (1+ b-w[2]*x2-v[2]*x2^2)/v[1] + (w[1]/(2*v[1]))^2 ) - w[1]/(2*v[1]) , x2, pch = 16, cex = .1 )
points( -sqrt( (1 + b-w[2]*x2-v[2]*x2^2)/v[1] + (w[1]/(2*v[1]))^2 ) - w[1]/(2*v[1]) , x2, pch = 16, cex = .1 )
points( x1, sqrt( (1+b-w[1]*x1-v[1]*x1^2)/v[2] + (w[2]/(2*v[2]))^2 ) - w[2]/(2*v[2]) , pch = 16, cex = .1 )
points( x1, -sqrt( (1+b-w[1]*x1-v[1]*x1^2)/v[2] + (w[2]/(2*v[2]))^2 ) - w[2]/(2*v[2]) , pch = 16, cex = .1 )

points( sqrt( (-1+ b-w[2]*x2-v[2]*x2^2)/v[1] + (w[1]/(2*v[1]))^2 ) - w[1]/(2*v[1]) , x2, pch = 16, cex = .1 )
points( -sqrt( (-1 + b-w[2]*x2-v[2]*x2^2)/v[1] + (w[1]/(2*v[1]))^2 ) - w[1]/(2*v[1]) , x2, pch = 16, cex = .1 )
points( x1, sqrt( (-1+b-w[1]*x1-v[1]*x1^2)/v[2] + (w[2]/(2*v[2]))^2 ) - w[2]/(2*v[2]) , pch = 16, cex = .1 )
points( x1, -sqrt( (-1+b-w[1]*x1-v[1]*x1^2)/v[2] + (w[2]/(2*v[2]))^2 ) - w[2]/(2*v[2]) , pch = 16, cex = .1 )

title(main='Not Linear Separable in Quadratic Kernel Coordinates')

In conclusion this post attempts to illuminate the importance role of kernel selection in machine learning and demonstrates the use of kernel mixture techniques. Kernel weights were chosen somewhat ad-hoc by alignments to the target classifiers, thus preceding the actual SVM learning phase. More sophisticated algorithms and techniques that automate and combine these steps are a topic of current ongoing research.

Posted in Uncategorized | Tagged , , , , , , | 3 Comments

Bid-Ask Spreads, Conic Finance and the MINVAR Function

Conic Finance

This post recaps the interesting concept of conic finance in the context of incorporating bid-ask spreads onto the idealized models of markets that are originally neutral to risks. It draws from original papers on this topic by Dilip Madan, Alexander Cherny, Ernst Eberlein, Juan Jose Vicente Alvarez, Peter Carr and others found at An Introduction to Conic Finance and its Applications.

Ideas from this area are illustrated with an elementary example of an asset that follows a binomial process. At first the bid-ask spread is evaluated to be zero under a risk-neutral market assumption. Then the example illustrates how a conic market distortion can create a non-zero spread.

Complete Risk Neutral Market

A fundamental concept in financial theory is the assumption of a complete market and the existence of a risk neutral measure. A risk neutral market is indifferent to either buying or selling at any size since all cashflows can be hedged perfectly and no residual risk is left.

Let’s for example consider being long an asset X in a market with a risk neutral probability for a one period move of (1/3)  to $120 and (2/3) to $90.

The risk neutral expectation holding this position is

E[X_{long}] = 1/3*\$ 120 +2/3*\$90 = \$100

The corresponding short position in X represents itself as

with expected outcome of

E[X_{short}] = 2/3*(-\$ 90) +1/3*(-\$ 120) = -\$ 100

A risk neutral market does not have any preference between being long or short. It will pay (bid) $100 for owning the asset and charge (ask) $100 for assuming the liability.

The transition probabilities imply cumulative probabilities given by the following step functions

P_{long}(X \le 120) = 1
P_{long}(X < 120) = 2/3
P_{long}(X \le 90) = 2/3
P_{long}(X < 90 ) = 0
P_{short}(X \le -90) = 1
P_{short}(X < -90) = 1/3
P_{short}(X \le -120) = 1/3
P_{short}(X < -120) = 0

as depict in this graph

We also have
P_{long}(X \le x) = 1 - P_{short}(X \le -x)
P_{short}(X \le x) = 1 - P_{long}(X \le -x)

Bidding and Asking

Generally real markets have a bid-ask spread. The ask price a is the price one must pay to buy an asset or a product from the market. The bid price b is the price one receives selling an asset to the market. Generally we observe a positive bid-ask spread

s = a-b \ge 0

In a market that depicts a non zero bid-ask spread long assets and liabilities (short assets) are valued differently. Unwinding a position of a long asset entails selling the asset to the market for the bid price, thus the bid price reflects the value of a long position. Unwinding liabilities, or a position in a short asset, requires buying the asset from the market for paying the asking price, which reflects the liability of a short position.

The MINVAR Distortion Function

As seen above the risk neutral market distribution does imply a zero bid-ask spread. In order to introduce a difference in valuation between long and short positions the distribution is distorted away from the risk neutral mean. The result of the distortion is that greater weights are given towards losses for the long position and larger liabilities for the short position. The leads to a lower expected value of the long position, thus a lower bid price, and a greater expected liability for the short position, thus a higher ask price.

At this point we regard the selection of the distortion function as a model input. In this framework the distortion function maps from risk neutral probabilities in the range from to 1 to ‘real’ market probabilities also in the range from 0 to 1, necessarily mapping the points 0 and 1 onto itself.

One interesting choice in the possible class of distortion function is the ‘minvar’ function \psi

\psi_n(y) = 1 - (1-y)^{1+n}

as introduced by A. Cherny and D. Madan in New Measures for Performance Evaluation..

For n=0 the function is the trivial identity \psi_0(y) = y , for larger value of n it is a monotonic increasing concave function, with each larger order of n function dominating the previous orders.

As Cherny and Madan pointing out the minvar function has a nice interpretation. It reflects the probability distribution of the minimum over n+1 successive drawings. Concretely for a given probability function y(x)=P(X<x) , the minvar distortion \psi_n(y(x)) reflects the probability that the minimum of n+1 independent samples of X is below x. By construction, the term (1-y)^{(1+n)} is the probability that all n+1 samples are greater than x. That means the minimum of the samples is greater than x . The complementary probability that the minimum is smaller than x is then given by the difference to 1.

As a caveat the minvar function is only one possible choice for a distortion function and the discussion above linking it to the distribution of the minimum is not needed in the present context.

Distorted Cumulative Probabilities

Taking the MINVAR function for n=1 we have

\psi_1 (y) = 1 - (1-y)^2

In the following we use \psi_1 to directly distort the risk neutral probabilities from the section above. Alternatively one could first removed the mean price of $100, and applied the distortion to the net P&L process. However for the sake of better comparisons between the risk neutral and distorted probabilities and to clearly distinguish the long and short position the mean is left in. With this we get

\psi_1 (P_{long}(X \le 120)) = 1
\psi_1 (P_{long}(X < 120)) = 8/9
\psi_1 (P_{long}(X \le 90)) = 8/9
\psi_1 (P_{long}(X < 90 )) = 0
\psi_1 (P_{short}(X \le -90)) = 1
\psi_1 (P_{short}(X < -90)) = 5/9
\psi_1 (P_{short}(X \le -120)) = 5/9
\psi_1 (P_{short}(X < -120)) = 0

Bid Price: Distorted Asset

The price process for the long asset becomes

with the expectation

E_{\psi_1}[X_{long}] = 1/9* \$ 120+8/9 * \$ 90 = \$ 93.33

Bidding prices up to the calculated amount contain all possible trades with expected positive net cash flows, or positive alpha trades. On the other hand competitive market forces maximize the bid price up to this acceptable limit.

The formal calculation of the bid price in the collection of reference papers under the aforementioned link is given as the expectation integral over the distorted asset distribution

b = \int_{-\infty}^{\infty} x d \psi ( F_{X} (x) )

Ask Price: Distorted Liability

The distorted short position involves according to

with expectation

E_{\psi_1}[X_{short}] = 4/9*(- \$ 90)+5/9*(- \$120) = - \$ 106.66

Here the market is asking to be compensated for taking on the trade. Again any in absolute terms larger bid price contains the possible positive alpha positions the market is willing to accept while competitive forces limit the asking price to the calculated amount.

Consequently the asking price is given by the (negative) expectation integral over the distorted liability distribution

a = - \int_{-\infty}^{\infty} x d \psi ( F_{-X} (x) )

or in terms of the long asset distribution F_{X}(-x) = 1 - F_{-X}(x) , after a change of signs resulting in an ask price of

a = \int_{-\infty}^{\infty} x d \psi ( 1-F_{X} (x) )

In “Markets, Profits, Capital, Leverage and Return”, Carr, Madan and Alvarez further express bid and ask expectations in terms of the inverse x = G(u) of the distribution u = F_X(x) as

b = \int_{0}^{1} G(u) d \psi ( u )
a = \int_{0}^{1} G(u) d \psi ( 1-u )

They identity the capital reserve requirement for holding a position in X  with the size of the bid-ask spread, since it is the charge to be paid for unwinding in the presence of otherwise prefect market hedges. Similarly the difference between the mid and the risk-neutral price is the average profit that is disseminated between market participants by trading. Note that the simple binomial state example in this post is not rich enough to feature non-equal mid and risk neutral expected prices.

In summary we find the distorted expectation of the long asset to be lower than the (absolute) price of the short asset. In accordance with the discussion above we identify the bid-ask prices from the asset and the liability valuated on the distorted market. In our simple example the bid and ask prices are centered around the risk-neutral mid price.

Ask: $106.66 (liability)
Bid: $93.33 (asset)
Mid: $100
Spread: $13.33

This concludes this note on the concept of conic finance illustrated on an elementary example in the context of bid-ask spreads. The methodology has a rich area of application to markets such as expected profits, capital requirements, leveraging, hedging etc, which can be found under the reference link to An Introduction to Conic Finance and its Applications.

Posted in Uncategorized | Tagged , , | Leave a comment

Coin Flipping, Stopping Time, Monte Carlo Simulation and Importance Weighting

This post discusses a classic coin flipping puzzler and explores Monte Carlo simulation techniques.

Consider a game of two players taking turns flipping a coin. The first player that flips a head wins. The question is what is the probability of winning the game for each player, and what is the expected number of turns the game goes on until someone wins.

To make the problem more general we assume the coin is not necessary fair, meaning the probability of getting a head is not equal to the probability of getting a tail.

Needless to say this type of problem has applications beyond the simple coin setup, for example we might be interested at a stopping time, how long it takes for a stock to cross a certain price level, or looking to price a knock-out option.

One solution to this problem is to add up the probabilities of each scenario to calculate the overall probabilities. Another arguably the more elegant way is to use a symmetry to get to the answer quicker. Let’s do that first.

Stopping time

Assume the probability of a head to be p, and the game has been going on for a number of steps. We are at time step i, the game has not yet ended and we are about to flip the coin again. We like to compute the expected number of time steps for the game to end.

Let’s define E(t|i) as the expected number of additional time steps for the game to take, not counting the past turns, provided we already got to the time step i.

The expected number of additional steps is composed out of two terms:

  • First is the case of throwing a win (head). The probability of this is p. The game ends at that point. In this case just one step concludes the game. The contribution of this case is the product of probability p multiplied by a single step 1.
  • Second is the case of a tail with the probability (1-p) at the next step. The game does not stop there, but is expected to continue for a yet unknown number of E(t|i+1) steps from there on. The contribution of this case is the product of the probability (1-p) and the sum of the number 1 for the single step just taken and the expected number of future steps E(t|i+1).

E(t|i) = p*1 + (1-p)*(1+E(t|i+1))

In the limit of p=100%, a certain head, the expected number of steps is just one step. In the other limit p=0%, the game just goes on, and the expectation at time i is just the expectation at the next step plus one, with the limit of a never ending game and infinite stopping time.

The trick of solving the equation above is to recognize the translational or time symmetry in the game. The expectation E(t|i) being defined as the number of expected future steps does not actually depend on the time step i. As long as the game has not ended the chances for the outcome of the game is exactly the same for each player on his turn. With this we can set

E(t) = E(t|i) = E(i|i+1)

and find the solution of E(t) = p+(1-p)*(1+E(t)) to be

E(t) = 1/p

which is the expectation of a geometric distribution.

A game with a fair coin p=50% is expected to last an average of 2 steps. This average is composed out of the chance of the first player winning  at the first flip with a 50% chance, the first player loosing and the second player winning at the second flip with a 25% chance, and so on

2 = 1*50%+2*25%+3*12.5%+4*6.25% …

Winning probability

The next question is the probability for the first player to win the game.  Let’s start by looking at the probability or expectation E[loose] of a player to loose, assuming it is his or hers turn. It is compose out of the chance to not win at the current step by throwing a tail with probability (1-p), and for the opposing player to win on his turn at he next step:

E[loose|player-1, step 1]=(1-p)*E[win|player-2, step 2]

However, the sum of probabilities of loosing and winning following your turn turn adds up to one, so that

E[win|player-1, step 1] = 1-E[loose|player-1, step 1]

Again we use translational symmetry by equating the expectation of player-1 to win on step 1 to the expectation of player-2 to win the game on his turn on step 2, with the condition that the game has progressed to his turn:

E[win|player-2, step 2] = E[win|player-1, step 1]

In other words once its your turn and you have not lost, your chances of winning are just the same as for player-1 one day one. Every day is fresh in this game and brings a equal chance of winning!

With this we get

1-E[win] = (1-p)E[win]


E[win] = 1/(2-p)

In case of a fair coin with p=50% the chances of winning are about  67%. Again this can be expressed of a series of probabilities for winning at the first step with 50% chance, winning at the third step after successive losses by both players with a 12.5% chance, and so on

2/3 = 50%+12.5%+3.125%+…

Geometric series solution

This leads to yet another way to solve this puzzles by series expansion. The total expected stopping time is a sum of time steps multiplied by products of successive tails with probability q=1-p followed by a single head p

E(t) = 1p+2qp+3q^2p+4q^3p+...

E(t) = p(1+2q+3q^2+4q^3+...)

The above series has just the structure of a derivative in respect to q, so that we can write it as

E(t) = p \frac{d}{d q}(q+q^2+q^3+q^4+...)

Identifying a geometric sum by adding in the missing 1

E(t) = p \frac{d}{d q}(-1+ \frac{d}{d q} \sum_{i=0}^{\infty} q^i )

which has a well known formula

E(t) = p \frac{d}{d q}(-1+\frac{1}{1-q})

Taking the derivative leads to the result for the expected stopping time

E(t) = p \frac{1}{(1-q)^2} = 1/p

Similarly the can also compute the winning probability by summing up the probability series. The probability to win is the sum over odd time steps composed out a single head and even number of tails

E(win) = p + q^2p+q^4p+q^6p+...

Again identifying a geometric sum in terms of the squared probability of a tail, we get the result

E(win) = p \sum_{i=0}^{\infty}q^2i = p/(1-q^2)=1/(2-p)

Monte Carlo simulation and convergence

In Monte Carlo simulation one does attempt to compute expectations such as the stopping time or winning probability by summing over scenarios. It might seem redundant to apply simulations to this simple coin game since we already computed the solution in closed form, but one could imagine variations of the game that do not lead to an as simple theoretical solution. Maybe the payout to the winner is made dependent on the number of steps taken before the game ends. A financial example could be a structured equity product that stops paying coupons upon a reference index crossing a barrier. Here, the greater the stopping time the greater the total amount of coupons collected.

Let’s compare the case of an biased and fair coin, with the biased head probability of p=11/36, and the fair head probability p_0 = 1/2. The biased coin example is taken as a proxy to an equivalent bet of winning a dice game consisting of two dices with getting at least one six being the winning play. The following Matlab code generates a graph that depicts the expected stopping time for the fair and biased coin.

title("Theoretical and approximated stopping times vs. number of steps included in estimation.")
xlabel("Number of  steps in summation")
ylabel("Expected number of flips (stopping time)")
axis([1 25 0 3.5])

It compares the exact theoretical expectation to the estimated probability of including an increasing number of steps into the summation of probabilities using the series expansion computed above.

As the graph shows we need to consider about 10 steps in order to recover the maturity of the expectation for the fair coin, while the biased coin requires about 25 steps. The biased coin in our example has a smaller probability to yield a head and stop the game, thus requiring a greater number of steps to recover the total overall expectation.

The same kind of picture unveils for the expected probability for the first player to win the game. Using the following Matlab code

2*(1:13)-1,11/36*cumsum( (25/36).^(2*(1:13)-2)),
2*(1:13)-1,0.5*cumsum( (0.5).^(2*(1:13)-2))
title("Theoretical and approximate expectation of winning vs. number of steps included in estimation.")
xlabel("Number of  steps in summation")
ylabel("Probability of winning")
axis([1 25 0.3 0.7])

plots the theoretical and estimated expectations for the fair and biased coins

Simulating coin flips

Next we intend to estimate the stopping time and winning probabilities by Monte Carlo simulation. As straight forward approach would be to simulate individual coin flips with the appropriate probabilities for the fair or biased coin.

In the following example random imaginary coin flips are simulated by sampling a uniform distribution of a total of n_bins_total. Any sample above the number n_bins_head is counted towards a tail, otherwise they are counted towards a head. A choice of  n_bins_total=36 and n_bins_head=11 simulates the biased coin in our example, while n_bins_total=36 and n_bins_head=18 corresponds to the fair coin.

However the random Monte Carlo path generated by the choice of n_bins_head and n_bins_total does not need to follow the same distribution as the actual coin we intend to  simulate, provided we adjust the likelihood ratios accordingly. The simulation below computes expectations for the fair and the biased coins simultaneously along the same path, by importance weighting each outcome by the ratio of the fair or biased respective probabilities and the weights used to generate the path. For example if the Monte Carlo generation produces a head and we are in the process of computing the expectation over the biased coin, the resulting term is multiplied by the ratio of p_head_biased/(n_bins_head/n_bins_total)).

% number of simulations m
m = 100000;
n_bins_total = 36;
n_bins_head = 11;
w_head =n_bins_head/n_bins_total;
w_tail = 1-w_head;
p_head_biased = 11/36;
p_tail_biased = 1-p_head_biased;
p_head_fair = 0.5;
p_tail_fair = 1-p_head_fair;
stopping_time = zeros(1,m);
weights = zeros(1,m);
weights_fair = zeros(1,m);

for i = 1:m
    step = 1;
    weight_biased = 1;
    weight_fair = 1;

    while (unidrnd(n_bins_total)> n_bins_head)
          # coin flip resulted in tail -> game continues
          step = step + 1;
          weight_biased= weight_biased*p_tail_biased/w_tail;
          weight_fair = weight_fair*p_tail_fair/w_tail;

     weights_biased(i) = weight_biased*p_head_biased/w_head;
     weights_fair(i) = weight_fair*p_head_fair/w_head;

t_stop_fair = mean(stopping_time.*weights_fair);
t_stop_biased = mean(stopping_time.*weights_biased);
pwin_fair = sum(weights_fair(mod(stopping_time,2)==1)) /length(stopping_time);
pwin_biased = sum(weights_biased(mod(stopping_time,2)==1)) /length(stopping_time);

fprintf( “Stopping time for fair coin:      simulated: %f, calculated %f \n” , t_stop_fair, 2);
fprintf( “Stopping time for biased coin:    simulated: %f, calculated %f \n”, t_stop_biased, 36/11);
fprintf( “First player win for fair coin:   simulated: %f, calculated %f \n”, pwin_fair, 1/(2-0.5) );
fprintf( “First player win for biased coin: simulated: %f, calculated %f \n”, pwin_biased,1/(2-11/36) );

Stopping time for fair coin: simulated: 1.999439, calculated 2.000000
Stopping time for biased coin: simulated: 3.277148, calculated 3.272727
First player win for fair coin: simulated: 0.665160, calculated 0.666667
First player win for biased coin: simulated: 0.590150, calculated 0.590164

We can visualize the progress of the Monte Carlo simulation by plotting the running winning expectation versus the number of paths generated

plot( (1:length(stopping_time) )(mod(stopping_time,2)==1)  , cumsum(weights_biased(mod(stopping_time,2)==1))./(   (1:length(stopping_time) )(mod(stopping_time,2)==1)  ) , (1:length(stopping_time) )(mod(stopping_time,2)==1) , 1/(2-11/36)+0*(1:length(stopping_time) )(mod(stopping_time,2)==1) ,
(1:length(stopping_time) )(mod(stopping_time,2)==1)  , cumsum(weights_fair(mod(stopping_time,2)==1))./(   (1:length(stopping_time) )(mod(stopping_time,2)==1)  ) , (1:length(stopping_time) )(mod(stopping_time,2)==1) , 1/(2-18/36)+0*(1:length(stopping_time) )(mod(stopping_time,2)==1) )
title(’Simulated probability of winning vs. number of simulations’)
xlabel(’Number of simulations’)
ylabel(’Probability of winning’)
axis([1 m .55 .7])

The technique of keeping the random number generator measure separate from the actual coin process allows to compute the expectation of both the fair and biased coins along the same path. It also gives the flexibility to emphasize shorter or longer paths. A large n_bins_head would create shorter paths within the simulation as it would be more likely to throw a head, likewise a small n_bins_head would create more tails leading to longer paths.

A possible application to this is a situation where a Monte Carlo simulation using the actual price measure would not naturally sample important areas of the phase space. For example pricing a far out of the money option with few price paths sampling the interesting high gamma region around the strike. Similarly an option might be either too far or to close to the barrier in order for sufficient number of Monte Carlo paths to sample the knock out or reach a payout.

Simulating stopping times

We can also simulate the effective path length directly. From our initial convergence study is seems a path length of 25 should be sufficient. The following code generates path up to 25 steps with from a uniform sample space and weights each path according to its expectation as given by the actual properties of the fair or biased coins:

% simulating path length
% number of simulations m
m = 100000;
n_bins_total = 25;
w = 1/n_bins_total;
p_head_fair = 1/2;
p_tail_fair = 1-p_head_fair;
p_head_biased = 11/36;
p_tail_biased = 1-p_head_biased;
stopping_time = zeros(1,m);
weights_biased = zeros(1,m);
weights_fair = zeros(1,m);

for i = 1:m
  number_of_tails = unidrnd(n_bins_total) - 1;
  weights_biased(i) = 1/w*p_head_biased*p_tail_biased^number_of_tails;
  weights_fair(i) = 1/w*p_head_fair*(p_tail_fair)^(number_of_tails);

t_stop_fair = mean(stopping_time.*weights_fair);
t_stop_biased = mean(stopping_time.*weights_biased);
pwin_fair = sum(weights_fair(mod(stopping_time,2)==1)) /length(stopping_time);
pwin_biased = sum(weights_biased(mod(stopping_time,2)==1)) /length(stopping_time);

fprintf( "Stopping time for fair coin flip: simulated: %f, calculated %f   \n" , t_stop_fair, 2);
fprintf( "Stopping time for biased coin:    simulated: %f, calculated %f   \n", t_stop_biased, 36/11);
fprintf( "First player win for fair coin:   simulated: %f, calculated %f   \n", pwin_fair,   1/(2-18/36));
fprintf( "First player win for biased coin: simulated: %f, calculated %f   \n", pwin_biased,   1/(2-11/36));

Stopping time for fair coin: simulated: 1.994821, calculated: 2.000000
Stopping time for biased coin: simulated: 3.264265, calculated: 3.272727
First player win for fair coin: simulated: 0.669453, calculated: 0.666667
First player win for biased coin: simulated: 0.590975, calculated: 0.590164

Using the same number of 100,000 Monte Carlo trials the direct simulation of paths is much faster than generating individual coin flips. However as depict in the graph above we get a slower convergence and greater variance. This is possibly a consequence of uniformly sampling short and long paths, while the actual coin process visits short paths much more frequently.

Minimizing variance

This leads to the question of what is the optimal choice of the measure w we use to generate the simulation? Let’s take a step back and generally consider the expectation of a function f(x) over the sample space of a random variable x. In the actual probability measure it is the integral or sum over all possible samples of x applied to the observable f(x) and weighted by the corresponding probability p(x)

E_p[f(x)] = \sum f(x_i) p(x_i).

In Monte Carlo simulation we take the expectation over simulated paths, where the simulation is generated by a different measure w. In order to recover the identical expectation in the new measure averaging must be taken over f(x) multiplied by the likelihood ratio p(x)/w(x).

E_p[f(x)] = E_w[f(x)(p(x)/w(x)) ] = \sum f(x_i)(p(x_i)/w(x_i)) w(x_i) = \sum f(x_i) p(x_i)

The measure w represents the characteristics of the random scenario generator of the Monte Carlo implementation. In the example above it seems that choosing w to be uniform for directly generating stopping times did lead to an increased variance and poorer convergence. What choice for would be optimal? Let’s investigate what choice of w would minimize the variance of the simulation.

The variance of the expectation of f(x) in the measure w is given by

Var_w(f(x)) = E_w[(f(x)*p(x)/w(x))^2] - E_w[f(x)*p(x)/w(x)]^2
= \sum (f_i \frac{p_i}{w_i})^2 w_i - ( \sum f_i \frac{p_i}{w_i} w_i )^2
= \sum f_i^2 \frac{p_i^2}{w_i} - (\sum f_i p_i)^2

We are looking to minimize the variance. Introducing a Lagrange multiplier \lambda to take into to account that the w_i sum up to 1, the condition for the minimum variance can be expressed as

\frac{\partial}{\partial w_i} \left( \sum f_i^2 \frac{p_i^2}{w_i} - (\sum f_i p_i)^2 + \lambda (\sum w_i -1 ) \right) = 0

Taking partial derivatives of the variance with respect to the weights w_i  and solving for the weights

w_i = \frac{f_i p_i}{ \lambda^{1/2} }

Summing over the weights leads to

\sum w_i = 1 = \lambda^{-1/2} \sum { f_i p_i }

\lambda = \left( \sum { f_i p_i } \right)^2

and finally

w_i = \frac{f_i p_i}{ \sum { f_i p_i } }

We find that the optimum choice of simulation measure is the real measure modified by the observable f.

Substituting this choice of weights back into the expression for the variance

Var_w(f(x)) = \sum f_i^2 \frac{p_i^2}{w_i} - (\sum f_i p_i)^2
= \sum \frac{f_i^2 p_i^2}{f_i p_i}  \sum { f_i p_i } - (\sum f_i p_i)^2
= \sum {f_i p_i}  \sum { f_i p_i } - (\sum f_i p_i)^2 = 0

we find it reduces it to zero.

Let’s go back to the problem of estimating the wining probability. Here the winning indicator is only 1 for odd steps, meaning it would be ideal to not simulate paths with even number of steps, and normalize the weight according to the missing probability of even paths. In our case we want to estimate simultaneously the stopping time which needs also the even paths. As an compromise we use weights according to f=1 in order to cover most of the probability space.

Implementing the above leads to

% simulating path length
% number of simulations m
m = 100000;
n_bins_total = 25;
w = (25/36).^((1:n_bins_total )-1)/sum((25/36).^((1:n_bins_total )-1));
cw = cumsum(w);
p_head_fair = 1/2;
p_tail_fair = 1-p_head_fair;
p_head_biased = 11/36;
p_tail_biased = 1-p_head_biased;
stopping_time = zeros(1,m);
weights_biased = zeros(1,m);
weights_fair = zeros(1,m);

for i = 1:m
 number_of_tails = min((1:n_bins_total)(cw-rand(1)>0))-1;
 weights_biased(i) = 1/w(1+number_of_tails)*p_head_biased*p_tail_biased^number_of_tails;
 weights_fair(i) = 1/w(1+number_of_tails)*p_head_fair*(p_tail_fair)^(number_of_tails);

t_stop_fair = mean(stopping_time.*weights_fair);
t_stop_biased = mean(stopping_time.*weights_biased);
pwin_fair = sum(weights_fair(mod(stopping_time,2)==1)) /length(stopping_time);
pwin_biased = sum(weights_biased(mod(stopping_time,2)==1)) /length(stopping_time);

fprintf( "Stopping time for fair coin flip: simulated: %f, calculated %f   \n" , t_stop_fair, 2);
fprintf( "Stopping time for biased coin:    simulated: %f, calculated %f   \n", t_stop_biased, 36/11);
fprintf( "First player win for fair coin:   simulated: %f, calculated %f   \n", pwin_fair,   1/(2-18/36));
fprintf( "First player win for biased coin: simulated: %f, calculated %f   \n", pwin_biased,   1/(2-11/36));

Stopping time for fair coin flip: simulated: 2.000213, calculated 2.000000
Stopping time for biased coin: simulated: 3.274630, calculated 3.272727
First player win for fair coin: simulated: 0.666232, calculated 0.666667
First player win for biased coin: simulated: 0.590285, calculated 0.590164

Starting from a simple flipping coin game we investigated theoretical and numerical solution techniques to compute expectations. This post did not discuss other popular variance reduction methods like the use of antithetic paths or control variates, which could be a topic of future writings.

Posted in Uncategorized | Tagged , , , , , , | 1 Comment

Volatility and Correlation

The implied option volatility reflects the price premium an option commands. A trader’s profit and loss ‘P&L’ from hedging option positions is driven to a large extend by the actual historical volatility of the underlying assets. Thus as option premiums are driven by supply and demand they should reflect to some extend the expectation of future realized asset volatility by market participants.

Bloomberg GV function displays option implied vs. realized volatility. Below for example for the Dow Jones INDU index both realized and implied volatility measures are reasonably aligned

Another interesting aspect in the context of dispersion trading is implied correlation. Implied correlation is a measure on how index volatility compares to the volatility of a basket of the individual index components. A trader can take a position in correlation for example by selling options of the individual component stocks and buying index options. Instead of taking position in options a similar strategy can also be executed using variance swaps.

The Dow Jones index I is comprised of 30 individual stocks Si
I = \sum_{i=1}^{N}{w_i S_i}
with basket weights wi.
The price volatility of the index is determined by the aggregation of individual stock volatilities according to its correlation structure. Let’s consider two extreme cases. In the situation the movement of all basket components are aligned the pairwise correlation between assets is 1. The correlation 1 basket variance V1 is given by the squared sum of weighted asset volatilities
V_1 = (\sum_{i=1}^{N}{w_i \sigma_i})^2
In case all asset movement disperse independent of each other the pairwise correlation between assets is 0. The correlation 0 basket variance V0 is given by the sum of variances
V_0 = \sum_{i=1}^{N}{w_i^2 \sigma_i^2}
The relation of the actual index variance VI to these extreme cases is reflected by the implied correlation ρ
\rho = \frac{V_I-V_0}{V_1-V_0}
The graph below depicts the Dow Jones index implied volatility in the band between the dispersed correlation zero and extreme aligned correlation one cases.

The implied volatility is close to the upper range, indicating a high implied correlation. Plotting implied correlation directly confirms this

The graphs above are generated by downloading historical index and stock option implied volatility data into Excel using the Bloomberg specific functions,

=BDH("AA US EQUITY","PX_LAST",DATE(2010,6,30),DATE(2010,6,30)) 
=BDH("AA US EQUITY","30DAY_IMPVOL_100.0%MNY_DF",DATE(2010,6,30),DATE(2010,6,30))

for index composition, prices and implied option volatilities.

For calculation on the static data refer to the following google-docs spreadsheet, for better visibility right click the below and select ‘Open Frame in New Window’.

Finally let’s look at volatilities and correlations from raw realized asset prices

retrieved using the tawny and quantmod R packages

indx <- getIndexComposition('^DJI')
for(stk in indx) { getSymbols(stk) }

widthNrBusDays <- 90
dayBasis <- 260
indx.ts <- xts()
var0.ts <- xts()
var1.ts <- xts()
divisor <- 0.13213
for(stk in indx) 
if(length(indx.ts) == 0)
        indx.ts <- stk.ts/divisor
        var0.ts <- var.ts*(stk.ts/divisor)^2
        var1.ts <- sqrt(var.ts)*stk.ts/divisor
} else {
       indx.ts = indx.ts + stk.ts/divisor
       var0.ts = var0.ts + var.ts*(stk.ts/divisor)^2
       var1.ts = var1.ts + sqrt(var.ts)*stk.ts/divisor
var0.ts = var0.ts/(indx.ts*indx.ts)
var1.ts = (var1.ts/indx.ts)^2

indx.var.ts <- apply.rolling(diff(log(indx.ts)),width=widthNrBusDays,FUN="var")*dayBasis

indx.corr.ts <- (indx.var.ts - var0.ts)/(var1.ts - var0.ts)
indx.vol.ts <-  sqrt(indx.var.ts)

chart_Series(indx.corr.ts, name='DJI 30 day implied correlation')
chart_Series(indx.vol.ts, name='DJI 30 day realized volatility')
Posted in Uncategorized | Tagged , , , , , , | 4 Comments