Celebrating PI Day with String Art and D3

This week local artisans have been busy with string art projects to celebrate 3/14/2018. This was the inspiration to create a string art maker application that helps to explore shapes and to generate design templates.

String art is closely related to the Bezier curves that are widely used in computer graphics and animation. Curves and shapes emerge from a construction process that connects pairs of nodes by straight lines. In the example here the nodes are placed along the perimeter of a circle and numbered from 0 to N-1. Following a simple construction rule, such as connecting the i-th node with the node given by 21*i, interesting patterns emerge.


The application depicted below allows for easy exploration of construction expressions and patterns. Push the blue CALC button in the app to create the graphics. To open the app in the browser use the https://string-circle.firebaseapp.com link.


The code below demonstrates the use of D3 to create the string art graphics. It creates N nodes 0 ≤ i < N, each paired with the node j given by fj(i) = 21*i + N/2, where j is mapped back into the range 0 ≤ j < N. The graph makes use of d3.transition() to create an animation effect that helps to visualize the construction process.
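In formula form, node i sits at the angle \bf 2 \pi i / N on the circle and is paired with the node

\bf f_j(i) = ( 21 \, i + N/2 ) \mod N

which is exactly what the fj, fx and fy helpers in the listing below compute.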

Unfortunately the WordPress blog does not allow active JavaScript postings; for a demonstration of the JavaScript code in action see the duplicated post as an active ObservableHQ notebook.

let N = 300          // number of nodes on the circle
let T = 3000         // maximum transition duration in milliseconds
let height = 500
let width = 500

// pairing function: node i is connected to node fj(i), mapped back into the range 0..N-1
let fj = (i) => { return (((21*i + N/2) % N + N) % N); }

// node coordinates along the circle perimeter
let fx = (i) => { return (Math.cos(2*Math.PI*i/N) * (width/2) + width/2); }
let fy = (i) => { return (-Math.sin(2*Math.PI*i/N) * (height/2) + height/2); }

let svg = DOM.svg(width, height);    // Observable helper that creates the <svg> element
let nodes = Array(N).fill().map((_, i) => i)

// cycle through the stroke colors
let colors = ["cyan", "pink", "magenta", "blue"]
let icolor = 0;
function selectColor(i) {
  icolor = (icolor + 1) % colors.length
  return colors[icolor]
}

// one line per node, connecting node i with node fj(i)
let lines = d3.select(svg).append("g")
  .selectAll("line")
  .data(nodes)
  .enter()
  .append("line")
  .attr("x1", (d) => fx(d))
  .attr("y1", (d) => fy(d))
  .attr("x2", (d) => fx(fj(d)))
  .attr("y2", (d) => fy(fj(d)))
  .attr("stroke-width", 1)

// repeatedly restart a transition on each line to animate the construction process
function repeat(i) {
  d3.select(this)
    .transition()
    .duration(50 + (T - 50)*(N - i)/N)
    .attr("stroke", selectColor(i))
    .on("end", repeat)
}
lines.each(repeat)


Fun with R and graphs on the dawn of 2014

There is some secret message hidden in this graph


Let’s decode it


# Can you traverse the graph of letters to decode the secret message that decrypts to
# hashcode(message) = 2055263931398168510404

# Function to convert a string to a hash number 
hashcode <- function(s){ 
  sapply(s, function(si){sum(as.numeric(charToRaw(si))*31^( (1:nchar(si))-1)) } )}
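# quick illustration of the hash construction (made-up string, not part of the puzzle):
# "ABC" has raw character codes 65, 66, 67, so
# hashcode("ABC") = 65*31^0 + 66*31^1 + 67*31^2 = 66498
hashcode("ABC")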

# Define a list of vertices and neighbors
# For example from P one can reach P and Y
edges <- list("W"=c(" "),
              "Y"=c(" ","E"),
              "R"=c(" "),
              " "=c("N","H","Y"),

# reformat edgelist into a matrix of paired vertices that is expected by the igraph graphing package
el <-        (do.call(rbind, 
              (Reduce(append,lapply(names(edges), function(e) 
              { lapply(edges[[e]], function(ei){ matrix(c(e,ei),nrow=1)})})) ) ))

# create graph, for plotting only (requires the igraph package)
library(igraph)
g <- graph.edgelist(el)
# color edges
E(g)$color <- "grey"

# Paths are represented by strings, for example going from "Y" to "F" and "W" would be "YFW"
# Let's start traversing the path from three starting nodes H, N and Y
paths <- c("H","N","Y")

# Function that looks up the current node given a path
# for example currnode("YFW") returns "W"
currnode <- function(s){substring(s,nchar(s),nchar(s))}

# Starting from the initial points, traverse graph by appending the next vertices 
# according to the edge list until a path length of 13 steps
# For example currnode("PYF") returns "F"  
# edges[["F"]] returns "A" and "W"
# which creates paths "PYFA" and "PYFW"

# Run for 13 steps
while(nchar(paths[[1]])<=13) {
  paths <- Reduce(append,lapply(paths,function(p){ lapply(
    edges[[currnode(p)]] , function(nxt){paste0(p,nxt)} ) } ))
}

# Check if we found the 'secret message' ?
HNY <- which(hashcode(paths)==2055263931398168510404)
# Print the decoded message
msg <- paths[[HNY]]
#print all paths

# color the solution path
# (use <<- to assign the color within mapply)
path <- unlist(strsplit(msg,""))
invisible(mapply(function(x,y,col){
  E(g, path=c(x,y))$color <<- col},
  path[-length(path)], path[-1], heat.colors(length(path)-1) ))

Resulting in


Wishing everyone a Happy New Year 2014, cheers!


Portfolio Trading

In finance and investing the term portfolio refers to the collection of assets one owns. Compared to holding just a single asset at a time, a portfolio has a number of potential benefits. A universe of asset holdings within the portfolio gives greater access to potentially favorable trends across markets. At the same time the expected risk and return of the overall holding is subject to its specific composition.

A previous post on this blog illustrated a portfolio selection methodology that screens a universe of possible assets. The objective there was to find a particular combination that optimizes the expected risk-adjusted return (Sharpe ratio). The assumption was an underlying buy-and-hold strategy: once a particular portfolio is selected, its composition is held static until the final position is unwound.

This post concerns dynamic portfolio trading strategies in which the portfolio is periodically rebalanced. At prescribed intervals the composition of the portfolio is changed by selling out existing holdings and buying new assets. Typically it is assumed that reallocation does not change the overall portfolio value, with the possible exception of losses due to transaction costs and fees. A further classification for this type of strategy is whether borrowing is allowed (long-short portfolio) or whether all relative portfolio weights are restricted to be positive (long-only portfolio).

Following closely recent publications by Steven Hoi and Bin Li, we first set the stage by illustrating fundamental portfolio mechanics and then detail their passive aggressive mean reversion strategy. We implement the model in R and test it with NYSE and NASDAQ daily stock data from January 2011 to November 2012.

Portfolio Trading Terminology

Let's say we want to invest a notional amount of \bf P(t=0) = \$ 1,000 into a portfolio and record the total portfolio value \bf P(t) each day. The portfolio is composed of members with a per share dollar price of \bf S_i(t). Assuming we hold \bf \Delta_i(t) shares of each component, the relative portfolio weight of each member is

\bf b_i(t) = \frac{\Delta^{\$}_i(t)}{P(t)} = \frac{\Delta_i(t) S_i(t)}{P(t)}.

The dollar delta \bf \Delta^{\$}_i(t) = \Delta_i(t) S_i(t) expresses the value of each stock holding. Since the aggregated portfolio value is the sum over its components \bf P(t) = \sum_i \Delta_i(t) S_i(t) , the relative portfolio weights are always normalized to \bf \sum_i b_i(t) = 1 .
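A minimal R sketch of these definitions, using the day-1 closing share counts and prices from the buy-and-hold table below:

S     <- c(200, 50)        # per share prices S_i(t)
Delta <- c(5, 5)           # share counts Delta_i(t)
P     <- sum(Delta * S)    # portfolio value: 1250
b     <- Delta * S / P     # relative weights: 0.8 0.2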

Buy and Hold

An instructive example is an ideal, perfectly antithetic pair of artificial price paths. The two assets alternate between a 100% and a 50% return, so a static portfolio would not experience any growth after a full round trip is completed. The column \bf x shows the individual relative open-to-close or close-to-next-day's-close returns.

Day     | S_1 / S_2   | x_1 / x_2 | b_1 / b_2 | Δ_1 / Δ_2 | Δ$_1 / Δ$_2     | P
1-open  | $100 / $100 |     -     | 0.5 / 0.5 | 5 / 5     | $500 / $500     | $1,000
1-close | $200 / $50  | 2 / 0.5   | 0.8 / 0.2 | 5 / 5     | $1,000 / $250   | $1,250
2-open  | $200 / $50  |     -     | 0.8 / 0.2 | 5 / 5     | $1,000 / $250   | $1,250
2-close | $100 / $100 | 0.5 / 2   | 0.5 / 0.5 | 5 / 5     | $500 / $500     | $1,000

Dollar Cost Averaging

Dollar cost averaging is a classic strategy that takes advantage of periodic mean reversion. After each close the portfolio is rebalanced by distributing equal dollar amounts between the components. After the rebalance trades, the dollar position \bf \Delta^{\$}(t) at the open is the same for each component. Rebalancing itself does not change the total portfolio value. Expressed in terms of the relative portfolio weights \bf b , dollar cost averaging is the allocation strategy that rebalances the weights back to the uniform value given by the inverse of the number of components \bf N in the portfolio.

Dollar cost averaging:
\bf b_i = 1/N
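For the toy pair, this rebalance at the open of day 2 works out to the share counts shown in the table below (a small sketch):

P         <- 1250                      # portfolio value carried into day 2
S_open    <- c(200, 50)                # opening prices
b_target  <- rep(1/2, 2)               # b_i = 1/N
Delta_new <- b_target * P / S_open     # 3.125 and 12.5 shares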

Day     | S_1 / S_2   | x_1 / x_2 | b_1 / b_2 | Δ_1 / Δ_2    | Δ$_1 / Δ$_2        | P
1-open  | $100 / $100 |     -     | 0.5 / 0.5 | 5 / 5        | $500 / $500        | $1,000
1-close | $200 / $50  | 2 / 0.5   | 0.8 / 0.2 | 5 / 5        | $1,000 / $250      | $1,250
2-open  | $200 / $50  |     -     | 0.5 / 0.5 | 3.125 / 12.5 | $625 / $625        | $1,250
2-close | $100 / $100 | 0.5 / 2   | 0.2 / 0.8 | 3.125 / 12.5 | $312.50 / $1,250   | $1,562.50

Perfect Prediction

With the benefit of perfect foresight it is possible to allocate all weight to the single asset with the highest predicted gain.

Day     | S_1 / S_2   | x_1 / x_2 | b_1 / b_2 | Δ_1 / Δ_2 | Δ$_1 / Δ$_2     | P
1-open  | $100 / $100 |     -     | 1 / 0     | 10 / 0    | $1,000 / $0     | $1,000
1-close | $200 / $50  | 2 / 0.5   | 1 / 0     | 10 / 0    | $2,000 / $0     | $2,000
2-open  | $200 / $50  |     -     | 0 / 1     | 0 / 40    | $0 / $2,000     | $2,000
2-close | $100 / $100 | 0.5 / 2   | 0 / 1     | 0 / 40    | $0 / $4,000     | $4,000

Mean Reversion Strategy

The one-period portfolio return is given by the inner product of the relative weights and the component relative returns,

\bf L(b,x) = \sum b_i x_i
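For the equally weighted toy portfolio on day 1 this is (a quick check in R):

b <- c(0.5, 0.5)   # opening weights
x <- c(2, 0.5)     # day-1 close returns of the two assets
sum(b * x)         # 1.25, the move from $1,000 to $1,250 in the buy-and-hold table above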

The task of the trading strategy is to determine the rebalancing weights \bf b^{open}(t) for the opening position of day \bf t given yesterday's relative returns at the close \bf x^{close}(t-1) .
A forecast of a positive trend would be that today's returns repeat yesterday's, \bf x^{close}(t) = x^{close}(t-1) . In contrast, mean reversion assumes that today's returns are the inverse of yesterday's, \bf x_i^{close}(t) = 1 / x_i^{close}(t-1) .

The following strategy consists of determining the weights \bf b such that a positive trend assumption would lead to a loss \bf L(b^{open}(t),x^{close}(t-1)) = L_0 with threshold parameter \bf L_0 \leq 1.

Considering the excess returns \bf \Delta x_i = x_i - L_0 , the loss condition \bf L(b,x) = \sum_i b_i x_i = L_0 together with the weight normalization \bf \sum_i b_i = 1 gives

\bf \sum_i b_i x_i - L_0 \sum_i b_i = \sum_i b_i ( x_i - L_0 ) = \sum_i b_i \Delta x_i = 0

Typically the relative performance vector \bf \Delta x consists of a mix of growing and retracting assets, so that there is a balanced number of positive and negative components within the \bf \Delta x vector. The condition \bf \sum b_i \Delta x_i = 0 chooses a weight vector that is perpendicular to \bf \Delta x and lies on the unit simplex \bf \sum b_i = 1 .

Now assume a mean reverting process that consists of a rotation of performances within the asset vector. Graphically this corresponds to a reflection about a \bf 45 degree axis. After the rotation the angle between the weight vector and the new performance vector is reduced, which results in an increase of the portfolio return.

Thus setting for example \bf L_0 = 1 generates a flat portfolio in case the performance vector does not change while it increases the portfolio value in the case of an actual mean reversion.

The strategy implicitly assumes a price process that rotates between outperforming and lagging components within the portfolio. It does not cover the situation where all stocks are out- or underperforming at the same time. The long-only portfolio strategy relies on the ability to reallocate funds between winners and losers.

For the two share example we have

\bf b_1 = \frac{L_0-x_2}{x_1-x_2}
\bf b_2 = \frac{L_0-x_1}{x_2-x_1}

using \bf b_1+b_2=1 and \bf b_1 x_1 + b_2 x_2 = L_0
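Plugging in the day-1 returns of the toy pair with \bf L_0 = 1 reproduces the day-2 opening weights of the table below (a quick check in R):

L0 <- 1
x  <- c(2, 0.5)                     # day-1 close returns x_1, x_2
b1 <- (L0 - x[2]) / (x[1] - x[2])   # 1/3
b2 <- (L0 - x[1]) / (x[2] - x[1])   # 2/3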

The table below shows the result of computing the portfolio weights at the open of day 2 according to above formulas.

Day     | S_1 / S_2   | x_1 / x_2 | b_1 / b_2   | Δ_1 / Δ_2     | Δ$_1 / Δ$_2          | P
1-open  | $100 / $100 |     -     | 0.5 / 0.5   | 5 / 5         | $500 / $500          | $1,000
1-close | $200 / $50  | 2 / 0.5   | 0.8 / 0.2   | 5 / 5         | $1,000 / $250        | $1,250
2-open  | $200 / $50  |     -     | 0.33 / 0.67 | 2.08 / 16.67  | $416.67 / $833.33    | $1,250
2-close | $100 / $100 | 0.5 / 2   | 0.11 / 0.89 | 2.08 / 16.67  | $208.33 / $1,666.67  | $1,875

With more than two assets the loss condition together with the normalization is not sufficient to uniquely determine the portfolio weights. The additional degrees of freedom can be used to impose further criteria on the strategy. A good choice seems to be requiring that the new portfolio weights stay close to the previous selection, as this should minimize rebalancing transaction costs.

Minimizing the squared distance between the new and the previous weights under the normalization and loss constraints leads to the following expression for the new weights

\bf b(t) = b(t-1) - \tau(t-1) ( x(t-1) - {\bar x}(t-1) )

as a function of the previous portfolio weights \bf b(t-1) , the return vector \bf x(t-1) and the average return across the portfolio components \bf {\bar x}(t-1) = \frac{1}{n}\sum_i x_i(t-1) .

The control gain \bf \tau is taken as the ratio of the excess loss over the return variance

\bf \tau(t-1) = \frac{ \max(0, \, b(t-1) \cdot x(t-1) - L_0) }{ var(x(t-1)) }
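As a quick numeric check (a sketch on the toy pair, using the same sample variance as the R code below), the update reproduces the 1/3, 2/3 weights obtained from the two-share formulas above:

b  <- c(0.8, 0.2)                         # weights after the day-1 close
x  <- c(2, 0.5)                           # day-1 close returns
L0 <- 1
tau   <- max(0, sum(b*x) - L0) / var(x)   # excess loss over return variance
b_new <- b - tau * (x - mean(x))          # 0.333 0.667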


This post contains the R code to test the mean reversion strategy. Before running it, the function block in the appendix needs to be initialized. The following code runs the artificial pair of stocks as a demonstration of the fundamental strategy.

# create the antithetic pair of stocks and publish it into the environment
# generate sequence of dates
times <- seq(as.POSIXct(strptime('2011-01-1 16:00:00', '%Y-%m-%d %H:%M:%S')),
                      as.POSIXct(strptime('2011-12-1 16:00:00', '%Y-%m-%d %H:%M:%S')),
                      by="1 months")

# generate and store dummy price series = 200,100,200,100,200 ...
prices.xts <- xts(rep(c(200,100),length(times)/2), order.by=as.POSIXct(times))
stk <- 'STK1'
colnames(prices.xts) <- paste(stk,'.Adjusted',sep="")
assign(x=stk, value=prices.xts, envir=global.env )

# generate and store dummy price series = 100,200,100,200,100 ...
prices.xts <- xts(rep(c(100,200),length(times)/2), order.by=as.POSIXct(times))
stk <- 'STK2'
colnames(prices.xts) <- paste(stk,'.Adjusted',sep="")
assign(x=stk, value=prices.xts, envir=global.env )

# run the strategy and create a plot
strategy(portfolio= c("STK1","STK2"),threshold=.9,doplot=TRUE,title="Artificial Portfolio Example")

As a benchmark I always like to test against a portfolio of DJI member stocks, using recent 2011 history. The strategy does not do very well for this selection of stocks and time frame.

As an enhancement, stocks can be screened by ranking them according to their most negative one-day auto-correlation. The 5 best ranked candidates out of the DJI picked this way are "TRV", "XOM", "CVX", "PG" and "JPM".
Rerunning the strategy on the sub portfolio of these five stocks gives a somewhat better performance.

Again, the following code is used to load the DJI data from Yahoo and run the strategy

portfolio.DJI <- getIndexComposition('^DJI')
strategy(portfolio.DJI,.9,TRUE,"2011-01-01::2012-11-01",title="Portfolio DJI, Jan-2011 - Nov 2012")
strategy(c("TRV",XOM","CVX",PG","JPM"),.9,TRUE,"2011-01-01::2012-11-01",title="Portfolio 5 out of DJI, Jan-2011 - Nov 2012")

Next we screen a universe of about 5000 NYSE and NASDAQ stocks for those with the best individual auto-correlation within the 2011 time frame. Then we run the strategy for 2011 and out of sample in Jan-Nov 2012, with either the 5 or the 100 best names. The five-name portfolio consists of "UBFO", "TVE", "GLDC", "ARI" and "CTBI".
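For instance, the five-name run corresponds to a call analogous to the DJI example above (a sketch; the 100-name universe is assembled the same way from the screening output):

strategy(c("UBFO","TVE","GLDC","ARI","CTBI"),.9,TRUE,"2011-01-01::2012-11-01",title="Portfolio 5 out of NYSE/NASDAQ, Jan-2011 - Nov 2012")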


We test drove the passive aggressive trading strategy on recent daily price data. While it does not do very well on a mainstream selection like the DJI, pre-screening the stock universe according to auto-correlation did improve results, even when running the strategy out of sample for the purpose of cross validation. Other extensions, like including a cash asset or allowing for short positions, require further investigation.

The strategy has interesting elements, such as minimizing the change in the portfolio weights and the use of a loss function. However it only takes the immediate one-period lag directly into account. For use in other time domains, like high frequency, one could consider a collection of competing trading agents with a range of time lags that filter out the dominant mean reverting mode.

There is a variety of approaches in the literature that I would like to demonstrate within the framework of this blog as well, stay posted.

Appendix: R functions


# required packages
library(xts)
library(quantmod)
library(ggplot2)

# global market data environment
global.env <- new.env()

# search for anti-correlated stocks
# also returns the daily variance and the maximum one-day variance
portfolio.screen <- function(portfolio, daterange="")
{
   ac <- c()
   for(stk in portfolio)
   {
        p.xts <- Ad(get(stk,global.env)[daterange])
        dr.xts <- dailyReturn(p.xts)
        dr.maxvar <- max( (dr.xts-mean(dr.xts) )^2)
        dr.var <- coredata(var(dr.xts))
        dr.var.ratio <- dr.maxvar/(dr.var)
        dr.ac <- coredata( acf(dr.xts, plot=FALSE)$acf[1:5] )
        dr.ac.res <- c(stk, dr.var.ratio, dr.maxvar, dr.var, dr.ac)
        ac <- rbind(ac, dr.ac.res)
   }
   # sort by increasing lag-one auto-correlation (most negative first)
   ac_sort <- (ac[order(as.numeric(ac[,6]),decreasing=FALSE),])
   # return sorted matrix of stocks and correlations
   ac_sort
}

# portfolio weight vector b
# price relative return vector x
# threshold parameter e, usually e <= 1
# returns intrinsic value of a call on the portfolio return with strike e
strategy.loss <- function(b, x, e)
{
    max(sum(b*x) - e, 0)
}

# control gain: ratio of the excess loss over the return variance
strategy.gearing <- function(b, x, e)
{
    loss <- strategy.loss(b,x,e)
    varx <- var(x)
    if(varx>0) { -loss/varx }
    else { 0 }
}

# return a vector w that is close to v
# by minimizing the distance (w-v)^2
# with the condition that w is on the simplex
# sum(w) = z
strategy.simplex <- function(v,z)
{
    # initialize projection to zero
    w <- rep(0, length(v) )
    # sort v into u in descending order
    idx <- order(v,decreasing=TRUE)
    u <- v[idx]
    # index of the number of non zero elements in w
    rho <- max(index(u)[index(u)*u >= cumsum(u) - z])
    theta <- (sum(u[1:rho])-z)/rho
    w[idx[1:rho]] <- v[idx[1:rho]] - theta
    w
}

# compute the new weights from the previous weights b and the price relatives x
strategy.rebalance <- function(b,x,e)
{
    dbdx <- strategy.gearing(b, x, e)
    b_new <- b + dbdx*(x-mean(x))
    b_new <- strategy.simplex(b_new,1)
    b_new
}

# Simulate the portfolio trading strategy
# Arguments:
# portfolio : vector of stock tickers
# threshold : mean reversion loss parameter (of the order of, but smaller than, 1)
# doplot    : create an optional plot
# daterange : range of dates to test
# returns the accumulated gain of the strategy
strategy <- function(portfolio, threshold=1, doplot=FALSE, daterange="", title="" )
{
   # raw portfolio components price time series
   portfolio.prices.xts <- xts()
   for( stk in portfolio) {
       portfolio.prices.xts <- cbind(portfolio.prices.xts, Ad(get(stk,envir = global.env)[daterange]))
   }
   # downfill missing closing prices (last observation carried forward)
   portfolio.prices.xts <- na.locf(portfolio.prices.xts)
   # backfill missing history, for example for IPO assets that did not yet exist;
   # these are set to the constant initial 'IPO' price, resembling a cash asset
   portfolio.prices.xts <- na.locf(portfolio.prices.xts, fromLast=TRUE)
   times <- index(portfolio.prices.xts)
   nprices <- NROW(portfolio.prices.xts)

   # relative prices S(t)/S(t-1)
   portfolio.price.relatives.xts <- (portfolio.prices.xts/lag(portfolio.prices.xts,1))[2:nprices,]

   # initialize portfolio weights time series
   portfolio.weights.xts <- xts(matrix(0,ncol=length(portfolio),nrow=nprices-1),
                                order.by=as.POSIXct(times[2:nprices]))
   colnames(portfolio.weights.xts) <- portfolio

   # initialize strategy with equal weights
   portfolio.weights.xts[1,] <- rep(1/length(portfolio),length(portfolio))

   # run strategy:
   for( i in 1:(nprices-2)) {
      # portfolio weights at the opening of the day
      b_open <- portfolio.weights.xts[i,]
      # price relative return at the close x = S(t)/S(t-1)
      x_close <- portfolio.price.relatives.xts[i,]
      # compute the end of day rebalancing portfolio with the strategy
      b_new <- strategy.rebalance(as.vector(b_open),as.vector(x_close),threshold)
      # assign the portfolio composition for the next day's opening
      portfolio.weights.xts[i+1,] <- b_new
   }

   # aggregate portfolio returns
   portfolio.total.relatives.xts <- xts( rowSums(portfolio.weights.xts * portfolio.price.relatives.xts),
                                         order.by=as.POSIXct(times[2:nprices]))
   # cumulative wealth gained by the portfolio
   portfolio.cum.return.xts <- cumprod( portfolio.total.relatives.xts)

   if(doplot==TRUE)
   {
      Times <- times
      # cumulative portfolio performance
      df <- data.frame(Date=Times, Performance=c(1,coredata(portfolio.cum.return.xts)), pane='Portfolio')
      myplot <- ggplot()+theme_bw()+
        facet_grid (pane ~ ., scale='free_y')+
        labs(title=title) +
        ylab("Cumulative Performance") +
        geom_line(data = df,  aes(x = Date, y = Performance), color="blue")

      for(i in 1:length(portfolio))
      {
        # cumulative performance of each individual component
        df1 <- data.frame(Date=Times,
                          Performance=c(1,coredata(cumprod(portfolio.price.relatives.xts[,i]))),
                          pane='Components')
        names(df1)[2] <- "Performance"
        myplot <- myplot +
          geom_line(data = df1, aes_string(x = "Date", y = "Performance"),color=i)

        # portfolio weights over time
        df2 <- data.frame(Date=Times[2:nprices],
                          Weights=coredata(portfolio.weights.xts[,i]), pane='Weights')
        names(df2)[2] <- "Weights"
        myplot <- myplot + geom_line(data = df2, aes_string(x = "Date", y = "Weights"),color=i)
      }
      print(myplot)
   } # end of plotting

   # return performance
   last(portfolio.cum.return.xts)
}

Portfolio Selection and Optimization

Let's look at the task of selecting a portfolio of stocks that optimizes a particular measure of performance. In the structured product setting one might want to compose a portfolio to be used as a reference index for a derivative, with the objective that the index has a specifically high or low correlation structure, since the overall correlation affects the volatility of the final product. Similarly an individual investor might like to select a number of stocks in order to compose a portfolio that historically had a low volatility and a high growth rate. Yet another objective could be to find a portfolio with a specific correlation behavior with respect to a particular target.

To show a particular example, let's try to select a portfolio composed of four assets out of a larger universe of around 5,000 NYSE and NASDAQ names that would have had the highest possible Sharpe ratio in the year 2011. The Sharpe ratio measures the ratio of average daily returns over their standard deviation and is a widely used performance measure.

The attached R code runs the problem against a small universe of DJI stocks, in order not to require downloading a large amount of market data. For this exercise I don't correct the stock growth for the risk-free rate and consider long-only stock combinations.

For this task one must first be able to calculate the relative weights for a given set of four stocks that optimize the Sharpe ratio. In the example code the optimizer solnp from the R package Rsolnp is used, instead of a specialized function from a financial package, in order to retain the flexibility of optimizing other non-standard measures.

There are a number of approaches that could be used for stock selection. The brute force method would be to iterate over all possible combinations, which quickly becomes a very large problem to solve with limited computational resources. For example, checking all four-name portfolios out of a universe of N stocks requires N*(N-1)*(N-2)*(N-3)/24 combinations.
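To put numbers on this (a quick check in R):

# number of distinct 4-name portfolios
choose(30, 4)      # 27,405 combinations for the 30 DJI names
choose(5000, 4)    # roughly 2.6e13 combinations for a 5,000 name universe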

At the other end one can use a bootstrapping method, by first selecting the stock with the single best Sharpe ratio. Finding this stock requires N computations. Next, checking all two-stock portfolios composed of the best stock and all possible selections from the universe adds N-1 calculations. Repeating this sequence of adding single names at a time leads first to a three-name and finally to a four-name portfolio after a total of 4*N-6 steps. Having arrived at an already half-way optimized four-name portfolio one can then try to improve it further by rotating out individual names against the universe of stocks.

A systematic variation of this trial-and-error bootstrapping leads to the tree based search method illustrated in the following.

The picture below for the universe of 30 DJI stocks starts off at the bottom node with the portfolio of the stocks "JPM", "HPQ", "AA", "BAC" with a low performance Sharpe ratio of -0.42. From this the tree search generates four sub portfolio combinations, ("HPQ", "AA", "BAC"), ("JPM", "AA", "BAC"), ("JPM", "HPQ", "BAC") and ("JPM", "HPQ", "AA"), by removing one member each. Then it linear-searches among the universe of stocks for the best name to add to each sub portfolio in order to arrive at a new set of four-name portfolios. The one with the best Sharpe ratio is then selected. In the graph below this was the combination ("AA", "BAC", "HPQ", "MCD"). It becomes the root node of the new search tree branch. Eventually the tree search ends when no new combinations with a higher Sharpe ratio can be found along the deepest search path.


The following shows a similar search tree, now using the full universe of 5000 NYSE and NASDAQ stocks for which we have sufficient historical data without missing dates.


The graphs below depict the historical cumulative performance of the optimized portfolio against the performance of the initial portfolio. Note that the search among the 5000 stock universe yielded a growth portfolio with much smaller variance than just selecting among the 30 DJI stocks.

Further optimization approaches are Monte Carlo like tactics of selecting random starting points for multiple searches. Other extensions are genetic algorithms that create new trial portfolios out of genetic crossings of two parent portfolios. Overall the tree approach here seemed to be robust against varying the selection of the starting root portfolio and converged after a few search nodes.

This post is mainly to demonstrate the optimization technique rather than to advertise a particular portfolio selection. In reality a portfolio that performed particularly well in one year is not guaranteed to have similar characteristics in the following year. A more complete study would include back-testing and cross-validation of the selected portfolios against data that was not used during optimization.

To run the tree search first create the following functions in R by copying and pasting

# the functions below use quantmod (getSymbols, Ad), Rsolnp (solnp) and ape (tree functions)
library(quantmod)
library(Rsolnp)
library(ape)

# load list of symbols via getSymbols()
# enclose fetching symbols within a try clause in case of faulty/non-existing tickers
# disregard symbols without a complete history
# return list of successfully loaded names
symbol.exists <- function(stk)
{
  tryCatch( typeof(get(stk,envir=global.env))=="double", error = function(e) FALSE )
}
symbols.load <- function(indx.master)
{
  indx <- c() # list of loaded tickers
  for(stk in indx.master)
  {
    cc <- 0
    while( symbol.exists(stk) == FALSE & cc < 2 )  # retry a failed download once
    {
      cc <- cc + 1
      tryCatch( getSymbols(stk, from="2011-01-03", to="2011-12-30",env=global.env), error = function(e) Sys.sleep(1) )
    }
    if( symbol.exists(stk) ) { indx <- c(indx,stk) }
  }
  indx
}

# check master symbol list
# return clean list of symbols that are successfully loaded
symbols.clean <- function(indx.master)
{
  indx <- c()
  for(stk in indx.master){ if(symbol.exists(stk)==TRUE) {
            indx <- c(indx,stk) }}
  indx
}

# compute the Sharpe ratio for each individual name in the indx list
sharp.ratios <- function(indx)
{
 results <- matrix(nrow=length(indx), ncol=4)
 stks.dailyret.ts <- xts()
 cc <- 1
 for(stk in indx){
   stk.ts <- Ad(get(stk,env=global.env)['2011-01-03::2011-12-30'])
   stk.cumret.ts <- stk.ts/coredata(stk.ts[1])[1]
   stk.dailyret.ts <- tail(stk.cumret.ts/lag(stk.cumret.ts,k=1)-1,-1)
   stks.dailyret.ts <- merge(stks.dailyret.ts, stk.dailyret.ts)
   stk.mean <- mean(stk.dailyret.ts)
   stk.var <- var(stk.dailyret.ts)
   stk.sharp <- sqrt(252)*stk.mean/sqrt(stk.var)
   results[cc,] <- c(stk, stk.mean, sqrt(stk.var), stk.sharp)
   cc <- cc+1
 }
 results
}

# compute the Sharpe ratio and the optimum weight combination for the names in stks
sharp.ratio <- function(stks)
{
  stks.dailyret.ts <- xts()
  for(stk in stks ){
   stk.ts <- Ad(get(stk,global.env)['2011-01-03::2011-12-30'])
   stk.cumret.ts <- stk.ts/coredata(stk.ts[1])[1]
   stk.dailyret.ts <- tail(stk.cumret.ts/lag(stk.cumret.ts,k=1)-1,-1)
   stks.dailyret.ts <- merge(stks.dailyret.ts, stk.dailyret.ts)
  }
  dcount <- sqrt(252)
  mu <- colMeans(stks.dailyret.ts)
  V <- cov(stks.dailyret.ts)
  # maximize the annualized Sharpe ratio under full-investment and long-only constraints
  res <- solnp(
   rep(1/length(mu), length(mu)),
   function(w) - dcount*t(w) %*% mu / sqrt( t(w) %*% V %*% w ),
   eqfun = function(w) sum(w),
   eqB   = 1,
   LB = rep(0, length(mu)) )
  res$mu <-  t(res$pars) %*% mu
  res$std <-  sqrt( t(res$pars) %*% V %*% res$pars )
  res$sharp <- -tail( res$values,1)
  res
}

# alternative: optimize the Sharpe ratio computed from the weighted cumulative return series
sharp.simple.ratio <- function(stks)
{
 stks.cumret.ts <- xts()
 for(stk in stks ){
  stk.ts <- Ad(get(stk,global.env)['2011-01-03::2011-12-30'])
  stk.cumret.ts <- stk.ts/coredata(stk.ts[1])[1]
  stks.cumret.ts <- merge(stks.cumret.ts, stk.cumret.ts)
 }
 dcount <- sqrt(252)
 res <- solnp(
   rep(1/length(stks), length(stks)),
   function(w) {
    stks.dailyret.ts <- coredata(stks.cumret.ts) %*% w
    n <- length(stks.dailyret.ts)
    stks.dailyret.ts <- stks.dailyret.ts[2:n]/stks.dailyret.ts[1:n-1]-1
    - dcount* mean(stks.dailyret.ts)/ sqrt(var(stks.dailyret.ts))
   },
   eqfun = function(w) sum(w),
   eqB   = 1,
   LB = rep(0, length(stks)) )
 stks.dailyret.ts <- coredata(stks.cumret.ts) %*% res$pars
 n <- length(stks.dailyret.ts)
 stks.dailyret.ts <- stks.dailyret.ts[2:n]/stks.dailyret.ts[1:n-1]-1
 res$mu <- mean(stks.dailyret.ts)
 res$std <- sqrt(var(stks.dailyret.ts))
 res$sharp <- -tail( res$values,1)
 res
}

# scan the universe for the best stock to add to the portfolio
portfolio.add <- function(portfolio, universe)
{
  # check how many of the portfolio stocks are also part of the universe
  noverlap <- sum(sapply(1:length(portfolio), function(i) any(portfolio[i]==universe)))
  n <- length(universe)
  # allocate result matrix: portfolio names + new name + mu, std, sharp
  results <- matrix(nrow=n-noverlap, ncol=length(portfolio)+4)
  cc <- 1
  for (i in 1:n) { if(any(portfolio==universe[i])==FALSE) {
    stks <- c(portfolio, universe[i])
    sr <- sharp.ratio(stks)
    res <- c(sprintf("%s",stks), sr$mu, sr$std, sr$sharp )
    results[cc,] <- res
    cc <- cc+1
  }}
  results.ordered <- results[order(as.numeric(results[,ncol(results)]),decreasing=TRUE),]
  results.ordered
}

# label a tree node by its portfolio members and Sharpe ratio
node.id <- function(portfolio)
{
  ratio <- sprintf("%2.2f",as.numeric(sharp.ratio(portfolio)$sharp))
  paste(c(portfolio, ratio), collapse="_")
}

# check if a 3-name (or 4-name) portfolio has already been visited
portfolio.visited <- function(portfolio, visited)
  any(portfolio[1]==visited[,1] & portfolio[2]==visited[,2] & portfolio[3]==visited[,3] )
portfolio.visited4 <- function(portfolio, visited)
  any(portfolio[1]==visited[,1] & portfolio[2]==visited[,2] & portfolio[3]==visited[,3] & portfolio[4]==visited[,4] )

and then execute the following

global.env <- new.env()
indx.master <- getIndexComposition('^DJI')
symbols.load(indx.master)
universe <- symbols.clean(indx.master)
portfolio <- c("JPM", "HPQ", "AA", "BAC")

mytree <- (read.tree(text = paste("(:1,:1,:1,",node.id(portfolio),":1):1;")))
mytree$node.label <- "ROOT"

visited <- c()
sharp.max <- -999999
search.results <- c()
cont <- TRUE
while(cont == TRUE)
{
    # create the four 3-name sub portfolios by removing one member each
    portfolios <- list( c(portfolio[2],portfolio[3],portfolio[4]),
                        c(portfolio[1],portfolio[3],portfolio[4]),
                        c(portfolio[1],portfolio[2],portfolio[4]),
                        c(portfolio[1],portfolio[2],portfolio[3]) )
    # order each subportfolio alphabetically
    portfolios <- lapply(1:4, function(i) {
           portfolios[[i]] [ order( portfolios[[i]] ) ]  }  )

    i.portfolio.best <- 0
    my.portfolio.best <- c()
    mytree.txt <- "("
    visited4 <- c()
    for(i in 1:4)
    {
         visited <- rbind(visited, portfolios[[i]])
         # linear search for the best name to add to this sub portfolio
         my.results <- portfolio.add(portfolios[[i]], universe)
         my.results.best <- my.results[1,1:7]
         my.sharp <- as.numeric(my.results[1,7])
         search.results <- rbind( search.results, my.results.best)
         # check if we got this node already
         my.results.best.ordered <- my.results.best[1:4] [ order( my.results.best[1:4] ) ]
         visited4 <- rbind(visited4, my.results.best.ordered)
         if(i>1) { mytree.txt <- paste(mytree.txt,",") }
         mytree.txt <- paste(mytree.txt, node.id(my.results.best[1:4]),":1")

         if(my.sharp > sharp.max){
            sharp.max <- my.sharp
            i.portfolio.best <- i
            my.portfolio.best <- my.results[1,1:4]}
    }

    mytree.txt <- paste(mytree.txt,"):1;")
    if( i.portfolio.best > 0 )
    {
        # grow the search tree at the current best node and continue from there
        mytree.sub <- read.tree(text = mytree.txt)
        mytree.sub$node.label <- node.id(portfolio)
        mytree <- bind.tree(mytree, mytree.sub,
                            where=which(mytree$tip.label == node.id(portfolio) ), position=0)
        portfolio <- my.portfolio.best
    }
    if( i.portfolio.best == 0 )
    {
        # no further improvement found along this branch, stop the search
        cont <- FALSE
    }
}

# prune the place-holder tips and display the resulting search tree
mytree.drop <- drop.tip(mytree,1:2)
plot(mytree.drop, show.node.label=TRUE)

Picturing Trees



In this post I would like to illustrate the R package "ape" for phylogenetic trees for the purpose of assembling and displaying trees. The function read.tree creates a tree from a text description. For example the following code creates and displays two elementary trees:

library(ape)

tree.top <- (read.tree(text = "(TIP1:1,TIP2:2,TIP3:3):1;"))
tree.top$node.label <- "ROOT"
plot(tree.top, show.node.label=TRUE)

tree.child <- (read.tree(text = "(TIP4:1,TIP5:2):1;"))
tree.child$node.label <- "CHILD"
plot(tree.child, show.node.label=TRUE)



The required syntax for the read.tree text input is ( TIP_NAME : TIP_BRANCH_LENGTH , … ) : ROOT_BRANCH_LENGTH;

Note that for better illustration the branch lengths in the example above are chosen differently for the branches that lead to each leaf or tip.

Next the function bind.tree is used to attach the child tree to the root tree. The argument where determines the position within the parent root tree where the child branch should be attached. In the first example below the child branch is attached to the branch of the "TIP1" leaf. In this process the receiving leaf label is replaced by the new branch. In order to retain the receiving leaf label, I set the node.label attribute of the incoming child tree to its name.


tree.child$node.label <- "TIP1"
tree.combined <- bind.tree(tree.top, tree.child, where=which(tree.top$tip.label == tree.child$node.label ), position=0)

tree.child$node.label <- "TIP2"
tree.combined <- bind.tree(tree.top, tree.child, where=which(tree.top$tip.label == tree.child$node.label ), position=0)

tree.child$node.label <- "TIP3"
tree.combined <- bind.tree(tree.top, tree.child, where=which(tree.top$tip.label == tree.child$node.label ),position=0)


This procedure of adding branches to leaves can be repeated to build more and more complex trees. Happy gardening!



A short post about online educational resources on machine learning. Perhaps it is a sign of the increasing popularity of the field that there are now several courses on machine learning accessible online and for free.

Out of Stanford University comes Andrew Ng's almost 'legendary' machine learning lecture presented on the Coursera platform. The class consists of video lectures, multiple choice style review quizzes and programming projects in Matlab. A discussion forum supplies the student interaction that otherwise only a traditional university program offers. The class covers a good variety of supervised and unsupervised learning concepts, like neural nets, support vector machines, k-means and recommender systems.

Another online course is Berkeley's CS188.1x Artificial Intelligence on the EdX platform by Harvard and MIT. This course offers video lectures on topics like decision theory and reinforcement learning, with programming assignments in Python.

A third offering in this direction is Sebastian Thrun's course on Artificial Intelligence presented at Udacity. Here the lectures are themed around applications to robotics, with Google's driverless car as the running example. Udacity takes an interesting educational approach. It presents lectures in short sequences that are frequently interrupted by quick multiple choice style review questions. The purpose is to keep the student's attention focused on the lecture and provide the feel of real teacher interaction. Similar to the EdX course, programming projects are done in Python. Here the student can paste code directly into the Udacity browser window for instant feedback, without having to run Python on the local machine.

Another platform to watch is Khan Academy, which originally started with elementary to high school education but appears destined to expand into the university and professional level.

With the cost of learning at traditional universities exploding, at least in some parts of the world, it is fascinating to observe the formation of free or low cost online educational platforms. A possible revenue model for these platforms could be that lectures are offered for free while payment is requested for tests and certifications. There are already professional designations like the chartered financial analyst (CFA) or the FRM for risk management that are based on certifications outside the traditional university education setting. Interestingly a lot of development in the e-learning area comes out of the traditional schools, as a successful online program is certainly a great advertisement and recruitment tool by itself.


Learning Kernels SVM

Machine Learning and Kernels

A common application of machine learning (ML) is the learning and classification of a set of raw data features by a ML algorithm or technique. In this context a ML kernel acts for the ML algorithm like sunshades, a telescope or a magnifying glass for the observing eye of a student. A good kernel filters the raw data and presents its features to the machine in a way that makes the learning task as simple as possible.

Historically a lot of progress in machine learning has been made in the development of sophisticated learning algorithms; however selecting appropriate kernels remains a largely manual and time consuming task.

This post is inspired by a presentation by Prof. Mehryar Mohri about learning kernels. It reflects on the importance of kernels in support vector machines (SVM).

A total of three examples are presented. A linear kernel is shown to solve the first example but fails for the second task, where a square kernel is successful instead. Then a third example is presented for which neither the linear nor the square kernel is sufficient. There a successful kernel can be generated out of a mixture of both base kernels. This illustrates that kernels can be composed from a basis, resulting in products that are more powerful in solving the task at hand than each individual component.

Support Vector Machines

Consider a support vector machine (SVM) for a classification task. Given a set of pairs of feature data-point vectors x and classifier labels y={-1,1}, the task of the SVM algorithm is to learn to group features x by classifiers. After training on a known data set the SVM machine is intended to correctly predict the class y of a previously unseen feature vector x.

Applications of support vector machines in quantitative finance include for example predictive tasks, where x consists of features derived from a historical stock indicator time series and y is a sell or buy signal. Another example could be that x consists of counts of key-words within a text such as a news announcement and y categorizes it according to its impact on market movements. Outside of finance a text based SVM could be used to filter e-mail to be forwarded to either the inbox or the spam folder.

Linear Kernel

As indicated above the SVM works by grouping feature points according to their classifiers.
For illustration, in the toy example below two dimensional feature vectors x={x1,x2} are generated in such a way that the class y=-1 points (triangles) are nicely separated from the class y=1 points (circles).

The SVM algorithm finds the largest possible linear margin that separates these two regions. The marginal separators rest on the outpost points that are right on the front line of their respective regions. These points, marked as two bold triangles and one bold circle in the picture below, are named the 'support vectors' as they support the separation boundary lines. In fact the SVM learning task fully consists of determining these support vector points and the margin distance that separates the regions. After training, all other non-support points are not used for prediction.


In linear feature space the support vectors add to an overall hypothesis vector h, h = \sum c_i x_i, such that the classification frontiers are given by the lines h x + b = 1 and h x + b = -1 centered around h x + b = 0.

The code below utilizes the ksvm implementation in the R package kernlab, making use of Jean-Philippe Vert's tutorials for graphing the classification separation lines.


library(kernlab)

# custom kernel: weighted sum of a linear and a quadratic term
kfunction <- function(linear=0, quadratic=0)
{
  k <- function (x,y)
     linear*sum((x)*(y)) + quadratic*sum((x^2)*(y^2))
  class(k) <- "kernel"
  k
}

n = 25
a1 = rnorm(n)
a2 = 1 - a1 + 2* runif(n)
b1 = rnorm(n)
b2 = -1 - b1 - 2*runif(n)
x = rbind(matrix(cbind(a1,a2),,2),matrix(cbind(b1,b2),,2))
y <- matrix(c(rep(1,n),rep(-1,n)))
svp <- ksvm(x,y,type="C-svc",C = 100, kernel=kfunction(1,0),scaled=c())
plot(c(min(x[,1]), max(x[,1])),c(min(x[,2]), max(x[,2])),type='n',xlab='x1',ylab='x2')
title(main='Linear Separable Features')
ymat <- ymatrix(svp)
points(x[-SVindex(svp),1], x[-SVindex(svp),2], pch = ifelse(ymat[-SVindex(svp)] < 0, 2, 1))
points(x[SVindex(svp),1], x[SVindex(svp),2], pch = ifelse(ymat[SVindex(svp)] < 0, 17, 16))
# Extract w and b from the model
w <- colSums(coef(svp)[[1]] * x[SVindex(svp),])
b <- b(svp)
# Draw the separating line and the margins
abline(b/w[2],-w[1]/w[2])
abline((b+1)/w[2],-w[1]/w[2],lty=2)
abline((b-1)/w[2],-w[1]/w[2],lty=2)

Quadratic Kernel

The following example illustrates a case where the feature points are not linearly separable. Points of the class y=1 (circles below) are placed in an inner region surrounded from all sides by points of class y=-1, again depicted as triangles. In this example there is no single straight (linear) line that can separate both regions. However it is still possible to find such a separator by transforming the points x=\{x_1, x_2\} from feature space to a quadratic kernel space with points given by the corresponding square coordinates \{x_1^2, x_2^2\}.

The technique of transforming from feature space into a measure that allows for a linear separation can be formalized in terms of kernels. Assume \Phi() is a vector coordinate transformation function; for example the squared coordinate space would be \{ \Phi(x)_1 , \Phi(x)_2 \} = \{ x_1^2 , x_2^2 \}. The SVM separation task now acts in the transformed space to find the support vectors that generate

h \Phi(x) + b = \pm 1

for the hypothesis vector h = \sum c_i \Phi(x_i) given by the sum over the support vector points x_i. Putting both expressions together we get

\sum c_i K(x_i,x) + b = \pm 1

with the scalar kernel function K(x_i,x) = \Phi(x_i) \Phi(x). The kernel is composed out of the scalar product between a support vector x_i and another feature vector point x in the transformed space.

In practice the SVM algorithm can be fully expressed in terms of kernels without having to actually specify the feature space transformation. Popular kernels are for example higher powers of the linear scalar product (polynomial kernel). Another example is a probability weighted distance between two points (Gaussian kernel).
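As a side note, these two kernels can be written as plain R functions (a sketch; the kernlab package used below also provides polydot() and rbfdot() constructors for them):

# polynomial kernel of degree d: a power of the linear scalar product
poly_kernel <- function(x, y, d = 2) (sum(x * y))^d
# Gaussian kernel: probability weighted distance between two points
gauss_kernel <- function(x, y, sigma = 1) exp(-sum((x - y)^2) / (2 * sigma^2))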

Implementing a two dimensional quadratic kernel function allows the SVM algorithm to find support vectors and correctly separate the regions. The graph below illustrates that the non-linear regions are linearly separated after transforming to the squared kernel space.


The R implementation makes use of ksvm's flexibility to allow for custom kernel functions. The function kfunction returns a linear scalar product kernel for parameters (1,0) and a quadratic kernel function for parameters (0,1).

kfunction <- function(linear=0, quadratic=0)
{
  k <- function (x,y)
     linear*sum((x)*(y)) + quadratic*sum((x^2)*(y^2))
  class(k) <- "kernel"
  k
}

n = 20
r = runif(n)
a = 2*pi*runif(n)
a1 = r*sin(a)
a2 = r*cos(a)
r = 2+runif(n)
a = 2*pi*runif(n)
b1 = r*sin(a)
b2 = r*cos(a)
x = rbind(matrix(cbind(a1,a2),,2),matrix(cbind(b1,b2),,2))
y <- matrix(c(rep(1,n),rep(-1,n)))
svp <- ksvm(x,y,type="C-svc",C = 100, kernel=kfunction(0,1),scaled=c())
plot(c(min(x[,1]), max(x[,1])),c(min(x[,2]), max(x[,2])),type='n',xlab='x1',ylab='x2')
title(main='Feature Space')
ymat <- ymatrix(svp)
points(x[-SVindex(svp),1], x[-SVindex(svp),2], pch = ifelse(ymat[-SVindex(svp)] < 0, 2, 1))
points(x[SVindex(svp),1], x[SVindex(svp),2], pch = ifelse(ymat[SVindex(svp)] < 0, 17, 16))
# Extract w and b from the model   
w2 <- colSums(coef(svp)[[1]] * x[SVindex(svp),]^2)
b <- b(svp)

x1 = seq(min(x[,1]),max(x[,1]),0.01)
x2 = seq(min(x[,2]),max(x[,2]),0.01)

points(-sqrt((b-w2[1]*x2^2)/w2[2]), x2, pch = 16 , cex = .1 )
points(sqrt((b-w2[1]*x2^2)/w2[2]), x2, pch = 16 , cex = .1 )
points(x1, sqrt((b-w2[2]*x1^2)/w2[1]), pch = 16 , cex = .1 )
points(x1,  -sqrt((b-w2[2]*x1^2)/w2[1]), pch = 16, cex = .1 )

points(-sqrt((1+ b-w2[1]*x2^2)/w2[2]) , x2, pch = 16 , cex = .1 )
points( sqrt((1 + b-w2[1]*x2^2)/w2[2]) , x2,  pch = 16 , cex = .1 )
points( x1 , sqrt(( 1 + b -w2[2]*x1^2)/w2[1]), pch = 16 , cex = .1 )
points( x1 , -sqrt(( 1 + b -w2[2]*x1^2)/w2[1]), pch = 16, cex = .1 )

points(-sqrt((-1+ b-w2[1]*x2^2)/w2[2]) , x2, pch = 16 , cex = .1 )
points( sqrt((-1 + b-w2[1]*x2^2)/w2[2]) , x2,  pch = 16 , cex = .1 )
points( x1 , sqrt(( -1 + b -w2[2]*x1^2)/w2[1]), pch = 16 , cex = .1 )
points( x1 , -sqrt(( -1 + b -w2[2]*x1^2)/w2[1]), pch = 16, cex = .1 )

xsq <- x^2
svp <- ksvm(xsq,y,type="C-svc",C = 100, kernel=kfunction(1,0),scaled=c())

plot(c(min(xsq[,1]), max(xsq[,1])),c(min(xsq[,2]), max(xsq[,2])),type='n',xlab='x1^2',ylab='x2^2')
title(main='Quadratic Kernel Space')
ymat <- ymatrix(svp)
points(xsq[-SVindex(svp),1], xsq[-SVindex(svp),2], pch = ifelse(ymat[-SVindex(svp)] < 0, 2, 1))
points(xsq[SVindex(svp),1], xsq[SVindex(svp),2], pch = ifelse(ymat[SVindex(svp)] < 0, 17, 16))
# Extract w and b from the model   
w <- colSums(coef(svp)[[1]] * xsq[SVindex(svp),])
b <- b(svp)
# Draw the separating line and the margins in the quadratic kernel space
abline(b/w[2],-w[1]/w[2])
abline((b+1)/w[2],-w[1]/w[2],lty=2)
abline((b-1)/w[2],-w[1]/w[2],lty=2)

Alignment and Kernel Mixture

The final exploratory feature data set consists again of two classes of points in two dimensional space. This time two distinct regions of points are separated by a parabolic boundary, where vector points of class y=1 (circles) are below and y=-1 (triangles) are above the separating curve. The example is selected for its property that neither the linear nor the quadratic kernel alone is able to resolve the SVM classification problem.

The second graph below illustrates that feature points of both classes are scattered onto overlapping regions in the quadratic kernel space. It indicates that for this case the sole utilization of the quadratic kernel is not enough to resolve the classification problem.

In "Algorithms for Learning Kernels Based on Centered Alignment", Corinna Cortes, Mehryar Mohri and Afshin Rostamizadeh compose mixed kernels out of base kernel functions. This is similar to how a vector can be composed out of its coordinate base vectors or how a function can be assembled in a functional Hilbert space.

In our example we form a new mixed kernel out of the linear and the quadratic kernels K_1 and K_2

K_M(x,z) = \rho_1 K_1(x,z) + \rho_2 K_2(x,z)

The graph below demonstrates that the mixed kernel successfully solves the classification problem even though each individual base kernel is insufficient on its own. In experimentation the actual values of the mixture weights \rho_1, \rho_2 are not critical. Following the study above we choose the weights according to how well each individual base kernel aligns with the raw classification matrix Y = y y^T composed out of the classifier vector y.

Alignment is based on the observation that a perfectly selected kernel matrix K_{i,j} = K(x_i,x_j) would trivially solve the classification problem if its elements were equal to those of the Y matrix. In that case, choosing an arbitrary vector x_1 as support, the kernel K(x_1, x_i) would give 1 if x_1 and x_i are in the same category and -1 otherwise.

Similar to the concept of correlation, the kernel alignment between two kernels is defined as an expectation over the feature space, resulting in the matrix products

\rho ( K^c_1 , K^c_2 ) =  \frac{ < K^c_1 , K^c_2  > }{ \sqrt{ < K^c_1 , K^c_1  > < K^c_2 , K^c_2  > } }

with the matrix inner product < A , B > = Tr[A^T B]. The expectation is taken over centered n \times n kernel matrices K^c = (I-(11^T)/n) K (I-(11^T)/n). (For details see the reference above.)


Kernel matrix centering and alignment functions are implemented in R as follows:


n = 100
a1 = 4*runif(n)-2
a2 = 1 - a1^2 - 0.5 - 2*runif(n)
b1 = 4*runif(n)-2
b2 = 1 - b1^2 + 0.5 + 2*runif(n)

x = rbind(matrix(cbind(a1,a2),,2),matrix(cbind(b1,b2),,2))
y <- matrix(c(rep(1,n),rep(-1,n)))

kfunction <- function(linear, quadratic)
{
  k <- function (x,y)
     linear*sum((x)*(y)) + quadratic*sum((x^2)*(y^2))
  class(k) <- "kernel"
  k
}

# center a kernel matrix: K^c = (I - 11'/m) K (I - 11'/m)
center_kernel <- function(kernel)
{
  m <- length(kernel[,1])
  ones <- matrix(rep(1,m),m)
  (diag(m) - ones %*% t(ones) / m) %*% kernel %*% (diag(m) - ones %*% t(ones) / m)
}

# Frobenius inner product <A,B> = Tr[A' B], norm and alignment
f_product <- function(x,y) sum(diag(t(x) %*% y))
f_norm    <- function(x)   sqrt(f_product(x,x))
kernel_alignment <- function(x,y) f_product(x,y) / (f_norm(x) * f_norm(y))
x_kernel1 <- kernelMatrix(kfunction(1,0),x)
y_kernel   <- y %*% t(y)
x_kernel2 <- kernelMatrix(kfunction(0,1),x)

x_kernel1_c <- center_kernel(x_kernel1)
x_kernel2_c <- center_kernel(x_kernel2)

y_kernel_c <- center_kernel(y_kernel)

alignment1 <- kernel_alignment(x_kernel1_c,y_kernel_c)
alignment2 <- kernel_alignment(x_kernel2_c,y_kernel_c)

x_kernel3 <- kernelMatrix(kfunction(alignment1, alignment2),x)
x_kernel3_c <- center_kernel(x_kernel3)
alignment3 <- kernel_alignment(x_kernel3_c,y_kernel_c)

svp <- ksvm(x,y,type="C-svc",C = 100, kernel=kfunction(alignment1,alignment2), scaled=c())
plot(c(min(x[]), max(x[])), c(min(x[]), max(x[])),type='n',xlab='x1',ylab='x2')
title(main='Parabolic Featureset (Mixed Kernel SVM)')
ymat <- ymatrix(svp)
points(x[-SVindex(svp),1], x[-SVindex(svp),2], pch = ifelse(ymat[-SVindex(svp)] < 0, 2, 1))
points(x[SVindex(svp),1], x[SVindex(svp),2], pch = ifelse(ymat[SVindex(svp)] < 0, 17, 16))
# Extract w and b from the model   
w <- colSums(coef(svp)[[1]] * (alignment1*x[SVindex(svp),]))
v <- colSums(coef(svp)[[1]] * (alignment2*x[SVindex(svp),]^2))
b <- b(svp)

x1 = seq(min(x[]),max(x[]),0.01)
x2 = seq(min(x[]),max(x[]),0.01)

points( sqrt( (b-w[2]*x2-v[2]*x2^2)/v[1] + (w[1]/(2*v[1]))^2 ) - w[1]/(2*v[1]) , x2, pch = 16, cex = .1 )
points( -sqrt( (b-w[2]*x2-v[2]*x2^2)/v[1] + (w[1]/(2*v[1]))^2 ) - w[1]/(2*v[1]) , x2, pch = 16, cex = .1 )
points( x1, sqrt( (b-w[1]*x1-v[1]*x1^2)/v[2] + (w[2]/(2*v[2]))^2 ) - w[2]/(2*v[2]) , pch = 16, cex = .1 )
points( x1, -sqrt( (b-w[1]*x1-v[1]*x1^2)/v[2] + (w[2]/(2*v[2]))^2 ) - w[2]/(2*v[2]) , pch = 16, cex = .1 )

points( sqrt( (1+ b-w[2]*x2-v[2]*x2^2)/v[1] + (w[1]/(2*v[1]))^2 ) - w[1]/(2*v[1]) , x2, pch = 16, cex = .1 )
points( -sqrt( (1 + b-w[2]*x2-v[2]*x2^2)/v[1] + (w[1]/(2*v[1]))^2 ) - w[1]/(2*v[1]) , x2, pch = 16, cex = .1 )
points( x1, sqrt( (1+b-w[1]*x1-v[1]*x1^2)/v[2] + (w[2]/(2*v[2]))^2 ) - w[2]/(2*v[2]) , pch = 16, cex = .1 )
points( x1, -sqrt( (1+b-w[1]*x1-v[1]*x1^2)/v[2] + (w[2]/(2*v[2]))^2 ) - w[2]/(2*v[2]) , pch = 16, cex = .1 )

points( sqrt( (-1+ b-w[2]*x2-v[2]*x2^2)/v[1] + (w[1]/(2*v[1]))^2 ) - w[1]/(2*v[1]) , x2, pch = 16, cex = .1 )
points( -sqrt( (-1 + b-w[2]*x2-v[2]*x2^2)/v[1] + (w[1]/(2*v[1]))^2 ) - w[1]/(2*v[1]) , x2, pch = 16, cex = .1 )
points( x1, sqrt( (-1+b-w[1]*x1-v[1]*x1^2)/v[2] + (w[2]/(2*v[2]))^2 ) - w[2]/(2*v[2]) , pch = 16, cex = .1 )
points( x1, -sqrt( (-1+b-w[1]*x1-v[1]*x1^2)/v[2] + (w[2]/(2*v[2]))^2 ) - w[2]/(2*v[2]) , pch = 16, cex = .1 )

title(main='Not Linear Separable in Quadratic Kernel Coordinates')

In conclusion this post attempts to illuminate the important role of kernel selection in machine learning and demonstrates the use of kernel mixture techniques. The kernel weights were chosen somewhat ad-hoc by their alignments to the target classifiers, thus preceding the actual SVM learning phase. More sophisticated algorithms and techniques that automate and combine these steps are a topic of current ongoing research.
