Intelligent Trading

Kaggle Winton Stock Market Challenge - Post-Mortem

2016-01-28T01:22:00.000-08:00

Recently, I participated in a Kaggle contest sponsored by Winton Capital. The contest provided various market-related data and asked participants to forecast intraday and next-two-day returns over unseen future data. Prizes included $50,000 in monetary rewards and a chance to join Winton Capital's team of research scientists. I think these kinds of contests are a great way for modern data companies to scout for hidden talent; in return, they offer great real-world problems that let both aspiring and seasoned scientists explore and validate ideas and methodologies with like-minded individuals. I decided to write a post on my approach so that readers can repeat the methods and get some idea of the thought process behind them.

Fig 1. Sample observations of the integrated training 1 min. time series, with in-sample (iis) intraday in black and out-of-sample (oos) intraday in blue; D1 and D2 are red and green, respectively.

The bad news is that I didn't fare very well in the contest (edit: looks like I made the top 25%); however, a lot of interesting issues surfaced as the contest progressed. I joined the contest fairly late and read some posts about how the data was altered midstream. To understand this better, it helps to describe the data in more detail. The original dataset comprised two subsets: a train dataset with 40000 obs. and 211 features, and a test dataset with 40000 obs. and 147 features. The training features could be broken into an ID column, a mix of 25 unlabeled continuous and discrete features, and 183 ordered time series returns (the remaining two training columns held the loss weights discussed further down). The time series returns were further broken down into -D2, -D1, 1 min data, +D1, and +D2: the 1 min data spanned 179 intraday 1 min returns, and the -D and +D returns were the prior overnight and post daily returns, respectively. The goal was to predict the intraday and post two day returns beyond the original time series data points. So we were given roughly 2/3 of the prior returns and needed to predict the remaining 1/3 out of sample. Contestants were asked to submit their predictions for the test data and received feedback on a reserved portion of the hidden dataset, displayed as public leaderboard results. Typically, contestants can use the public leaderboard as a gauge of how well their models are performing on hidden, out of sample data. The overall final performance on the private leaderboard is withheld until the contest ends.

Many posters quickly figured out that the task was rather daunting and suspected that a zero baseline (just predict all zeros) or the median of the time series observations would be pretty difficult to beat. What changed during the contest, however, was the possibility that some posters could have gamed the data and reverse engineered it to obtain very good scores on the public leaderboard. In response, the contest administrators decided to add test data to the original private test dataset, upping it from 40000 unseen observations to 120000. This was done to deter any gaming of the system. It also made gauging cross validation much more difficult, as we were only given leaderboard feedback on 20000 of the observations. So we had to try to find the most robust model given training data that made up only 25% of all observations, and make it robust to the 75% that was new hidden data! All this with performance feedback on only 16.7% of the hidden data (20000 of 120000) displayed on the public leaderboard. Additionally, some posters suspected that the structure of the new dataset had been distorted from its original characteristics in order to avoid any gaming. I'm hoping to get more feedback from the contest and its providers in order to understand the data better.

My own approach was to use a gradient boosted model from the R gbm package. GBMs have historically performed well in Kaggle competitions, have been shown to be among the best performing models on a variety of datasets (see Kuhn, Applied Predictive Modeling), and are known to be a good choice for a robust out of the box model. Some of the benefits of GBMs are their ability to deal with NAs and their handling of unscaled data, so we don't have to worry much about pre-processing data. As I didn't have much time to explore other options, I (perhaps wrongly) spent most of my time on improving the gbm cross-validated performance. Like others, I found that my public leaderboard score was consistently at the high end of my cv error range. Since there was so much discussion about data manipulation, I (again, likely wrongly) stubbornly believed in the cv results, and assumed that the organizers might have biased the leaderboard set towards the worse performing data. In retrospect, I should perhaps have spent more time on extremely simple models (like means and medians) as opposed to fitting a relatively small set of training data with a complex model. I also found many odd data characteristics, like intraday data columns that were perfectly trending up in the training set and down in the test set; some data had very little noise, while other data had lots of noise. My intuition was that not all of the time series data were from true financial time series, and that the distributions were not stable from training to test sets. This made the problem more difficult, IMO, than dealing with real data and having knowledge of the underlying data. I attempted to filter out time series columns that differed between the training and test samples by comparing them with t-tests and setting a p-value threshold for removing unusually different columns.
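The t-test filter mentioned above can be sketched roughly as follows. This is only a minimal illustration on synthetic data, not the contest data: the helper name drift.filter, the column names, and the 0.05 threshold are assumptions made for the example. Note that a t-test of this kind only flags shifts in the mean; changes in variance or shape would need a different test.

```r
# Hypothetical helper: keep columns whose train and test means are not
# significantly different under a two-sample (Welch) t-test.
drift.filter <- function(trn.mx, tst.mx, p.th = 0.05) {
  p.vals <- sapply(seq_len(ncol(trn.mx)), function(i) {
    t.test(trn.mx[, i], tst.mx[, i])$p.value
  })
  names(p.vals) <- colnames(trn.mx)
  list(p.values = p.vals, keep = which(p.vals > p.th))
}

set.seed(42)
n <- 2000
# Synthetic data: columns 'a' and 'b' are stable, 'c' shifts in the test set.
trn.mx <- cbind(a = rnorm(n), b = rnorm(n), c = rnorm(n))
tst.mx <- cbind(a = rnorm(n), b = rnorm(n), c = rnorm(n, mean = 1))

flt <- drift.filter(trn.mx, tst.mx)
print(round(flt$p.values, 4))   # p-value per column
print(names(flt$keep))          # columns that pass the filter
```

With a shift this large, column 'c' is rejected; a stable column can still be dropped by chance at roughly the rate set by the threshold.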

Fig. 2. First 25 unlabeled features highlighting unstable trn/tst distributions in Feature_7.

Some posters observed that Feature_7 was unstable between train and test samples. Unfortunately, that was also the feature highlighted as the most important in both the D1 and D2 predictions from my GBM model. I left it in, as I didn't have enough time to change strategies at that point and figured that I still had reasonable CV estimates that would beat the zero and median estimates on average. I ended up just using median estimates of training features for the out of sample 1 min predictions, and used a GBM to predict the next two out of sample days. We were also given a vector of loss weights for the training data and were instructed to use weighted mean absolute error as our loss function. R's gbm package allows us to set distribution = "laplace" to get an absolute loss function. I also wrote out the loss function to manually record it in the cross validated loops and get a better feel for the sensitivity of the validation folds. A loop was created to record performance on 5 validation folds across several numbers of trees. The optimal value was then used to rebuild the final training models on all the training data, which could then be used to predict values for the test set.
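For reference, the weighted mean absolute error can be written out in a couple of lines. The sketch below uses synthetic returns and weights, and the helper name wmae is an assumption for illustration; it follows the same mean(weight * abs(actual - predicted)) form used in the code at the end of this post.

```r
# Weighted mean absolute error: mean of w * |actual - predicted|.
wmae <- function(actual, pred, w) {
  mean(w * abs(actual - pred))
}

set.seed(1)
n <- 1000
actual <- rnorm(n, mean = 0, sd = 0.01)   # synthetic "returns"
w      <- runif(n, min = 1, max = 2)      # synthetic loss weights

zero.pred   <- rep(0, n)                  # zero baseline
median.pred <- rep(median(actual), n)     # constant median baseline

print(c(zero   = wmae(actual, zero.pred, w),
        median = wmae(actual, median.pred, w)))
```

On noisy, roughly zero-centered returns the two baselines land very close together, which is part of why they were so hard to beat.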

Fig. 3 below shows cross validated results from a simple GBM model using 5 fold validation, with shrinkage = 0.1, n.minobsinnode = 50, interaction.depth = 1. The red line shows the median value for the zero baseline as a comparison to the WMAE.val folds across different tree iterations. Notice that selecting, say, n.trees = 900 yields a range of results that were all superior to the baseline median. I would expect somewhere around 1765 for this model; however, as explained earlier, performance was difficult to match on the private test set, as the feature distributions were different. I used the run below to illustrate, but the actual model I built had a lower shrinkage of 0.01 and much better performance; it took a long time to run, as I had to use about 8000 trees.

Fig 3. 5 fold validation results with simple GBM model, shrinkage=0.1 n.minobsinnode=50 interaction.depth = 1

In conclusion, I was able to use R to build a GBM model with good cross validated performance against the baseline zero metric, but the model did not perform very well on the hidden private data. I believe two big factors were differences in train and test datasets, as well as a much larger hold out dataset that was different enough that the true results were much worse than 5 fold CV would predict. It's equally possible that the model was too complex for the very noisy dataset.

Some posters were wondering about why WMAE was used as opposed to RMSE. In financial time series, MAE is often used as a loss function as it is more robust to outliers. You can imagine why large outliers will not only skew the data and fit, but their contributions will effectively be squared using RMSE.
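A tiny worked example makes the outlier point concrete. The error values below are made up purely for illustration.

```r
mae  <- function(e) mean(abs(e))
rmse <- function(e) sqrt(mean(e^2))

errs.clean   <- c(1, -1, 1, -1, 1, -1, 1, -1, 1, -1)  # ten errors of size 1
errs.outlier <- c(errs.clean[-10], 100)               # last error replaced by 100

print(c(mae.clean  = mae(errs.clean),  mae.outlier  = mae(errs.outlier)))
print(c(rmse.clean = rmse(errs.clean), rmse.outlier = rmse(errs.outlier)))
```

Both metrics equal 1 on the clean errors. With the single outlier, MAE moves to 10.9, while RMSE jumps to about 31.6, because the outlier enters as 100^2 before the square root is taken.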

Lastly, although it was a great exercise in machine learning and real world modeling, in my experience, targeting continuous variables and using weighted mean absolute error loss functions may not be the best fitness to target. The great thing that it illustrates is the practical difficulty in predicting financial time series. Thanks to Winton and Kaggle for a great experience!

#####################################################
Code is below. Note: I used Pretty R to embed the code, but there are color clashes with the background. You can copy and paste it into a local editor to view it more clearly. If anyone has suggestions on how to change the Pretty R font colors, please let me know.

library(gbm)
library(doParallel)
library(reshape2)
 
#trn <- read.csv("E:/Kaggle/Winton/train.csv") # 40k X 211
#tst <- read.csv("E:/Kaggle/Winton/test_2.csv") #120k X 147
 
trn <- readRDS("E:/Kaggle/Winton/trn.rds"); tst <- readRDS("E:/Kaggle/Winton/tst.rds")
 
interp <- function(x){
  x <- as.numeric(x)
  # seed a leading NA with the first non-missing value
  if(is.na(x[1])) x[1] <- x[!is.na(x)][1]
  # carry the last observation forward over the remaining NAs
  # (seq_along(x)[-1] is empty for length-1 input, so the loop is skipped)
  for(k in seq_along(x)[-1]){
    if(is.na(x[k])) x[k] <- x[k-1]
  }
  return(x)
}
disc.impute <- function(x){
 
xfactor <- levels(factor(x))
x[is.na(x)] <- sample(xfactor,length(which(is.na(x))),replace=TRUE)
 
return(as.numeric(x))
}
cont.impute <- function(x){
 
x[is.na(x)] <-
runif(length(which(is.na(x))),
min(range(na.omit(x))),max(range(na.omit(x))))
 
return(x)
}
generate.folds <- function(idx, k){
 
fold.idx <- holdout <- kfold <- list()
fold.len <- floor(length(idx)/k) # if idx does not divide evenly ignore last few observations
 
for(i in 1:(k-1)){
holdout[[i]] <- sample(idx[!(idx%in%unlist(holdout))],fold.len)
kfold[[i]] <- idx[!(idx%in%holdout[[i]])]
}
holdout[[k]] <- idx[!(idx%in%unlist(holdout))]
kfold[[k]] <- idx[!(idx%in%holdout[[k]])]
 
folds <- list(kfold,holdout)
names(folds) <- c('trn','val')
 
return(folds)
}
qlimiter <- function(x,qlim) {
ifelse(x < quantile(x,(1-qlim),na.rm=TRUE),quantile(x,(1-qlim),na.rm=TRUE),ifelse(
x > quantile(x,qlim,na.rm=TRUE),quantile(x,qlim,na.rm=TRUE),x))
}
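feat.dist.sel <- function(mx,my,pth){
  # two-sample t-test per column; keep columns whose train/test means
  # are not significantly different (p-value above pth)
  out <- numeric(dim(mx)[2])
  for(i in 1:dim(mx)[2]){
    out[i] <- t.test(mx[,i],my[,i])$p.value
  }
  sel <- which(out > pth)
  return(sel)
}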
interp.med <- function(x) {
x[is.na(x)] <- median(na.omit(x)); x
}
 
 
# column breakpoints
f1 <- 2; f25 <- 26; tsiis1 <- 27; tsiis2 <- 147; tsoos1 <- 148; tsoos2 <- 207;
tsD1 <- 208; tsD2 <- 209;
wtint <- 210; wtdly <- 211; trn.obs <- dim(trn)[1]; tst.obs <- dim(tst)[1];
clip.th <- 0.9999
discrete.features <- c(1,5,8,9,10,13,16,20)
continuous.features <- c(2,3,4,6,7,11,12,14,15,17,18,19,21,22,23,24,25)
 
# preprocess and clean data 
all.iis <- rbind(trn[,1:tsiis2], tst[,1:tsiis2])
 
all.trn.iis.ts <- trn[,tsiis1:tsiis2]
all.trn.oos.ts <- trn[,tsoos1:tsoos2]
all.trn.oos.ts.cln <- apply(all.trn.oos.ts,2,interp.med)
 
all.trn.oos.D1 <- trn[,tsD1]
all.trn.oos.D2 <- trn[,tsD2]
all.trn.oos.D1.clip <- qlimiter(all.trn.oos.D1,qlim=clip.th)
all.trn.oos.D2.clip <- qlimiter(all.trn.oos.D2,qlim=clip.th)
 
all.iis.ts <- all.iis[,tsiis1:tsiis2]
all.tst.iis.ts <- all.iis.ts[(trn.obs+1):(trn.obs+tst.obs),]
 
ts.sel <- feat.dist.sel(all.trn.iis.ts,all.tst.iis.ts,pth=.05)
all.iis.ts.sel <- all.iis.ts[,ts.sel]
 
all.iis.ts.cln <- apply(all.iis.ts.sel,2,interp.med)
all.trn.iis.ts.cln <- all.iis.ts.cln[1:trn.obs,]
all.tst.iis.ts.cln <- all.iis.ts.cln[(trn.obs+1):(trn.obs+tst.obs),]
 
 
 
# K FOLD CV
folds <- list(); n.fold <- 5
folds <- generate.folds(seq(dim(trn)[1]), n.fold)
 
#gbmGrid <- expand.grid(interaction.depth = c(5,9,15), n.trees = seq(400,1000,by=200),
#shrinkage = 0.1, n.minobsinnode = 50)
out <- res  <- list()
tune.par <- seq(10,200,10)
 
 
for(z in 1:length(tune.par)){
cl <- makeCluster(n.fold)
registerDoParallel(cl)
 
res <- list()
rmcol = 0
 
n.trees <- tune.par[z]
interaction.depth <- 1
n.minobsinnode  <- 50
shrinkage <- 0.1
 
res <- foreach(k=1:n.fold, .combine=rbind) %dopar% 
{
 
library(gbm)
 
trn.fld <- folds$trn[[k]]
val.fld <- folds$val[[k]]
 
# Naive estimates use zero baseline as predictions
 
trn.iis <- all.trn.iis.ts.cln[trn.fld,] ; val.iis <- all.trn.iis.ts.cln[val.fld,] 
trn.int.oos <- all.trn.oos.ts.cln[trn.fld,] ; val.int.oos <- all.trn.oos.ts.cln[val.fld,] 
trn.day.D1 <- all.trn.oos.D1.clip[trn.fld] ; val.day.D1 <- all.trn.oos.D1.clip[val.fld]
trn.day.D2 <- all.trn.oos.D2.clip[trn.fld] ; val.day.D2 <- all.trn.oos.D2.clip[val.fld]
 
wt.int.trn <- trn[trn.fld,wtint]; wt.int.val <- trn[val.fld,wtint]
wt.day.trn <- trn[trn.fld,wtdly]; wt.day.val <- trn[val.fld,wtdly]
 
trn.zro <- rep(0,dim(trn.iis)[1]); #zero pred for intraday
val.zro <- rep(0,dim(val.iis)[1]); #zero pred for intraday
trn.err.int <- wt.int.trn*abs(trn.int.oos-trn.zro) 
val.err.int <- wt.int.val*abs(val.int.oos-val.zro)
trn.err.D1 <- wt.day.trn*abs(trn.day.D1-trn.zro)
val.err.D1 <- wt.day.val*abs(val.day.D1-val.zro)
trn.err.D2 <- wt.day.trn*abs(trn.day.D2-trn.zro)
val.err.D2 <- wt.day.val*abs(val.day.D2-val.zro)
 
WMAE.trn <- mean(c(as.matrix(trn.err.D1),as.matrix(trn.err.D2),as.matrix(trn.err.int))) #oos trn wmae
WMAE.val <- mean(c(as.matrix(val.err.D1),as.matrix(val.err.D2),as.matrix(val.err.int))) #oos val wmae
WMAE.zro.all <- mean(c(WMAE.trn,WMAE.val))
 
# feature selection and cleaning outliers
all.feat  <- all.iis[,f1:f25]
all.feat[,3] <- qlimiter(all.feat[,3],.9999)
all.feat[,11] <- qlimiter(all.feat[,11],.9999)
trn.feat <- all.feat[trn.fld,]
val.feat <- all.feat[val.fld,]
 
 
trn.day.1 <- data.frame(trn.feat,all.trn.iis.ts.cln[trn.fld,],all.trn.oos.D1.clip[trn.fld])
names(trn.day.1)[dim(trn.day.1)[2]]<-'D1'
trn.day.2 <- data.frame(trn.feat,all.trn.iis.ts.cln[trn.fld,],all.trn.oos.D2.clip[trn.fld])
names(trn.day.2)[dim(trn.day.2)[2]]<-'D2'
 
val.day.1 <- data.frame(val.feat,all.trn.iis.ts.cln[val.fld,],all.trn.oos.D1.clip[val.fld])
names(val.day.1)[dim(val.day.1)[2]]<-'D1'
val.day.2 <- data.frame(val.feat,all.trn.iis.ts.cln[val.fld,],all.trn.oos.D2.clip[val.fld])
names(val.day.2)[dim(val.day.2)[2]]<-'D2'
 
# Build gbm models, here: use same ntrees for both D1,D2
gbm.trn.mod.D1 <- gbm(D1 ~ .,data=trn.day.1,distribution="laplace",  weights = wt.day.trn, 
n.trees=n.trees, interaction.depth=interaction.depth, shrinkage = shrinkage, n.minobsinnode =n.minobsinnode)
 
gbm.trn.mod.D2 <- gbm(D2 ~ .,data=trn.day.2,distribution="laplace", weights = wt.day.trn, 
n.trees=n.trees, interaction.depth=interaction.depth, shrinkage = shrinkage, n.minobsinnode =n.minobsinnode)
 
 
# TRN fold predictions
 
gbm.trn.pred.1 <- predict(gbm.trn.mod.D1, newdata=trn.day.1[,-dim(trn.day.1)[2]], 
n.trees=n.trees)
gbm.trn.pred.2 <- predict(gbm.trn.mod.D2, newdata=trn.day.2[,-dim(trn.day.2)[2]], 
n.trees=n.trees)
gbm.trn.act.1 <- all.trn.oos.D1[trn.fld] 
gbm.trn.act.2 <- all.trn.oos.D2[trn.fld] 
#par(mfrow=c(2,1))
#plot(gbm.trn.pred.1~gbm.trn.act.1)
#plot(gbm.trn.pred.2~gbm.trn.act.2)
 
 
trn.err.day.1 <- wt.day.trn*abs(gbm.trn.act.1 -gbm.trn.pred.1)
trn.err.day.2 <- wt.day.trn*abs(gbm.trn.act.2 -gbm.trn.pred.2)
trn.err.dly <- data.frame(trn.err.day.1,trn.err.day.2)
trn.err.int <- trn.err.int # intraday errors carried over from the zero baseline above, wt.int.trn*abs(trn.int.oos-trn.zro)
WMAE.trn <- mean(c(as.matrix(trn.err.dly),as.matrix(trn.err.int)))
 
# VAL fold predictions
 
gbm.val.pred.1 <- predict(gbm.trn.mod.D1, newdata=val.day.1[,-dim(val.day.1)[2]], 
n.trees=n.trees)
gbm.val.pred.2 <- predict(gbm.trn.mod.D2, newdata=val.day.2[,-dim(val.day.2)[2]], 
n.trees=n.trees)
gbm.val.act.1 <- all.trn.oos.D1[val.fld] 
gbm.val.act.2 <- all.trn.oos.D2[val.fld] 
 
#par(mfrow=c(2,1))
#plot(gbm.val.pred.1~gbm.val.act.1)
#plot(gbm.val.pred.2~gbm.val.act.2)
 
val.err.day.1 <- wt.day.val*abs(gbm.val.act.1-gbm.val.pred.1)
val.err.day.2 <- wt.day.val*abs(gbm.val.act.2-gbm.val.pred.2)
val.err.dly <- data.frame(val.err.day.1,val.err.day.2)
val.err.int <- val.err.int # intraday errors carried over from the zero baseline above, wt.int.val*abs(val.int.oos-val.zro)
WMAE.val <- mean(c(as.matrix(val.err.dly),as.matrix(val.err.int)))
 
res[[k]] <- data.frame(k,interaction.depth,n.trees,shrinkage,n.minobsinnode,WMAE.zro.all,WMAE.trn,WMAE.val)
 
}
 
out[[z]] <- t(do.call(rbind,res))
stopCluster(cl)
gc()
}
 
# aggregate and plot data
{
out.folds <- do.call(rbind,out)
mean.res <- aggregate(cbind(WMAE.zro.all,WMAE.trn,WMAE.val)~n.trees,data=out.folds,mean)
 
plot(mean.res$WMAE.trn~mean.res$n.trees,type='o',pch=19,col='red')
lines(mean.res$WMAE.val~mean.res$n.trees,type='o',pch=19,col='blue')
 
 
yr <- range(mean.res$WMAE.zro.all,mean.res$WMAE.trn,mean.res$WMAE.val)
plot(mean.res$WMAE.zro.all~mean.res$n.trees,type='o',col='blue',pch=19,ylim=yr,main=
paste(n.fold,'fold sweep'),ylab=paste(n.fold,'FOLD CV WMAE'))
text(mean.res$WMAE.zro.all~mean.res$n.trees,cex=.8,col='black',pos=3,label = round(mean.res$WMAE.zro.all,2))
lines(mean.res$WMAE.trn~mean.res$n.trees,type='o',pch=19,col='green',ylim=yr)
text(mean.res$WMAE.trn~mean.res$n.trees,cex=.8,col='black',pos=3,label = round(mean.res$WMAE.trn,2))
lines(mean.res$WMAE.val~mean.res$n.trees,type='o',pch=19,col='red',ylim=yr)
text(mean.res$WMAE.val~mean.res$n.trees,cex=.8,col='black',pos=3,label = round(mean.res$WMAE.val,2))
legend("topright",c("mean.res$WMAE.zro","mean.res$WMAE.trn","mean.res$WMAE.val"),
col=c('blue','green','red'),pch=19)
 
dev.new()
 
CV.df <- out.folds[,c(3,c(6:8))]
CV.df <- transform(CV.df,ID=seq(dim(CV.df)[1]))
 
df.m <- melt(CV.df, id.var=c('ID','n.trees'))
boxplot(value~variable+n.trees,df.m)
}
 
############################################################
##########    FINAL all TRN model generate/WMAE
############################################################
 
n.trees <- 900
interaction.depth <- 1
n.minobsinnode  <- 50
shrinkage <- 0.1
 
all.feat  <- all.iis[,f1:f25] #limit useless or sparse feature outliers
 
all.feat[,3] <- qlimiter(all.feat[,3],.9999)
all.feat[,11] <- qlimiter(all.feat[,11],.9999)
trn.feat <- all.feat[1:trn.obs,]
 
trn.day.1 <- data.frame(trn.feat,all.trn.iis.ts.cln,all.trn.oos.D1.clip)
names(trn.day.1)[dim(trn.day.1)[2]]<-'D1'
trn.day.2 <- data.frame(trn.feat,all.trn.iis.ts.cln,all.trn.oos.D2.clip)
names(trn.day.2)[dim(trn.day.2)[2]]<-'D2'
 
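# NOTE: the definitions below overwrite the ones above, so the final
# models are fit on the 25 features only (no intraday series columns)
trn.day.1 <- data.frame(trn.feat,all.trn.oos.D1.clip)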
names(trn.day.1)[dim(trn.day.1)[2]]<-'D1'
trn.day.2 <- data.frame(trn.feat,all.trn.oos.D2.clip)
names(trn.day.2)[dim(trn.day.2)[2]]<-'D2'
 
wt.day.trn <- trn[,wtdly]
 
 
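gbm.trn.mod.D1 <- gbm(D1 ~ .,data=trn.day.1, distribution="laplace",weights = wt.day.trn,n.trees=n.trees,
interaction.depth=interaction.depth,shrinkage=shrinkage,n.minobsinnode=n.minobsinnode,cv.folds=5)
gbm.trn.mod.D2 <- gbm(D2 ~ .,data=trn.day.2, distribution="laplace",weights = wt.day.trn,n.trees=n.trees,
interaction.depth=interaction.depth,shrinkage=shrinkage,n.minobsinnode=n.minobsinnode,cv.folds=5)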
 
############################################################
##########   OOS TST GENERATE
############################################################
 
tst.feat <- all.feat[(1+trn.obs):dim(all.feat)[1],]
 
tst.day.1 <- data.frame(tst.feat,all.tst.iis.ts.cln)
tst.day.2 <- data.frame(tst.feat,all.tst.iis.ts.cln)
 
all.oos.int.pred <- sapply(all.trn.oos.ts,median,na.rm=TRUE)
tst.pred.mtx <- t(matrix(rep(all.oos.int.pred,dim(tst)[1]),
length(all.oos.int.pred),dim(tst)[1]))
 
 
gbm.tst.pred.1 <- predict(gbm.trn.mod.D1, newdata=tst.day.1, 
n.trees = n.trees)
gbm.tst.pred.2 <- predict(gbm.trn.mod.D2, newdata=tst.day.2, 
n.trees = n.trees)
 
tst.submission.mtx <- cbind(tst.pred.mtx,gbm.tst.pred.1,gbm.tst.pred.2)
tstpred <- as.vector(t(tst.submission.mtx))
 
submission <- read.csv("E:/Kaggle/Winton/sample_submission_2.csv")
submission$Predicted <- tstpred
write.csv(submission,"E:/Kaggle/Winton/submission1.csv",row.names=FALSE)
subtst <- read.csv("E:/Kaggle/Winton/submission1.csv")


Review: Machine Learning An Algorithmic Perspective 2nd Edition

2015-04-03T00:24:00.000-07:00

Fig 1. Machine Learning: An Algorithmic Perspective. 2nd Edition. Stephen Marsland

I just wanted to briefly share some initial impressions of the 2nd edition of Stephen Marsland's very hands on text, "Machine Learning, An Algorithmic Perspective." Having been a big fan of the first, I requested a review copy from Dr. Marsland, and with his help, the publishers were kind enough to send me one. I spent the last few months going over most of the newer topics and testing many of the newer scripts in Python. With that, I'll dive into some of my impressions of the text.

I've stated before, that I thought the 1st edition was hands down, one of the best texts covering applied Machine Learning from a Python perspective. I still consider this to be the case. The text, already extremely broad in scope, has been expanded to cover some very relevant modern topics, including:

Particle Filtering (expanded coverage with working implementation in Python).
Deep Belief Networks
Gaussian Processes
Support Vector Machines. Now includes working implementation with cvxopt optimization wrapper.

     Those topics alone should generate a significant amount of interest from readers. There are several things that separate this text's approach from many of the other texts covering Machine Learning. One is that the text covers a very wide range of useful topics and algorithms. You rarely find a Machine Learning text with coverage in areas like evolutionary learning (genetic programming) or sampling methods (SIR, Metropolis-Hastings, etc). This is one reason I recommend the text highly to students of MOOC courses like Andrew Ng's excellent 'Machine Learning', or Hastie and Tibshirani's 'An Introduction to Statistical Learning'. Many of these students are looking to expand their set of skills in Machine Learning, with a desire to access working concrete code that they can build and run.

     While the book does not overly focus on mathematical proofs and derivations, there is sufficient mathematical coverage that enables the student to follow along and understand the topics. Some knowledge of Linear Algebra and notation is always useful in any Machine Learning course. Also, the text is written in such a way, that if you simply want to cover a topic, such as particle filtering, you don't necessarily need to read all of the prior chapters to follow. This is useful for those readers looking to refresh their knowledge of more modern topics.

     I did, occasionally, have to translate some of the Python code to work with Python 3.4. However, the editing was very minimal. For example, print statements in earlier versions did not require parentheses around the print arguments. So you can just change print 'this' to print('this'), for example.

     I found the coverage of particle filters and sampling, highly relevant to financial time series-- as we have seen, such distributions often require models that depart from normality assumptions. I might possibly add a tutorial on this, sometime in the future.

   In summary, I highly recommend this text to anyone that wants to learn Machine Learning, and finds the best way to augment learning is having access to working, concrete, code examples. In addition, I particularly recommend it to those students that have followed along from more of a Statistical Learning perspective (Ng, Hastie, Tibshirani) and are looking to broaden their knowledge of applications. The updated text is very timely, covering topics that are very popular right now and have little coverage in existing texts in this area.

*Anyone wishing to have a deeper look into topics covered and code, can find additional information and code on the author's website.

IBS Reversion (Stingray Plot) Intuition using R vioplot Package

2013-04-29T22:24:00.000-07:00

Fig 1. Vioplot of IBS filtered next day returns on SPY.

Fig 2. Table of Summary of Results for rtn.tom vs. IBS.class (H,L,M)

A colleague and I were recently discussing ways to get intuition about the IBS classification method for reversion systems. I thought I'd share a violin plot I generated that might help to get some visual intuition about it. We can download and process next day returns for an asset like SPY and group classes into LOW (IBS < 0.2), HIGH(IBS > 0.8), and MID(all others). One thing you can see in the plots is the pronounced right skew in the LOW class, and left skew in the HIGH class (they sort of resemble opposing stingrays -- stingray plots might be a more apt term for the reversion phenomena); while the MID class tends to be more symmetrical. The nice thing about the vioplot visualization is that it includes the density shape of the return distribution, which adds intuition over more common box and whisker plots.
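The grouping step described above can be sketched in base R. The bars below are synthetic (random) rather than downloaded SPY data, so no reversion skew should be expected in the output; the variable names mirror the figure labels, but the data-generating choices are assumptions for illustration only.

```r
set.seed(7)
n <- 500
# Synthetic OHLC-like data: random-walk close with a random range around it.
# Real use would substitute downloaded SPY daily bars.
close <- 100 + cumsum(rnorm(n, sd = 0.5))
low   <- close - runif(n, 0.1, 2)
high  <- close + runif(n, 0.1, 2)

ibs <- (close - low) / (high - low)

# Next-day close-to-close return, aligned with today's IBS (last day has none).
rtn.tom <- c(diff(close) / close[-n], NA)

# LOW (IBS < 0.2), HIGH (IBS > 0.8), MID (all others)
ibs.class <- factor(ifelse(ibs < 0.2, "L", ifelse(ibs > 0.8, "H", "M")),
                    levels = c("L", "M", "H"))

# Per-class summary of next-day returns
res <- t(sapply(split(rtn.tom, ibs.class), function(r) {
  r <- r[!is.na(r)]
  c(n = length(r), mean = mean(r), median = median(r))
}))
print(res)

# boxplot(rtn.tom ~ ibs.class)  # base-R view; the vioplot package adds the density shape
```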

Is CTA trend following Dead?

2013-03-10T17:31:00.000-07:00

This is just a very short comment related to discussions I've been having with a friend about trend following funds and a lot of the recent blogs and debates proclaiming the death of trend following.

Fig. 1 Barclay CTA Index

Using the Barclay CTA Index as a proxy, we can certainly see that there was a huge level shift in performance from around the 1990s onwards, making the annual return appear to be on an almost exponential decay. However, from about 1990 to the present, the annual returns have been steadily oscillating in a band from around -1% to 13% (with some outliers), and although the return per decade has been dropping over the last few decades, there's not really enough data to make any strong judgements about its demise. Just looking at the oscillatory behavior and considering the bearishness towards this class of funds, it might just suggest a good reversion based long bet on trend-followers.

IBS reversion edge with QuantShare

2013-01-04T18:55:00.000-08:00

Happy New Year to readers; my resolution this year is to continue delivering thoughts and ideas to others in the hope that we all might benefit somewhat from sharing observations. I'll start by describing an edge using QuantShare as the back-testing engine.

Fig 1. Optimized (overfit) SPY IBS Long run Performance

Have you ever had an edge that worked fairly well and then suddenly found that it was almost simultaneously revealed across several sources? This seemed to be the case with an edge I had discovered some time back via data mining. I first noticed it was divulged in a recent copy of Active Trader, 'The low-close edge,' by Nat Stewart. Later, I found the concept had also spread out into the blogosphere. Two notable write-ups can be found here and here. The acronym IBS seems to be the buzzword floating around, so I'll continue to use it here. IBS stands for Internal Bar Strength (not to be confused with the IBS that many traders might have developed over the years). The strength indicator is described with an extremely simple equation:

$IBS = \dfrac{Close - Low}{High -Low}$

What it describes is the relative position of the close with respect to the low to high range of the period. When Jaffray Woodriff was interviewed in the Hedge Fund Market Wizards book, he described a very simple predictive indicator (based only upon transformations of the Open, High, Low, and Close of the data) that had proved remarkably stable over the years. It inspired some debate in statistical and machine learning circles, but nevertheless sparks images of a holy grail. If there was ever a hypothesis model that came close to his description, I'd certainly consider this as a candidate for the reversion side. In addition to simplicity, one of the reasons it is so useful is that, unlike many other approaches to feature transformation of raw financial series, it is scale invariant and does not require any further scaling to support non-stationary data. Because the close always lies between the low and the high, the results of the transformation will always be bounded between 0 and 100%. So the transformed features will always sit inside a fixed and finite space regardless of the evolving data properties (a great property for machine learning). The algorithm runs very fast and does not require frequent readjusting of model parameters, unlike many online or econometric based models.

The system simply buys at the close when the IBS indicator closes near the low end of the day and goes short when the indicator closes near the high of the day; exit is next day close. Much of the time the indicator is neutral or no trade, allowing a good net risk adjusted return with low exposure. The thresholds, while often mentioned as being set to the 0.25 and 0.75 quartiles of the range, can be adjusted or found manually in the optimization settings.
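The entry and exit rule just described can be sketched in a few lines of R. This is not the QuantShare code, only a rough illustration: the helper name ibs.signal, the default thresholds, and the five hand-made bars are assumptions for the example, and costs and slippage are ignored.

```r
# Hypothetical helper: +1 (long), -1 (short) or 0 (flat) from IBS thresholds.
ibs.signal <- function(high, low, close, lo.th = 0.25, hi.th = 0.75) {
  ibs <- (close - low) / (high - low)
  ifelse(ibs < lo.th, 1, ifelse(ibs > hi.th, -1, 0))
}

# Five made-up daily bars
high  <- c(10.0, 10.5, 10.4, 10.8, 11.0)
low   <- c( 9.0,  9.5,  9.6, 10.0, 10.0)
close <- c( 9.1, 10.4, 10.0, 10.1, 10.9)

sig <- ibs.signal(high, low, close)

# Enter at today's close, exit at tomorrow's close.
n <- length(close)
next.rtn  <- close[-1] / close[-n] - 1
trade.rtn <- sig[-n] * next.rtn

print(sig)
print(round(trade.rtn, 4))
```

Here the closes sit at 10%, 90%, 50%, 12.5% and 90% of their daily ranges, so the signals are long, short, flat, long, short. Setting hi.th = 1 switches off the short side entirely, since IBS can never exceed 1.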

Fig 2. Performance fit In Sample Optimization (to 2000).

In order to avoid hindsight bias and over-fitting error (as in Fig 1.), I show an optimization using only in sample data for SPY (yahoo data) up to the year 2000 (rank sorted by CAGR and Sharpe). One interesting thing we notice is that the higher threshold is actually optimized to 1, meaning no reversion from the high/short side. This is consistent with what we would expect with a long bias/drift market. We always have to be careful about systematic shorting with a market that has long term positive drift.

Fig 3. In Sample/Out of Sample Performance with In Sample fitted parameters.

Fortunately, even without the short high reversion side, the system performed well for the rest of the out of sample data (Fig 3.). A last comment is that looking at the over-fit data should give us some insight about reversion systems and high volatility sell off regimes.

I've attached code to allow readers to repeat results.

Simulated back-test results, long only. $10,000 principal. No slippage/comm. incl.

Book Review: R for Business Analytics, A Ohri

2012-10-26T19:05:00.000-07:00

I've added a recently released book to my list of recommendations (at the amazon carousel to the right), as I've reviewed a copy provided to me via Springer Publishers. The book is R for Business Analytics, authored by A Ohri. Mr. Ohri provides us with a brief background of his own journey as a business analytics consultant, and shares how R helped complement his work with a very low cost (time to learn the software) and very large benefits. At the outset, he emphasizes that the book is not geared towards statisticians, but more towards practicing business analytics professionals, MBA students, and pragmatically oriented R neophytes and professionals alike. In addition, there is a focus on using GUI oriented tools towards assisting users in quickly getting up to speed and applying business analysis tools (Rattle, for example, is covered as an alternative to Weka, which has been covered here previously). In addition, he provides numerous interviews with well known company representatives who have either successfully integrated R into their own development flow (including JMP/SAS, Google, and Oracle ), or found that large groups of customers have utilized R to augment their existing suite of tools. The good news is that many of the large companies do not view R as a threat, but as a beneficial tool to assist their own software capabilities.

After assisting and helping R users navigate through the dense forest of various GUI interface choices (in order to get R up and running), Mr. Ohri continues to handhold users through step by step approaches (with detailed screen captures) to run R from various simple to more advanced platforms (e.g. CLOUD, EC2) in order to gather, explore, and process data, with detailed illustrations on how to use R's powerful graphing capabilities on the back-end.

The book has something for both beginning R users (who may be experienced in data science, but want to start learning how to apply R towards their field), and experienced R users alike (many, like myself, may find it useful to have a very broad coverage of the myriad number of packages and applications available, complemented by quickly accessible tutorial based illustrations). In summary, the book has an extremely broad coverage of R's many packages that can be used towards business data analysis, with a very hands on approach that can help many new users quickly come up to speed and running on utilizing R's powerful capabilities. The only potential down-side is that covering so many topics, comes at a cost of sacrificing some depth and mathematical rigor (leaving the door open for readers to pursue several more specialized R texts).

The Kaggle Bug

2012-08-22T14:52:00.000-07:00

If you have any interest in data mining and machine learning, you might have already caught the Kaggle bug.

I myself fairly recently got caught up in following the various contests and forums after reading a copy of "Practical Time Series Forecasting," 2nd edition, by Galit Shmueli. What makes the contests great is that they allow any ambitious and creative data scientist or amateur enthusiast to participate and to learn a wealth of new knowledge and tricks from more experienced professionals in the field.

What should make it even more interesting to readers here is considering that many of the winners that participate in these high purse contests are often from the financial world. Take one of my personally inspirational traders, Jaffray Woodriff, hedge fund manager of well-known machine learning oriented hedge fund, Quantitative Investment Management (better known by its acronym - QIM). I had mentioned recently to a surprised friend, that Mr. Woodriff had also participated in the more well-known Netflix prediction contest (having been a member of the third-place team at one point).

In particular, the most recent contest that has many eager followers watching is the $3,000,000 Heritage Provider Heritage Health Prize Competition, which is an open contest to predict the likelihood of patient hospital admission. What particularly inspired this blog post is a very useful blog from one of the leading contestants, Phil Brierley (a.k.a. handle Sali Mali), who has interestingly joined with the marketmaker team, also affiliated with a prediction related fund. Mr. Brierley has shared tremendously useful insights about practical methods of attacking the problem, all the way from SQL preprocessing and cleaning to intuitive visualization methodologies. I applaud him for his generous sharing of insights with the rest of the predictive analytics community. Although he hasn't posted in a while, his journal of thoughts is still highly useful.

Anyone looking for grubstake could certainly use three million to get started=)

Below are the specific links mentioned…

http://anotherdataminingblog.blogspot.com/
http://www.heritagehealthprize.com/c/hhp
http://www.kaggle.com/

...and one newer from stack exchange
http://blog.stackoverflow.com/2012/08/stack-exchange-machine-learning-contest/?cb=1

The Facebook Doomsday Watch

2012-05-30T23:02:00.001-07:00

I've been following the myriad circus of Facebook commentators and bystanders pointing to its horrific failed IPO launch and seemingly inevitable crash to zero. While my focus here isn't really so much on fundamentals or basic TA; I do want to comment on some subjective thoughts on the matter as well as illustrate one catchy graphic I put together.

Fig 1. FB IPO drawdown (with potential trajectory) vs. EBAY historical IPO opening (a.k.a the U-TURN pattern).

Having lived through and experienced the many ballyhooed IPO juggernauts of the past, I can't help but think back to how overvalued stocks like Ebay and Google felt to me at the outset. We all know that we can't directly compare such small samples in any statistical manner with much conviction, but that qualitative sense in me feels that Facebook is one of those Wall Street darlings we rarely encounter and wish we could go back and buy at a discount. Sure, there were the megaflops (Blackstone, The Globe, etc.) that never quite revived, but then again consider the Lynch method of buying (...are the masses using it?), the massive institutional support available, the number of shorts that are sure to pile on, and more importantly, the nagging fact that it is consistently one of the highest viewed websites of all (typically above or next to Google and Baidu; don't believe me, check Alexa). Ok, but enough of the soapbox on those biased musings. One quantitative comparison to consider in the chart above is how Ebay fared at the outset, when it was also lambasted as a failure throughout Internet chat-rooms and by various media pundits. What I have graphed is a drawdown for both (with adjusted Ebay quotes) relative to the 1st day open bid. The last points on Facebook are potential drawdown (relative to IPO opening price) trajectories at

28.84	-31.41% (yesterday's close)
26.5	-36.98%
24	-42.93%
23	-45.30%

So, I leave you with that as food for thought. I don't often discuss my thoughts about semi qualitative opportunities, but then again, we don't get these types of juggernaut long term opportunities all that often.* As always, please make your own informed decisions, and I'll try to get back on topic... one of these days.

* Two other counter points (amongst many excluded) that I'm sure some more astute observers will note.
1) Ebay IPO U-Turn occurred during the mega bull run dot com mania.
2) If the Greece (or insert any suitable catalyst here) fiasco escalates into a fat tail flight to safety avalanche (of which I pointed out have been exceedingly abundant of late); then keep in mind Facebook and any other equity leaders should be expected to plunge together; hence, the emphasis on LONG term portfolio component opportunity.

Expanding Visualization of published system edges (R)

2012-02-28T21:59:00.003-08:00

I happened to be looking over a revised text of a systems author I happen to follow. I will be a bit vague about specifics, as the system itself is based on well known ideas, but I'll leave the reader to research related systems. The basic message illustrated in this post, is that I often make an effort to look at different viewpoints of system related features that are not always explored in the texts.

Fig 1. BarGraph Illustration showing 0.48% average weekly gain given conditional system parameters, over arbitrary trades giving 0.2% average return per week over same 14 yr. period.

For example, the following system is based upon buying at pullbacks of a certain equity series and holding for a week. In the book, a bargraph is shown illustrating the useful edge of about 0.48%/trade vs. simply buying and holding for 0.2%/trade. Although the edge is useful and demonstrated well in the bargraph illustration, it can be useful to look at the system performance from various different perspectives.

As an example, we might wonder how the system unfolded over time. In order to look at this, we can plot a time series representation of the system's equity curve (assuming 100% compounding, no slippage, and no fractional sizing). The curve is shown compared to a straw-broom plot of 100 monte carlo simulation paths of the true underlying data, comprised of randomly and uniformly selected data over the same period.

Figure 2. Plot of edge based equity path vs. simulated Monte Carlo Straw-Broom Plots of randomly selected series based upon true underlying data.
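For readers who want to build a similar straw-broom plot, here is a minimal sketch in base R. The weekly returns are simulated placeholders (an assumption, not the actual instrument), and the names wk_rtn, sim_eq, and actual_eq are purely illustrative. Each path resamples the weekly returns uniformly with replacement, so any ordering (auto-correlation) in the original series is thrown away.

```r
# Hypothetical weekly returns standing in for the real instrument (assumption).
set.seed(42)
n_weeks <- 14 * 52
wk_rtn <- rnorm(n_weeks, mean = 0.002, sd = 0.025)

# 100 Monte Carlo paths: resample weekly returns uniformly, with replacement,
# then compound them (100% compounding, no slippage).
n_paths <- 100
sim_eq <- replicate(n_paths,
                    cumprod(1 + sample(wk_rtn, n_weeks, replace = TRUE)))

# Equity curve of the returns in their original order, for comparison.
actual_eq <- cumprod(1 + wk_rtn)

# Grey broom of simulated paths with the original ordering overlaid in red.
matplot(sim_eq, type = "l", lty = 1, col = "grey70", log = "y",
        xlab = "week", ylab = "equity (start = 1)", main = "Straw-broom plot")
lines(actual_eq, col = "red", lwd = 2)
```

To compare a system against the broom, one would replace actual_eq with the system's own equity curve over the same weeks.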

Looking at the 'unwinding' of the actual system's 14yr. time series path, we can make a few observations.

1) The edge, in terms of terminal wealth only, far outperforms several randomly simulated data paths built from the actual instrument.

2) Unfortunately, the edge also has very wild swings and variation (resulting in a very large drawdown).

Had we blindly selected the system itself based upon the bar graph alone, it's very possible, that we could have entered at the worst possible time (just prior to the drawdown). It also illustrates an issue I personally have with using simple monte carlo analysis (with IID assumptions) as a proxy to underlying data. Namely, that the auto-correlation properties have been filtered out, making the system based edge appear much better in comparison. I have spent a lot of time thinking about ways to deal with this issue; but it's a discussion for another day. But it's not necessarily a bad result at all. Rather it gives us some features (persistence and superior edge/trade) that we can use as a spring-board for further optimization; for example, we might think about adding a conditional filter to mitigate the large drawdown based upon underlying features that may have coincided during that period.

Data was plotted using R ggplot2. Although I think the plotting tool is excellent, I find that the processing time is a bit consuming.

MINE: Maximal Information-based NonParametric Exploration

2012-01-31T23:21:00.000-08:00

There was a lot of buzz in the blogosphere as well as the science community about a new family of algorithms that are able to find non-linear relationships over extremely large fields of data. What makes it particularly useful is that the measure(s) it uses are based upon mutual information rather than standard pearson's correlation type measures, which do not capture non-linear relationships well.

The (java based) software can be downloaded here: http://www.exploredata.net/ In addition, there is the capability to directly run the software from R.

Fig 1. Typical non-linear relationship exemplified by intermarket relationships.

The algorithm seems promising as it would allow us to possibly mine very large data sets (such as financial intermarket relationships) and find potentially meaningful non-linear relationships. If we were to use the typical pearson's correlation measures, such relationships would show very small R^2 values, and thus be discarded as non significant relationships.
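To make the point about Pearson-type measures concrete, here is a small sketch on synthetic data (an assumption, not market data): a noisy parabola has an obvious relationship, yet its linear R^2 is close to zero, while a loess fit explains most of the variance. The variable names are illustrative only.

```r
# Synthetic non-linear relationship: y depends strongly on x, but not linearly.
set.seed(1)
x <- seq(-1, 1, length.out = 201)
y <- x^2 + rnorm(length(x), sd = 0.05)

# Squared Pearson correlation, i.e. the R^2 of a straight-line fit.
r2 <- cor(x, y)^2

# Share of variance explained by a local (loess) fit, for comparison.
fit <- loess(y ~ x)
loess_r2 <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)

cat("linear R^2:", round(r2, 4), " loess R^2:", round(loess_r2, 4), "\n")
```

A relationship like this would be discarded by a screen on linear R^2 alone, which is the gap a measure such as MIC is meant to fill.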

I decided to take it for a spin on a non-linear example taken from M. Katsanos' book on intermarket trading strategies (p 25. fig 2.3). In figure 1, we can clearly see that the relationship between markets is non-linear, and thus the traditional linear fit returns a low R^2 value of .143 (red line); a loess fit is also shown in blue. After running the same data through MINE, the results, returned in a .csv file, were...

MIC (strength) MIC-p^2 (nonlinearity)
0.16691002 0.62445 7.129283 -0.3777441

The MIC (Maximal Information Coefficient) of .167 was not much greater than the R^2 measure of .143 above. However, one of the mentions in the paper was that as the signal becomes more obscured by noise, the MIC will degrade comparably.

The next step would be to find some type of fit to minimize the noise component and make updated comparisons.

In order to show a better illustration of how useful it might be, I am attaching a screenshot of the reference material here.

Figure 2. Reproduced from Fig 6. 'www.sciencemag.org/cgi/content/full/334/6062/1518/DC1'

Notice the MIC Score measure outperforms other traditional methods on many non-linear structural relationships.

Here is the full R-Code to repeat the basic experiment.
###############################################
# MINE example from intelligenttradingtech.blogspot.com 1/31/2012

library(quantmod)
library(ggplot2)

getSymbols('^GSPC',src='yahoo',from='1992-01-07',to='2007-12-31')
getSymbols('^N225',src='yahoo',from='1992-01-07',to='2007-12-31')

sym_frame<-merge(GSPC[,6],N225[,6],all=FALSE)
names(sym_frame)<-c('GSPC','N225')

p<-qplot(N225, GSPC, data=data.frame(coredata(sym_frame)),
geom=c('point'), xlab='NIKKEI',ylab='S&P_500',main='S&P500 vs NIKKEI 1992-2007')

fit<-lm(GSPC~ N225, data=data.frame(coredata(sym_frame)))
summary(fit)
fitParam<-coef(fit)

p+geom_abline(intercept=fitParam[1], slope=fitParam[2],colour='red',size=2)+geom_smooth(method='loess',size=2,colour='blue')

### MINE results
library("rJava")
setwd('/home/self/Desktop/MINE/')

write.csv(data.frame(coredata(sym_frame)),file="GSPC_N225.csv",row.names=FALSE)
source("MINE.r")
MINE("GSPC_N225.csv","all.pairs")

##########################################################

The referenced paper is, "Detecting Novel Associations in Large Data Sets"
David N. Reshef, et al.
Science 334, 1518 (2011)

As an aside, I've been hooked on a TV series called "Numb3rs," playing on Amazon Prime. It's about an FBI agent who gets assistance from his genius brother, a professor of Mathematics at a prestigious university. So far, they've discussed Markov chains, Bayesian statistics, data mining, econometrics, heat maps, and a host of other similar concepts applied to forensics.

Free Online Stanford Machine Learning Course: Andrew Ng. Post Mortem.

2012-01-01T01:33:00.000-08:00

Happy New Year to all the viewers of this blog and just a short reminder that the course will be available again this January.
http://www.ml-class.org/course/auth/welcome

Having audited the course, I would highly recommend it to anyone who is interested in a very hands on learning session covering many of the topics I've posted about (and many other areas, such as how to deal with over/under fitting). Kudos to Dr. Ng for a fantastic, engaging, and informative course.

As an added incentive, users will become familiarized with many vectorized approaches to programming (via Octave), which are very useful in languages such as Python and R.

Free auditing of Stanford AI and Machine Learning Courses w/Peter Norvig

2011-10-14T18:02:00.000-07:00

Just wanted to notify viewers of a few great courses that are being offered free for auditing and/or participation by well known industry experts, including co-author of the classic text on AI, 'Artificial Intelligence: A Modern Approach,' Peter Norvig and Prof. Andrew Ng.

http://www.ai-class.com/
see also,
http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2011/10/14/BUFR1LH9JR.DTL

The notice is a bit late, but they are still accepting registrations.

Spatio-Temporal Data Mining: 2

2011-10-06T12:12:00.000-07:00

There are many visual methods used to identify patterns in space and time. I've discussed some in prior threads and will show a few others briefly here. One of the most difficult questions I often hear from others regarding markov type approaches, is how to identify states to be processed.

It is a similar problem that one encounters using simple linear type factor analysis. Unfortunately, there is no simple answer; however, because data streams are becoming so vast it becomes almost impossible to enumerate over all possible state sets. Visual mining techniques can be incredibly helpful in narrowing down that space as well as feature reduction. I often use these types of visualizations back and forth with unsupervised classification type learners to converge on useful state identifications.

Fig 1. Spatio-Temporal State plot

Figure 1 gives an idea on visualizing states with respect to time. But having such knowledge in isolation doesn't give us much use. We are more interested in looking for Bayesian type relationships between states that give some probabilities between linked states in time.

Fig 2. Fluctuation Plot

Several visual methods exist to capture the relationships visually. One common plot used in language processing and information theory, is a fluctuation plot. The above plot was built using the same state data as the first graph. It is often used to determine conditional relationships between symbols such as alphabet tokens. The size of each box is directly proportional to the weight of the transition probabilities between row and column states in tabular data. An example would be to think of the letters yzy more commonly followed by g (as in syzygy) than any other state token; thus, one would expect to quickly spot a larger box across a row of states representing the 'yzy' row token n-gram and 'g' column token .
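As a rough illustration of the numbers behind such a plot, the sketch below tabulates transitions between consecutive states and converts each row to conditional probabilities. The state sequence is made up for the example (an assumption); in a fluctuation plot, the box drawn at a given row and column would be sized by the corresponding entry.

```r
# Hypothetical sequence of discretized states (illustrative only).
states <- c("A", "B", "A", "C", "A", "B", "B", "A", "C", "A", "B")

# Pair each state with the one that follows it.
from <- head(states, -1)
to   <- tail(states, -1)

# Row-normalised transition table: P(next state = column | current state = row).
trans <- prop.table(table(from, to), margin = 1)
print(round(trans, 2))
```

Here state A is followed by B in 3 of its 5 transitions, so the A row shows 0.6 under B and 0.4 under C.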

Both plots were produced in R. ggfluctuation() is a plot command utilized from ggplot2. I am currently investigating how much easier and faster it might be to process such visualizations in tools like protovis and processing. I've been especially inspired by reading some of Nathan Yau's excellent visualization work in his book, 'Visualize This.' I included it in the link to the right for interested readers.

Arc Diagram and spatiotemporal data mining visualization

2011-09-23T17:49:00.000-07:00

I won't spend too much time discussing this fascinating topic other than to say it relates very much to prior discussions about pattern discovery via visual data mining (see lexical dispersion plots for example). I happened across an interesting visualization method called the Arc Diagram, developed by Martin Wattenberg. Working for data visualization groups at IBM and later Google, he developed some interesting visual representations of spatiotemporal data.

Fig 1. Arc Diagram and legend with example of discretized pattern archetype.

The resulting plot generates some fascinating temporal signatures, similar to what one might see in phase-space portraits from chaos. However, they have been frequently utilized to look for spatiotemporal signatures in music. One might discern a type of underlying order or visual signature of complexity as well as recurring patterns in sequential objects ranging from text based lyrical information to musical sheet notes.

Figure 1 shows an example of how one might utilize this tool towards temporal pattern discovery in time series. A weekly series from SPY has been discretized into alphabet tokens, based upon the bin ranges in the included legend. The small chart in the example would decode an archetypal pattern for the following sequence: ECDCECCD, into a time series representation of the 8 week data symbol. The following interactive java tool from another blogger, Neoformix, was then used to translate the data into an Arc Diagram. http://www.neoformix.com/Projects/DocumentArcDiagrams/index.html . Read from top to bottom, one can look at recurring and related patterns that are repeated over time; certain behavior might warrant further investigation.
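The discretization step can be sketched in a few lines of R. Note that the bin edges and the weekly returns below are hypothetical: the actual bin ranges live in the figure legend, and the returns were simply picked so that the output matches the example sequence quoted above.

```r
# Hypothetical weekly returns (chosen for illustration only).
wk_rtn <- c(0.031, -0.004, 0.012, -0.002, 0.027, -0.006, 0.001, 0.015)

# Hypothetical bin edges; each bin maps to one alphabet token A..E.
breaks <- c(-Inf, -0.02, -0.01, 0.01, 0.02, Inf)
tokens <- cut(wk_rtn, breaks = breaks, labels = LETTERS[1:5])

# Collapse the tokens into a single string that can be pasted into the tool.
pattern <- paste(as.character(tokens), collapse = "")
print(pattern)  # "ECDCECCD"
```

The resulting string is the kind of symbol stream that can be fed to an arc diagram tool.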

You can copy the following data stream into the tool to toy around with the tool to get a feel for the possibilities of visual pattern discovery.* I won't go into too much more detail about utilizing it, other than to say it appears to be a very useful tool in temporal based pattern discovery.

Please see the following for more ideas on arc diagrams and musical signatures:
http://www.research.ibm.com/visual/papers/arc-diagrams.pdf

http://turbulence.org/Works/song/mono.html

Blog mentioned:
http://www.neoformix.com/

* Not sure how to attach .xls file here, but if anyone wants a copy of the .xls file, you can send me an email and I'll try to get it out to you. Otherwise, you can simply grab a song lyric off the web to play with the tool.

Aug 4, 2011 "plunge" headlines are in the air tonight

2011-08-04T15:44:00.000-07:00

Today's financial headlines are littered with the word 'plunge.' Considering today's (cl-cl) drop on the S&P500 was just about -5%, I don't know that I would exactly call that a plunge.

Fig 1. Historical ts plot of S&P500 returns <= -5%

The following R code produced a time series plot of historical occasions where this occurred.

###################################################

library(quantmod)

getSymbols("^GSPC",from="1950-01-01",to="2012-01-01")
adj<-GSPC$GSPC.Adjusted
rtn<-(adj/lag(adj,1)-1)[2:length(adj)]
r05<-rtn[rtn<= -.05]

plot(sort(r05),type='o',main='S&P500 1950-present returns <= -5%')

###################################################
Although the frequency of such occurrences is arguably rare, the 1987 drop is much more worthy of the 1 day label 'plunge.'

One other disturbing observation in the data, however, is the large temporal clustering of occurrences in the recent 2008 region. Now that's behavior to be concerned about (not to mention revised flash crash data pts.).

filtered 1 day cl-cl returns <=-5% sorted by date

Pattern Recognition: forward Boxplot Trajectories using R

2011-07-28T14:54:00.000-07:00

Although the following discussion can apply to the Quantitative Candlestick Pattern Recognition series, it is addressing the same issue as any basic conditional type system -- how and when to exit. The following is one way to visualize and think about it, and is by no means optimal.

Fig 1. Posterior Boxplot Trajectory

Often we attempt to find some set of prior input patterns that leads to profitable posterior outcomes. However, in most of the available examples, we are typically only given heuristics and rules of thumb on where to exit. This might make sense, since no one can accurately predict where to exit. However, with knowledge of past samples, we can have some idea of where a good target to exit might be, given the prior knowledge of forward trajectories. I dubbed the name 'boxplot trajectory', here, as I think it's a useful way to visualize a group of many possible outcome trajectories for further analysis.

In this example, a set of daily price based patterns was analyzed via a proprietary program I wrote in R, which resulted in an input pattern yielding a set of 52 samples that met my conditional criteria. Fig 1 illustrates a way to look at the trajectory outcomes based upon one of the profitable patterns in the conditional criteria. The bottom graph is simply the plot of median results of each data point in the trajectory. We often try to imagine the best way to exit without foreknowledge of the future (and somewhat less rule of thumb based criteria).

Fig 2. Trajectory tree.

One approach would be to use some type of exiting rule based upon the statistical median of each sequential point's range. Knowing that 1/2 of the vertices occur above and 1/2 below the median, we should expect to hit at least 1/2 of the targets at or above the median. Given that the 3rd point is the highest median, it makes sense to exit earlier than waiting for a greater gain further out (which has an even lower median). So if we take as a target, the median value of the 3rd pt. we achieve an average and fixed target of 1.59% on 27/52 of the total samples.

Of the remaining samples, we may now wish to exit on the 11th bar (or earlier, if the same target is hit earlier) at a target of .556%, which is achieved on 13/52 of the total samples. This leaves only the last bar, for which we simply use the average return as the weighted return value for that target, in this case -1.74% for the remaining 12/52 of the samples. Notice that the worst contenders will always be the ones put off until the end.

The expectation yields E(rtn) = 27/52*.0159 + 13/52*.0056 - 12/52*.017 = .0057,
eking out a small average gain of .57%. Compounded, this gives:
(1+.0159)^27*(1+.0056)^13*(1-.017)^12 ~ 1.34, i.e. a 34% rtn for 52 trades, each less than 3 days in length. Hit rate (as a secondary observation) is 77% in this case.
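The arithmetic can be checked with a few lines of R, using the rounded figures from the text (1.59%, .56%, and -1.7% on 27, 13, and 12 samples). The variable names are only for illustration.

```r
# Sample counts and (rounded) returns for the three exit buckets.
n <- c(27, 13, 12)
r <- c(0.0159, 0.0056, -0.017)

# Expected return per trade: probability-weighted average of the bucket returns.
exp_rtn <- sum(n / sum(n) * r)

# Compounded return over all 52 trades.
cum_rtn <- prod((1 + r)^n) - 1

# Hit rate: share of trades that land in a profitable bucket.
hit_rate <- sum(n[r > 0]) / sum(n)

round(c(exp_rtn = exp_rtn, cum_rtn = cum_rtn, hit_rate = hit_rate), 4)
```

This gives roughly .0057 per trade, about 34% compounded, and a hit rate of about 77%.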

The approach is particularly appealing for a high frequency strategy with very low commissions. Notice it's by no means comprehensive (and yes, I've only shown in sample here), but rather a novel way to think about exiting strategies.

Simulating Win/Loss streaks with R rle function

2011-05-17T15:34:00.000-07:00

The following script allows you to simulate sample runs of Win, Loss, Breakeven streaks based on a random distribution, using the run length encoding function, rle in R. Associated probabilities are entered as a vector argument in the sample function.

You can view the actual sequence of trials (and consequent streaks) by looking at the trades result. maxrun returns a vector of maximum number of Win, Loss, Breakeven streaks for each sample run. And lastly, the prop table gives a table of proportion of run transition pairs from losing streak of length n to streak of all alternate lengths.
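If rle is unfamiliar, a tiny hand-made sequence (purely illustrative) shows what it returns before running the full simulation:

```r
# Seven trades: two wins, a loss, a win, then three losses in a row.
trades <- c("W", "W", "L", "W", "L", "L", "L")

# rle collapses consecutive repeats into (value, length) pairs.
runs <- rle(trades)
runs$values   # "W" "L" "W" "L"
runs$lengths  # 2 1 1 3

# Longest streak for each outcome type.
max_by_type <- tapply(runs$lengths, runs$values, max)
max_by_type   # L = 3, W = 2
```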

Example output (max run length of losses was 8 here):

100*prop.table(tt)
    lt.2
lt.1      1      2      3      4      5      6      7      8
   1 41.758 14.298 5.334 1.662 0.875 0.131 0.000 0.044
   2 14.692 4.897 1.924 0.787 0.394 0.087 0.131 0.000
   3 4.985 2.405 0.525 0.350 0.000 0.000 0.044 0.000
   4 1.662 0.875 0.306 0.087 0.000 0.000 0.000 0.000
   5 0.831 0.219 0.175 0.000 0.000 0.044 0.000 0.000
   6 0.087 0.131 0.044 0.000 0.000 0.000 0.000 0.000
   7 0.087 0.087 0.000 0.000 0.000 0.000 0.000 0.000
   8 0.044 0.000 0.000 0.000 0.000 0.000 0.000 0.000

maxrun
B L W
3 8 17

-----------------------------------------------------------------------------------------
#generate simulations of win/loss streaks use rle function

#win/loss/breakeven probabilities are passed as a numeric vector
trades<-sample(c("W","L","B"),10000,prob=c(.6,.35,.05),replace=TRUE)
traderuns<-rle(trades)
tr.val<-traderuns$values
tr.len<-traderuns$lengths
maxrun<-tapply(tr.len,tr.val,max)

#streaks of losing trades
lt<-tr.len[which(tr.val=='L')]
lt.1<-lt[1:(length(lt)-1)]
lt.2<-lt[2:(length(lt))]

#simple table of losing trade run streak(n) frequencies
table(lt)

#generate joint ensemble table streak(n) vs streak(n+1)
tt<-table(lt.1,lt.2)
#convert to proportions
options(digits=2)
100*prop.table(tt)
maxrun

High Low Clustering on intraday high frequency sampled data

2011-05-10T19:42:00.000-07:00

Nothing unusually exciting on this post, but I happened to be engaged in some particle based methods recently and made some simple visual observations as I was setting up some of the sampling environment in R. I am also using Rkward and Ubuntu to generate, so I'm gathering everything from the current environment (including graphics).

Fig 1. Parallel plot of half hr sample of High and Low intraday data points vs time (Max is purple dots, Min are red). Fig 2. Cumulative count of high low events per interval (blue = total high and low).

The plot illustrates sampled intraday data at half hour increments.
The highs and lows of each sample interval are overlaid using purple to denote an intraday high and red to denote an intraday low.
Interesting points of observation are--

1) The high and low samples tend to be clustered at open, midday, and close.
2) High and low events do not appear to be uniformly and randomly distributed over time.
This kind of data processing is useful towards generating, exploring, and evaluating pattern based setups.

The study is by no means complete or conclusive, just stopping by to show more of the type of data processing and visual capabilities that R is capable of. If anyone has done any more conclusive studies I'd be interested to hear.

P.S. If anyone notices any odd changes, for some reason Google was having some issues the last few days, and it appears to have reverted to my original (not ready to launch) draft.

Can one beat a Random Walk-- IMPOSSIBLE (you say?)

2011-03-08T15:12:00.000-08:00

Firstly, apologies for the long absence as I've been busy with a few things. Secondly, apologies for the horrific use of caps in the title (for the grammar monitors). Certainly, you'll gain something useful from today's musing, as it's a pretty profound insight for most (was for me at the time). I've also considered carefully, whether or not to divulge this concept, but considering it's often overlooked and in the public literature (I'll even share a source), I decided to discuss it.

Fig 1. Random Walk and the 75% rule

I've seen the same debate launched over and over on various chat boards, which concerns the impossibility of theoretically beating a random walk. In this case, I am giving you the code to determine the answer yourself.
The requirements: 1) the generated data must be from an IID gaussian distribution 2) series must be coaxed to a stationary form.

The following script will generate a random series of data and follow the so called 75% rule which says,
Pr[Price>Price(n-1) & Price(n-1) < Price_median] Or [Price < Price(n-1) & Price(n-1) > Price_median] = 75%. This very insightful rule (which is explained both mathematically and in layman's terms in the book 'Statistical Arbitrage' linked on the amazon box to the right), shows that given some stationary, IID, random sequence that has an underlying Gaussian distribution, the above rule set can be shown to converge to a correct prediction rate of 75%!

Now, we all know that market data is not Gaussian (nor is it commission/slippage/friction free), and therein lies the rub. But hopefully, it gives you some food for thought as well as a bit of knowledge to retort, when you hear the debates about impossibilities of beating a random walk.

R Code is below.

##################################################
#gen rnd seq for 75% RULE

#generate stationary rw time series
rw<-rnorm(100)

m<-median(rw)
trade<-rep(0,length(rw))

for(i in 1:(length(rw)-1)){
if(rw[i] < m) trade[i]<- (rw[i+1]-rw[i])
if(rw[i] > m) trade[i]<- (rw[i]-rw[i+1])
if(rw[i] == m) trade[i]<- 0}

eq_curve<-cumsum(trade)

par(mfrow=c(2,1))
plot(rw,type='l',main='random walk')
plot(eq_curve,type='l',main='eq_curve')
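The script above plots the equity curve but does not print the prediction rate itself. A minimal sketch to check the 75% figure directly, assuming a long IID Gaussian sample (the sample size and variable names are my own choices for illustration):

```r
# Long IID Gaussian sample so the empirical rate settles near its limit.
set.seed(123)
x <- rnorm(100000)
m <- median(x)

# Pair each observation with the one that follows it.
prev <- head(x, -1)
nxt  <- tail(x, -1)

# The rule predicts "up" when below the median and "down" when above it;
# a hit is counted when the next value actually moves that way.
hit <- (prev < m & nxt > prev) | (prev > m & nxt < prev)
hit_rate <- mean(hit)
print(hit_rate)  # close to 0.75
```

The intuition: a value sitting far below the median is very likely to be followed by something higher, and averaging that likelihood over all starting points works out to 3/4.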

Finally! A practical R book on Data Mining: "Data Mining With R, Learning with Case Studies," by Luis Torgo

2010-11-19T18:55:00.000-08:00

I've been a bit busy lately with a few big things, however, I wanted to stop by and mention a fantastic book for those who have been following along the R examples. Anyone who's followed my blog knows that I'm big on practical books with examples. There are also three main open source tools I've discussed with regards to prototyping trading systems: Weka, Python, and R. Of the three tools mentioned, I've been able to recommend Witten and Frank's book on Data Mining for Weka, and Stephen Marsland's book on Machine Learning as the Python bible for hands on Machine Learning. Well now, I can thankfully complete the trinity, with Luis Torgo's new book, 'Data Mining with R, Learning with Case Studies.'

Both R novices and experts will find this a great reference for Data Mining. The opening chapter has a useful intro to get you started on R (Factors, Vectors, and Data Frames, as well as other useful objects are covered with examples). Additional chapters cover both classification and regression type prediction schemes.

The most useful chapter to readers here, however, is the chapter on 'Predicting Stock Market Returns.' Many of the readers who have been looking for example scripts on some of the topics I've covered, will find them here. Not only is gathering and processing data (CSV, quantmod and yahoo finance, and MySQL) well covered, but various prediction and evaluation schemes (cross validation, sliding and growing windows, PerformanceAnalytics package) are discussed along with access to the author's code. Many topics I haven't discussed yet are available here as well, including MARS (Multivariate Adaptive Regression Splines), SVMs, and various validation techniques along with handy tabulation of results. Having read a previous draft, I'm still working into the examples, and welcome any feedback and thoughts I can address.

The book can be accessed via the amazon book showcase on the right and instructions for R code access are available in the book.

Conditioning Systems on Regime Variables

2010-08-10T13:22:00.000-07:00

Here is a brief and simple example of switching systems based upon regime type (sometimes called gating).

I've brought up the idea of conditioning systems based upon regimes many times in past posts. Some texts call this filtering, although I prefer to use the term conditional gating. The simple idea is to turn on a certain system during certain conditions and, during alternate conditions, either switch systems or simply track the underlying series. In this case the gating condition is regime, which in turn is High or Low Volatility as measured by the VIX. Although I'm not divulging the details of the underlying system itself, I've seen enough discussions in the public domain to feel that other traders have picked up on the ideas demonstrated here.

Fig 1. Terminal Wealth vs. VIX threshold

The animation below shows the system results during each step of the conditioning variable, the VIX. Notice the dramatic improvement at the value of 23. Also notice, as mentioned in earlier posts, how the optimal switching point of 23 is the most robust value: even if the OOS optimum falls somewhat to the left or right of it, the results remain among the best local values over a wide range of thresholds. The astute observer might have noticed that this system is simply tracking buy & hold during low VIX regimes, while switching on system V during the high regimes. It is evident that the terminal wealth simply tracks buy & hold once the threshold is set above a certain value of VIX, since the VIX then always sits under the threshold and the system stays locked in tracking mode.

The system is only shown in sample; however, I've found it to be pretty successful OOS as well.
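For readers who want to experiment with the gating idea, here is a minimal Python sketch. It is not the system from this post: the return series, the VIX values, and the function names are made up for illustration, and it assumes the VIX reading for each period is known before that period's return is earned.

```python
def gated_terminal_wealth(asset_returns, system_returns, vix, threshold):
    """Compound wealth, using the system when VIX > threshold, else buy & hold."""
    wealth = 1.0
    for r_asset, r_system, v in zip(asset_returns, system_returns, vix):
        r = r_system if v > threshold else r_asset
        wealth *= 1.0 + r
    return wealth


def sweep_thresholds(asset_returns, system_returns, vix, thresholds):
    """Terminal wealth for each candidate VIX threshold (the curve in Fig 1)."""
    return {
        t: gated_terminal_wealth(asset_returns, system_returns, vix, t)
        for t in thresholds
    }


# Hypothetical toy data: the system does well when VIX is high, poorly otherwise.
asset_returns = [0.01, -0.02, 0.015, -0.03, 0.02, -0.01]
system_returns = [-0.005, 0.02, -0.01, 0.03, -0.015, 0.01]
vix = [15, 30, 18, 35, 20, 28]

curve = sweep_thresholds(asset_returns, system_returns, vix, range(10, 41, 5))
for threshold, wealth in curve.items():
    print(threshold, round(wealth, 4))
```

With a threshold above every VIX reading the result collapses to pure buy & hold, which is the flat tracking region described above.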

Video 1. Stepping the Equity Curve system through linear VIX range.

Quantitative Candlestick Pattern Recognition (Part 2 -- What's this Natural Language Processing stuff?)

2010-08-02T23:25:00.000-07:00

I wanted to briefly add one more thought regarding the temporal nature of probabilities as was alluded to in my correspondence with Adam, as well as the prior closing comments on the Chaos post (structure coalescing and dispersing).

I will borrow from the field of Natural Language Processing and introduce one common visual description of how the states evolve over time using something called a Lexical Dispersion Plot.

Fig 1. Lexical (cluster state vocabulary) Dispersion Plot of Clustered Candlestick States over time

In studies of language, we are often interested in observing how statistical patterns and relationships of sounds, characters, and words, evolve over time. Natural Language Processing is an entire field that has been dedicated to finding proper tools and vernacular to describe such statistics. The idea of using a lexical dispersion plot, is to observe how the lexicon itself evolves over time. To give a simple example, we might take a corpus of common pop culture texts borrowed from some library, and look at the occurrence of the following three word states; "spider", "man", and "spider man". The first two terms are isolated words, and the third term is called a bigram, which is a joint occurrence of two states in sequential order.
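A dispersion plot only needs the positions at which each term occurs. The following Python sketch (my own helper names, on a made-up sentence rather than a real corpus) finds those positions for single words and for a bigram, and prints a crude text version of the plot.

```python
def dispersion_offsets(tokens, ngram):
    """Positions at which `ngram` (a tuple of tokens) starts in the token list."""
    n = len(ngram)
    return [
        i for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) == ngram
    ]


def text_dispersion(tokens, ngrams, width=60):
    """One row per n-gram, with '|' marking where it occurs along the text."""
    scale = max(len(tokens) - 1, 1)
    rows = []
    for ngram in ngrams:
        row = ["."] * width
        for pos in dispersion_offsets(tokens, ngram):
            row[pos * (width - 1) // scale] = "|"
        rows.append("%-12s %s" % (" ".join(ngram), "".join(row)))
    return "\n".join(rows)


# Hypothetical toy "corpus"
tokens = "the spider bit the man and the spider man saved the man".split()
terms = [("spider",), ("man",), ("spider", "man")]
print(text_dispersion(tokens, terms))
```

The same helper works on a sequence of cluster state labels instead of words, which is how it relates to the candlestick states in Fig 1.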

Now, although I haven't created the proposed lexical dispersion plot for the above scenario, one could reasonably expect the number of occurrences of the single words, spider and man, to be relatively frequent and uniform from about 1900 to say 1960, while the joint pair (spider,man) might occur relatively sparsely. However, beyond the 60s, we would notice an increase in the joint pair (spider,man) as the fictional character began to grow in popularity in the collective pop consciousness. We might also expect a large frequency of the bigram to occur with the recent popularity of the films. However, it's possible that a few hundred years later, the joint term and character popularity might just wane and eventually die off, even though the two unigram terms (spider and man) are still frequently observed.

Ok, so what's the point to this? Well, we are commonly taught in statistics that there is a population that exists to describe the ultimate best statistical model of any observational set that lies somewhat beyond the notion of time (much like Plato's ideas of forms existing behind the scenes to describe all nature over all time, for philosophy fans).

But one of the things that disturbed me earlier on is exactly what I described in the prior paragraph on the joint bigram of spider man, which is that sometimes we have to pragmatically shed some of our beliefs about 'ideal' populations and just try to observe statistical phenomena as they occur temporally. As mentioned in the Chaos quote, some patterns just spontaneously occur (spider man) for a while, then disappear over time. So the notion of a larger population existing behind the scenes (and all the statistical rigor associated with it) might be either overkill or even misleading towards our goal of trying to capture the essence of fleeting patterns. From a statistical viewpoint, I suppose I would lean more on the side of the Bayesian inference camp (constantly updating beliefs online, rather than the frequentist approach).

It's common knowledge in markets that financial time series are not IID (independent and identically distributed) over time. Rather, we accept that there are clusters of regions of behavior that tend to occur frequently together, and likewise, disappear over time (often reappearing again, though not always). This body of knowledge, specifically related to volatility, is sensibly labeled as heteroscedasticity (differing variance) as opposed to homoscedasticity (constant variance) of observations. We might also notice such behavior being binned and quantified into certain 'regimes' of local stability.

Now, if any of the above meandering made any sense, I will describe how it relates to the Quantitative Candlestick Pattern Recognition article. Recall that using clustering, we were attempting to identify a vocabulary of states that best describe a limited set of features (in the example, six states were identified) that best partition related candlestick symbols by state in an unsupervised manner. However, the dispersion plot in Fig 1. shows that viewed from a perspective of a central population, these states are not uniformly distributed (IID) over time, rather, some tend to occur frequently over relatively long periods of time, while others appear and disappear for reasonable windows of time. States one and two in the set tend to occur rather frequently, because they are very small moves (dojis and such), which tend to occur often over time. However, some of the larger moves captured in states 3 and 4, tend to persist for some periods, then disappear over other intervals. The likely explanation, is that larger moves tend to be associated with volatility, which as we know, exhibits heteroscedasticity (clustering together in time). Keep in mind, the dispersion example is not only limited to single symbols over time, but can be extended to any number of n-gram pairs or symbols (such as the two word bigram state for spider man).

With that knowledge in mind, it doesn't always make a whole lot of sense to try to develop and require a central fixed body of pattern statistics and related models over long periods of time, or even require many related statistical tests as necessary (things like n-fold cross validation over very large time series, bootstrap re-sampling methods with shuffling, and requiring decades of backtesting training data to obtain confidence that we found the best pattern vocabulary to describe data for all time). For instance, in one of the better books on statistics for traders, "Evidence based TA," by Aronson, many of the tests were conducted using t-tests of entire bodies of financial series and rules over a long period of time, while rejecting many potential pockets of temporal success since they were thrown in and bootstrapped with much longer periods of data to draw conclusions about statistical significance of hypotheses related to better than chance success.

This is not to say that common trading statistics should be thrown out; not at all. Instead, it is hopefully to try to look at how the information being evaluated is processed over time (for instance, we may look at long term statistics of trade results, but focus more on short term statistics and modelling of the underlying patterns they are dependent upon).

Additionally, we might be interested in breaking up the pattern information stream into smaller segments and observing and adapting to how the segments of data streams evolve and change over time. The key savior or benefit to us, is that these patterns in the data streams do tend to persist together for quite some time (often reasonably long), before dispersing and moving on to new forms of patterns. There are several different machine learning concepts on the horizon that work with evolving (such as adding and pruning pattern model parameters) data streams over time and space. I have been spending some time evaluating one of them recently (although, I'm not saying which at the moment) which looks promising.

Chaos in the Financial Markets?

2010-07-06T09:59:00.000-07:00

Over the years I've had quite a few interested individuals ask me about Chaos and its applications towards trading. Well, as hidden Markov models and speech processing were made popular by James Simons and his team at Renaissance Technologies, one could trace much of the popularity of Chaos theory and its financial applications to Norman Packard and Doyne Farmer, two former physicists working in the area of complex systems. Much of their story is discussed in the book, 'The Predictors,' by Thomas Bass. The two were on the forefront of new research in areas of complexity and chaos and had decided to parlay their knowledge into applications towards the financial markets as they founded the company called Prediction Company in Santa Fe, New Mexico. The company was acquired by UBS in 2005.

Fig 1. 2D slice of Lorenz Attractor Phase Space signature A.K.A. The butterfly effect.

We could spend a lot of time discussing various facets of Chaos, as it is a very large field with many different related fields (such as fractals and complexity). But I want to focus on outlining a very simple understanding of why the field seemed so fascinating to traders and quants alike. Most experts in time series run a slew of tests to demonstrate that markets exhibit no predictable order. However, what makes Chaos so fascinating is that certain time series may seem to pass a battery of common statistical tests for randomness, yet are perfectly deterministic.

Chaos is a field of science that is engaged in studying non-linear dynamical behavior of systems. A popular example might be how different planetary bodies exert forces upon one another (see Poincare), or turbulent flow of various particles. Those engaged in studying such systems often like to observe signatures in a domain known as phase space or state space. Rather than look at the time series unfold over time, they are looking for a type of order that underlies the trajectories of the system state dynamics as it unfolds over time. The plot of the trajectory may show structural order in a series that appears random in the time domain. One of the most popular attractor signatures is the well known Lorenz attractor, which is better known to popular literature as the butterfly effect. Fig 1, displayed earlier, shows a 2D slice of the phase space trajectory, that you can run over at chaos applet. The famous signature displays a fascinating case of underlying order that relates to dynamic atmospheric convection in three dimensions.

A simpler and more applicable model that we will look at for illustration purposes is the well known Logistic Map (also known as the quadratic map and the Feigenbaum map). This equation of a non-linear dynamical trajectory was investigated by, and is attributed to, Robert May, a biologist studying models of fish populations.

The recursive equation for the Logistic Map is: xnext = r*x(1-x)
Note that this is a feedback system which has some control over the dynamics of the system model by varying the value r. What you'll see if you plot it out vs. the control coefficient, r, is that the series moves from a stable system to one which bifurcates into periodic cycles; beyond roughly r = 3.57, and up to the value 4, it behaves chaotically (apart from narrow periodic windows). Chaotic behavior is aperiodic, which (like financial series) never repeats exactly; but (unlike financial series) has an underlying deterministic order. In order to see the beauty of chaos and how it exhibits determinism, let's first look at the time series.
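The recursion is easy to try out. Below is a small Python sketch (the starting value, the series lengths, and the helper names are my own choices) that iterates the map and counts how many distinct values the series settles into for a few values of r: one value for a stable fixed point, two for a period-2 cycle, and many for a chaotic setting.

```python
def logistic_map(r, x0=0.2, n=200):
    """Iterate xnext = r*x*(1-x) and return the first n values."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def distinct_tail_values(r, n=2000, tail=64, digits=6):
    """Count distinct (rounded) values among the last `tail` iterates."""
    xs = logistic_map(r, n=n)
    return len({round(x, digits) for x in xs[-tail:]})


for r in (2.8, 3.2, 3.9):
    print(r, distinct_tail_values(r))
```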

Fig 2. Time Series of Logistic Map Equation.

Notice the 1st plot, which shows the 1st differenced time series, shows no signs of periodicity or determinism, nor does the cumulative walk displayed in the 2nd. However, if we look at the signature of the same series in phase space, we see a completely different picture.

Fig 3. Phase State Plot of Logistic Map

Notice it is clearly deterministic in this figure. I.e. given any point on the x axis, we can easily determine the exact corresponding point one step into the future on the y axis. This should be fairly obvious, since the equation we started out with xnext=r*x(1-x) = r*x-r*x^2 is a negative parabolic curve. However, many such time series do not have such a simple equation and must be tested in various ways to determine structural chaos. There are other issues to contend with as well, including sensitivity to initial conditions and divergent trajectories due to finite computational precision.
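To make the point concrete, here is a hedged Python sketch (helper names and parameters are mine). It builds the phase space pairs (x, xnext) for the map at r = 4, shows that the ordinary lag one linear correlation is small, and then shows that every pair nevertheless implies the same r, i.e. all points sit exactly on one parabola.

```python
import math


def logistic_map(r, x0=0.2, n=500):
    """Iterate xnext = r*x*(1-x) and return the first n values."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


xs = logistic_map(4.0)
pairs = list(zip(xs[:-1], xs[1:]))  # phase space points (x, xnext)

# A linear test sees very little structure...
lag1_corr = pearson([x for x, _ in pairs], [y for _, y in pairs])

# ...yet each point gives back the same control coefficient r.
implied_r = [y / (x * (1.0 - x)) for x, y in pairs if x * (1.0 - x) != 0.0]

print("lag-1 correlation:", round(lag1_corr, 3))
print("implied r range:", min(implied_r), max(implied_r))
```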

Ok, now that we understand all the hoopla about Chaos and a seemingly random signal having an underlying deterministic signature, what about a common financial time series?

Fig 4. Typical Random Walk (Financial) Time Series.

The Random Walk shows no discernible order, nor periodicity; similar to the logistic equation series. But what about if we observe the phase space trajectory?

Fig 5. Phase Space trajectory of random walk.

No Dice. Notice that the random walk shows zero determinism, hence the gaussian nature. There are numerous methods to display higher order dimensional metrics as well (correlation lag plots, Lyapunov exponents, etc.), and other than something called the compass rose, I've not personally seen much evidence of deterministic chaos in raw financial time series. Incidentally, the phase plot here is equivalent to a lag one scatterplot of returns for those more familiar with finance related statistics.
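One rough way to put a number on the difference between Fig 3 and Fig 5 is sketched below in Python. This is my own ad hoc measure, not a standard chaos test: sort the lag one pairs by the current value, cut them into bins, and compare the variance of the next value inside each bin to the overall variance. A ratio near 1 means the current value tells us essentially nothing about the next one; a ratio near 0 means the next value is almost a function of the current one. The gaussian steps stand in for returns and are simulated, not market data.

```python
import random
from statistics import pvariance


def logistic_map(r, x0=0.2, n=2000):
    """Iterate xnext = r*x*(1-x) and return the first n values."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def within_bin_variance_ratio(series, bins=20):
    """Average variance of the next value within bins of the current value,
    divided by the overall variance of the next value."""
    pairs = sorted(zip(series[:-1], series[1:]))
    size = len(pairs) // bins
    total = pvariance([nxt for _, nxt in pairs])
    within = [
        pvariance([nxt for _, nxt in pairs[i * size:(i + 1) * size]])
        for i in range(bins)
    ]
    return (sum(within) / bins) / total


random.seed(7)
gaussian_steps = [random.gauss(0.0, 1.0) for _ in range(2000)]

ratio_random = within_bin_variance_ratio(gaussian_steps)
ratio_logistic = within_bin_variance_ratio(logistic_map(4.0))
print("random steps:", round(ratio_random, 3))
print("logistic map:", round(ratio_logistic, 3))
```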

A last note on the Prediction Company, is that there is an often referenced paper on the Mackey-Glass equation (a non linear dynamic model of blood cell production) by Meyer and Packard, whereby they used genetic algorithms to find underlying conditional order rule sets for the series.

I'll end with perhaps the best excerpt from 'the Predictors,' which echoes much of my own focus and discoveries over the last decade...

"One of the fundamental truths about the markets is that the dynamics are nonstationary," Norman explains. "We see no evidence for the existence of an attractor with stable statistical properties. This is what characterizes chaos -- having an attractor with stable statistical properties-- so what we are seeing is not chaos. It is something else. Call it an 'even-stranger-than-strange attractor,' which may not really be an attractor at all.

The market might enter an epoch where some structure coalesces and sits there in a statistically stationary pattern, but then invariably it disappears. You have clouds of structure that coalesce and evaporate, coalesce and evaporate. Prediction Company's job is to find those pieces of structure that have the strongest signal and persist the longest. We want to know when the structure is beginning to emerge or dissolve because, once it begins to dissolve, we want to stop betting on it."

...excerpt from the Predictors, Thomas A. Bass (1999).

Quantitative Candlestick Pattern Recognition (HMM, Baum Welch, and all that)

2010-06-10T15:24:00.000-07:00

Fig 1. Clustering based approach to candlestick Pattern Recognition.

I've been reading a book titled, 'The Quants,' that I'm sure will tantalize many traders with some of the ideas embedded within. Most notably (IMO), the notion that Renaissance's James Simons hired a battery of cryptographers and speech recognition experts to decipher the code of the markets. Most notable of the hired experts was Leonard Baum, co-developer of the Baum-Welch algorithm, an algorithm used to estimate the parameters of hidden Markov models. Now while I don't plan to divulge full details of my own work in this area, I do want to give a brief example of some methods to possibly apply with respect to these ideas.

Now most practitioners of classical TA have built up an enormous amount of literature around candlesticks, the Japanese symbols used to denote a symbolic formation around open, high, low, and close daily data. The problem as I see it, is that most of the literature that is available only deals with qualitative recognition of patterns, rather than quantitative.

We might want to utilize a more quantitative approach to analyzing the information, as it holds much more potential than single closing price data (i.e. the information in each candle contains four dimensions of information). The question is, how do we do this in a quantitative manner?

One well known method of recognizing patterns is the supervised method. In supervised learning, we feed the learner a correct list of responses to learn and configure itself from; over many iterations, it comes up with the optimal configuration to minimize errors between the data it learns to identify, and the data we feed as examples. For instance, we might look at a set of hand-written characters and build a black box to recognize each letter by training via a neural net, support vector machine, or other supervised learning device. However, you probably don't want to spend hours classifying the types of candlesticks by name. Those familiar with candlesticks might recognize numerous different symbols: shooting star, hammer, doji, etc., each connoting a unique symbolic harbinger of the future to come. From a quantitative perspective, we might be more interested in understanding the Bayesian perspective; i.e. P(upday|hammer)=P(upday,hammer)/P(hammer), for instance.
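The conditional probability above is just a ratio of counts. Here is a tiny Python sketch with made-up observations (the labels and the function name are hypothetical, purely to show the arithmetic):

```python
def conditional_probability(observations, pattern, outcome):
    """P(outcome | pattern) = P(outcome, pattern) / P(pattern), from counts.

    `observations` is a list of (pattern_today, outcome_next_day) pairs.
    """
    n = len(observations)
    p_pattern = sum(1 for p, _ in observations if p == pattern) / n
    p_joint = sum(
        1 for p, o in observations if p == pattern and o == outcome
    ) / n
    return p_joint / p_pattern


# Hypothetical hand-labeled sample, not real market data
observations = [
    ("hammer", "down"), ("doji", "up"), ("hammer", "down"), ("other", "up"),
    ("hammer", "up"), ("doji", "down"), ("other", "down"), ("hammer", "down"),
    ("other", "up"), ("doji", "up"),
]

print(conditional_probability(observations, "hammer", "up"))
```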

But how could we learn the corpus of symbols, without the tedious method of identifying each individual symbol by hand? This is a problem that may better be approached by unsupervised learning. In unsupervised learning, we don't need to feed the learner labeled examples; it finds relationships by itself. Typically, the relationships are established as a function of distance between exemplars. Please see the data mining text (Witten/Frank) in my recommended list in order to examine the concepts in more detail. In this case I am going to use a very common unsupervised learner called k-means clustering.

Fig 2. Graph of arbitrary window of QQQQ data

Notice the common time ordered candlestick form of plotting is displayed in Fig 2. Now using k-means clustering, with a goal of identifying 6 clusters, I tried to automatically learn 6 unique candlestick forms based on H,L,Cl data relative to Open in this example. The idea being that similar candlestick archetypes will tend to cluster by distance.
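For those who want to try the clustering step, here is a minimal pure Python sketch. It is not my original script: the candles are invented, it uses 3 clusters instead of 6 to keep the toy data small, the exact feature scaling (H, L, Cl as fractions of the Open) is an assumption, and the deterministic farthest-point seeding is a simplification of the usual random initialization.

```python
def candle_features(o, h, l, c):
    """High, low and close expressed relative to the open."""
    return ((h - o) / o, (l - o) / o, (c - o) / o)


def dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(points, k, iterations=50):
    """Basic Lloyd's k-means with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda pt: min(dist2(pt, c) for c in centers))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each candle joins its nearest center.
        labels = [
            min(range(k), key=lambda j: dist2(pt, centers[j]))
            for pt in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical candles as (open, high, low, close)
candles = [
    (100, 103, 99.8, 102.8), (50, 51.6, 49.9, 51.5),    # big up bodies
    (100, 100.2, 97, 97.2), (80, 80.1, 77.5, 77.7),     # big down bodies
    (100, 100.5, 99.5, 100.05), (60, 60.3, 59.7, 59.98),  # doji-like
]
points = [candle_features(*c) for c in candles]
labels, centers = kmeans(points, k=3)
print(labels)
```

Because the features are relative to the Open, candles from instruments at very different price levels can land in the same cluster.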

Fig 3. Candlestick symbols sorted by 6 Clusters

Notice in Figure 3, that we can clearly see the k-means clustering approach automatically recognized large red bodies, green bodies, and even more interestingly, there is a preponderance of hammers that were automatically recognized in cluster number 5.

So given that we have identified a corpus of 6 symbols in our language, of what use might this be? Well, we can take and run a cross tabulation of our symbol states using a program like R.

Fig 4. Cross Tabulation of Clustered States.

One of the things that strikes me right away is that there are an overwhelming number of pairs with state 1s following state 5; Notice the 57% frequency trounces all other dependent states. Now what is interesting about this? Remember we established that state 5 corresponds to a hammer candlestick? Well, common intuition (at least from my years of reading) expects that a hammer is a turning point that is followed by an up move. Yet, in our table we see it is overwhelmingly followed by state 1, which if you look back, at the sorted by cluster diagram, is a very big red down candlestick. This is completely opposite to what our common body of knowledge and intuition tells us.
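The cross tabulation itself is simple to reproduce. The Python sketch below counts, for each state, how often each state follows it, and converts the counts to row frequencies. The state sequence is invented (not the QQQQ data); I chose it so that state 5 is followed by state 1 in 4 of 7 cases and by state 6 in 1 of 7, which gives proportions of the same size as those quoted in this post.

```python
from collections import Counter, defaultdict


def transition_table(states):
    """Row-normalized frequencies of next state given current state."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(states[:-1], states[1:]):
        counts[cur][nxt] += 1
    return {
        cur: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for cur, row in counts.items()
    }


# Hypothetical sequence of cluster state IDs in time order
states = [5, 1, 2, 5, 1, 3, 5, 6, 2, 5, 1, 4, 5, 2, 1, 5, 1, 2, 5, 3, 2]

table = transition_table(states)
for nxt, freq in sorted(table[5].items()):
    print("P(next=%d | state=5) = %.3f" % (nxt, freq))
```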

In case it might seem unbelievable to fathom, we can resort the data again, this time in original time order, but with clusters identified.

Fig 5. Visual Inspection of hammer (state5) likely followed by down candle (state 1)

We can go back and resort the data and identify the states via the resorted cluster ID staircase level(5), or use labels to more simply identify the case of hammer(5) and its following symbol. Notice that contrary to common knowledge, our automatic recognition process and tabulated probability matrix, found good corroboration with our visual inspection. In the simple window sample (resized to improve visibility), 4 of the 5 instances of the hammer (state 5) were followed by a big red down candle (state 1). Now one other comment to make is that in case the state 5 is not followed by state 1 (say, we bet on expecting a down move), it has a 14.3% chance of landing in state 6 on the next move, which brings our likelihood of a decent sized down move to 71.4% overall.

We can take these simple quantitative ideas and extend them to MCMC dynamic models, Baum Welch and Viterbi algorithms, and all that sophisticated stuff. Perhaps one day even mimicking the mighty Renaissance itself? I don't know, but any edge we can add to our arsenal will surely help.

Take some time to read 'The Quants,' if you want a great layman's view of many related quant approaches.

There may be some bugs in this post, as google just seemed to update their editing platform, and I'm trying to iron out some of the kinks.

The Kalman Filter For Financial Time Series

2010-05-25T10:41:00.001-07:00

Every now and then I come across a tool that is so bogged down in pages of esoteric mathematical calculations that it becomes difficult to get even a simple grasp of how or why it might be useful. Even worse, you exhaustively search the internet to find a simple picture that might express a thousand equations, but find nothing. The Kalman filter is one of those tools: extremely useful, yet very difficult to understand conceptually because of the complex mathematical jargon. Below is a simple plot of a Kalman filtered version of a random walk (for now, we will use that as an estimate of a financial time series).

Fig 1. Kalman Filter estimates of mean and covariance of Random Walk

The kf is a fantastic example of an adaptive model, more specifically, a dynamic linear model, that is able to adapt to an ever changing environment. Unlike a simple moving average or FIR that has a fixed set of windowing parameters, the Kalman filter constantly updates the information to produce adaptive filtering on the fly. Although there are a few TA based adaptive filters, such as the Kaufman Adaptive Moving Average and variations of the exponential moving average, none captures the optimal estimation of the series in the way that the KF does. In the plot in Fig 1, we have a blue line which represents the estimated dynamic 'average' of the underlying time series, a red line which represents the time series itself, and lastly, dotted lines which represent a scaled covariance estimate of the time series against the estimated average. Notice that unlike many other filters, the estimated average is a very good measure of the 'true' moving center of the time series.

Without diving into too much math, the following are the well known 'state space equations' of the kf:
x_t = A*x_{t-1} + w
z_t = H*x_t + v

Although these equations are often expressed in state space or matrix representation, making them somewhat complicated to the layman, if you are familiar with simple linear regression it might make more sense.
Let's define the variables:
x_t is the hidden variable that is estimated, in this case it represents the best estimate of the dynamic mean or dynamic center of the time series
A is the state transition matrix, or I often think of it as similar to the autoregressive coefficient in an AR model; think of it as Beta in a linear regression here.
w is the noise of the model (the process noise).

So, we can think of the equation x_t = A*x_{t-1} + w as being very similar to the basic linear regression model, which it is. The main difference is that the kf constantly updates the estimates at each iteration in an online fashion. Those familiar with control systems might understand it as a feedback mechanism that adjusts for error. Since we cannot actually 'see' the true dynamic center in the future, only estimate it, we think of x as a 'hidden' variable.

The other equation is linked directly to the first.
z_t = H*x_t + v
z_t is the measured noisy state variable that has a probabilistic relationship to x.
x_t we recognize as the estimate of the dynamic center of the time series.
v is the noise of the measurement.

Again, it is a linear model, but this time the equation contains something we can observe: z_t is the value of the time series we are trying to capture and model with respect to x_t. Alongside the estimate of x_t, the filter maintains an estimate of the covariance, or co-movement, between the observed variable (the time series value) and the estimate of the dynamic variable x. You can also think of the scaled envelope it creates as similar to a standard deviation band that predicts the future variance of the signal with respect to x.

Those familiar with hidden markov models, might recognize the concept of hidden and observed state variables displayed here.

Basically, we start out estimating our guess of the average and covariance of the hidden series based upon measurements of the observable series, which in this case are simply the normal parameters N(mean, std) used to generate the random walk. From there, the linear matrix equations are used to estimate the values of cov x and x. The key is that once an estimate is made, the estimate of x is then checked against the actual observable time series value, z, and a parameter called K is adjusted to update the prior estimates. Each time K is updated, the value of the estimate of x is updated via:
x_new_est = x_est + K*(z_t - H*x_est). The value of K generally converges to a stable value, when the underlying series is truly gaussian (as seen in fig 1. during the start of the series, it learns). After a few iterations, the optimal value of K is pretty stable, so the model has learned or adapted to the underlying series.
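Here is a minimal scalar version of the predict and update cycle in Python, so the roles of A, H and K can be seen in a few lines. It is a sketch and not the code behind Fig 1: the noise variances Q and R, the starting values x0 and P0, and the simulated random walk are all assumptions I picked for illustration.

```python
import random


def kalman_1d(zs, A=1.0, H=1.0, Q=0.01, R=1.0, x0=0.0, P0=1.0):
    """Scalar Kalman filter.

    Q is the variance of the model noise w, R the variance of the measurement
    noise v, P the variance of the estimate of x. Returns the filtered
    estimates of x and the gain K used at each step.
    """
    x, P = x0, P0
    estimates, gains = [], []
    for z in zs:
        # Predict: push the previous estimate through the state equation.
        x_pred = A * x
        P_pred = A * P * A + Q
        # Update: K weighs the new measurement against the prediction.
        K = P_pred * H / (H * P_pred * H + R)
        x = x_pred + K * (z - H * x_pred)
        P = (1.0 - K * H) * P_pred
        estimates.append(x)
        gains.append(K)
    return estimates, gains


# Simulated random walk observed with noise (made-up parameters)
random.seed(1)
level, observations = 0.0, []
for _ in range(500):
    level += random.gauss(0.0, 0.1)
    observations.append(level + random.gauss(0.0, 1.0))

estimates, gains = kalman_1d(observations)
print("first gains:", [round(k, 3) for k in gains[:5]])
print("last gain:", round(gains[-1], 4))
```

Note that in this linear gaussian setup the sequence of K values depends only on A, H, Q, R and P0, not on the data, which is why it settles down after the first few iterations.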

Some advantages to the Kalman filter are that it is predictive and adaptive, as it looks forward with an estimate of the covariance and mean of the time series one step into the future, and unlike a Neural Network, it does NOT require stationary data.
Those working on the Neural Network tutorials, hopefully see a big advantage here.

It has a very close to smooth representation of the series, while not requiring peeking into the future.

Disadvantages are that the filter model assumes linear dependencies, and is based upon noise terms that are gaussian generated. As we know, financial markets are not exactly gaussian, since they tend to have fat tails more often than we would expect, non-normal higher moments, and the series exhibit heteroskedasticity clustering. Another more advanced filter that addresses these issues is the particle filter, which uses sampling methods to generate the underlying distribution parameters.

--------------------------------------------------------------------------------
Here are some references which may further help in understanding of the kalman filter.
In addition, there is a Kalman smoother in the R package, dlm.

http://www.swarthmore.edu/NatSci/echeeve1/Ref/Kalman/ScalarKalman.html

If you are interested in a Python based approach, I highly recommend the following book... "Machine Learning: An Algorithmic Perspective"

Not only is there a fantastic writeup on hidden markov models and kalman filters, but there is real code you can replicate. It is one of the best practical books on Machine Learning I have come across-- period.