The bad news is that I didn't fare very well in the contest (edit: looks like I made the top 25%); however, a lot of interesting issues surfaced as the contest progressed. I joined the contest fairly late and read some posts about how the data had been altered midstream. To understand this better, it helps to describe the data in more detail. The original dataset comprised two subsets: a train set with 40000 observations and 211 features, and a test set with 40000 observations and 147 features. The features could be broken into an ID column, a mix of 25 unlabeled continuous and discrete features, and 183 ordered time series returns. The time series returns were further broken down into -D2, -D1, intraday 1-minute, +D1, and +D2 returns: the 1-minute data was a range of 179 intraday one-minute returns, while the -D returns were the prior overnight/daily returns and the +D returns were the post daily returns, respectively. The goal was to predict the remaining intraday returns and the post two-day returns beyond the given time series points. In other words, we were given roughly 2/3 of the prior returns and needed to predict the remaining 1/3 out of sample. Contestants submitted predictions for the test data and received feedback on a reserved portion of the test set, displayed as public leaderboard results. Typically, contestants can use the public leaderboard as a gauge of how well their models perform on hidden, out-of-sample data; overall final performance on the private leaderboard is withheld until the contest ends.
Many posters quickly figured out that the task was rather daunting and suspected that a zero baseline (just predict all zeros) or the median of the time series observations would be pretty difficult to beat. What changed mid-contest, however, was the possibility that some posters had gamed the data and reverse engineered it to obtain very good scores on the public leaderboard. In response, the contest administrators added more test data: the private test set was increased from 40000 unseen observations to 120000. This was done to deter gaming of the system, but it also made cross-validation much harder to gauge, since leaderboard feedback covered only 20000 of the observations. So we had to find the most robust model given only 25% of the data for training, make it robust to the 75% of new hidden data, and do so with performance feedback on only 16.7% of the hidden data displayed on the public leaderboard. Additionally, some posters suspected that the structure of the new test data had been distorted from its original characteristics in order to avoid any gaming. I'm hoping to get more feedback from the contest organizers and data providers in order to understand the data better.
My own approach was to use a gradient boosted model from the R gbm package. GBMs have historically performed well in Kaggle competitions, have been shown to be among the best performing models on a wide variety of datasets (see Kuhn, Applied Predictive Modeling), and are known to be a good choice for a robust out-of-the-box model. Among their benefits are the ability to handle NAs and unscaled data, so there is little need to worry about pre-processing. As I didn't have much time to explore other options, I (perhaps wrongly) spent most of my time improving the GBM's cross-validated performance. Like others, I found that my public leaderboard score was consistently at the high end of my CV results. Since there was so much discussion about data manipulation, I (again likely wrongly) stubbornly trusted the CV results and assumed the leaderboard set might have been biased towards the worse performing data. In retrospect, I should have spent more time on extremely simple models (like means and medians) rather than fitting a relatively small training set with a complex model. I also found many odd data characteristics: intraday columns that trended perfectly up in the training set and down in the test set, some series with very little noise, others with a lot. My intuition was that not all of the time series were true financial time series, and that the distributions were not stable from the training to the test set. This made the problem more difficult, IMO, than dealing with real data and having knowledge of the underlying instruments. I attempted to filter the training and test samples by comparing columns with t-tests and setting a threshold for removing unusually different time series columns, as sketched below.
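As a rough illustration of that filter, here is a minimal sketch of the same idea (a simplified version of the feat.dist.sel() function in the full script at the end of the post; the function and argument names here are illustrative only):

# Simplified sketch of the train-vs-test distribution filter: keep only the time
# series columns whose Welch t-test p-value exceeds a threshold, i.e. columns with
# no strong evidence of a shift between the two samples.
keep.similar.cols <- function(trn.ts, tst.ts, p.threshold = 0.05) {
  p.vals <- sapply(seq_len(ncol(trn.ts)),
                   function(i) t.test(trn.ts[, i], tst.ts[, i])$p.value)
  which(p.vals > p.threshold)
}
# usage: ts.sel <- keep.similar.cols(all.trn.iis.ts, all.tst.iis.ts)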
Fig. 2. First 25 unlabeled features, highlighting the unstable trn/tst distributions in Feature_7.
Some posters observed that Feature_7 was unstable between the train and test samples. Unfortunately, that was also the feature highlighted as most important for both the D+1 and D+2 predictions from my GBM model. I left it in, as I didn't have enough time to change strategies at that point and figured I still had reasonable CV estimates that would beat the zero and median estimates on average. I ended up using the medians of the training series for the out-of-sample 1-minute predictions, and a GBM to predict the next two out-of-sample daily returns. We were also given a vector of loss weights for the training data and were instructed to use weighted mean absolute error (WMAE) as the loss function. R's gbm package lets us set the 'laplace' distribution, which corresponds to absolute loss. I also wrote out the loss function to record it manually in the cross-validation loops and get a better feel for the sensitivity of the validation folds. A loop recorded performance on 5 validation folds across several numbers of trees; the optimal value was then used to rebuild the final model on all the training data, which could then be used to predict the test set.
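To make the weighted loss concrete, here is a minimal sketch of the weighted mean absolute error as I recorded it inside the CV loops (the function name wmae and its arguments are illustrative, not the official Kaggle metric code; it mirrors the mean of weight times absolute error computed in the script below):

# Weighted mean absolute error as recorded manually in the CV loops (illustrative).
wmae <- function(actual, predicted, weights) {
  mean(weights * abs(actual - predicted))
}
# e.g. score the zero baseline on a validation fold of the D+1 returns
# (val.day.D1 and wt.day.val are built inside the CV loop in the full script below):
# wmae(val.day.D1, rep(0, length(val.day.D1)), wt.day.val)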
Fig. 3 below shows cross-validated results from a simple GBM model using 5-fold validation with shrinkage = 0.1, n.minobsinnode = 50, and interaction.depth = 1. The red line shows the median value for the zero baseline as a comparison to the WMAE.val folds across different tree iterations. Notice that selecting, say, n.trees = 900 yields a range of results that were all superior to the baseline median. I would expect a score somewhere around 1765 for this model; however, as explained earlier, performance on the private test set was difficult to match because the feature distributions were different. I used the run below for illustration; the actual model I built used a lower shrinkage of 0.01 and performed much better, but it took a long time to run since it needed about 8000 trees (a rough sketch of those settings follows Fig. 3).
Fig. 3. 5-fold validation results for a simple GBM model (shrinkage = 0.1, n.minobsinnode = 50, interaction.depth = 1).
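For reference, this is roughly what the slower final settings looked like; a minimal sketch only, assuming the trn.day.1 data frame and wt.day.trn weights built in the full script below (the object name gbm.final.D1 is mine, and the 0.01/8000 values come from the description above):

library(gbm)
# Lower shrinkage with many more trees, as described above (illustrative settings).
gbm.final.D1 <- gbm(D1 ~ ., data = trn.day.1,
                    distribution = "laplace",   # absolute-error loss
                    weights = wt.day.trn,       # WMAE weights for the daily returns
                    n.trees = 8000,             # more trees to compensate for
                    shrinkage = 0.01,           # the smaller learning rate
                    interaction.depth = 1,
                    n.minobsinnode = 50)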
In conclusion, I was able to use R to build a GBM model with good cross-validated performance against the zero baseline, but the model did not perform very well on the hidden private data. I believe the two big factors were the differences between the train and test datasets, and a much larger hold-out set that was different enough that the true results were much worse than 5-fold CV predicted. It's equally possible that the model was too complex for such a noisy dataset.
Some posters wondered why WMAE was used as opposed to RMSE. In financial time series, MAE is often used as a loss function because it is more robust to outliers: large outliers not only skew the data and the fit, but under RMSE their contributions are effectively squared.
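A toy example of that effect (the numbers are made up purely for illustration):

# One large outlier among otherwise small return errors
err  <- c(0.01, -0.02, 0.015, -0.01, 0.5)
mae  <- mean(abs(err))        # ~0.11: the outlier enters linearly
rmse <- sqrt(mean(err^2))     # ~0.22: the outlier's contribution is squared first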
Lastly, although this was a great exercise in machine learning and real-world modeling, in my experience targeting continuous variables with a weighted mean absolute error loss may not be the best fitness target. What it illustrates well is the practical difficulty of predicting financial time series. Thanks to Winton and Kaggle for a great experience!
#####################################################
Code is below. Note: I used Pretty R to embed the code, but there are color clashes with the background. You can copy and paste it locally to view it more clearly. If anyone has suggestions on how to change the Pretty R font colors, please let me know.
library(gbm)
library(doParallel)
library(reshape2)

# Load data (raw csv reads kept for reference)
#trn <- read.csv("E:/Kaggle/Winton/train.csv")   # 40k X 211
#tst <- read.csv("E:/Kaggle/Winton/test_2.csv")  # 120k X 147
trn <- readRDS("E:/Kaggle/Winton/trn.rds")
tst <- readRDS("E:/Kaggle/Winton/tst.rds")

# Carry-forward interpolation: fill NAs with the previous observation
interp <- function(x){
  x <- as.numeric(x)
  if(is.na(x[1])) x[1] <- x[!is.na(x)][1]
  #if(is.na(x[length(x)])) x[length(x)] <- x[length(x)-1]
  xsub <- x[2:length(x)]
  #gap.na <- which(is.na(xsub))
  for(k in 2:length(xsub)){
    if(is.na(xsub[k])) xsub[k] <- xsub[k-1]
  }
  x <- c(x[1],xsub)
  return(x)
}

# Impute discrete features by sampling from the observed levels
disc.impute <- function(x){
  xfactor <- levels(factor(x))
  x[is.na(x)] <- sample(xfactor,length(which(is.na(x))),replace=TRUE)
  return(as.numeric(x))
}

# Impute continuous features by sampling uniformly within the observed range
cont.impute <- function(x){
  x[is.na(x)] <- runif(length(which(is.na(x))),
                       min(range(na.omit(x))),max(range(na.omit(x))))
  return(x)
}

# Split indices into k train/validation folds
generate.folds <- function(idx, k){
  fold.idx <- holdout <- kfold <- list()
  fold.len <- floor(length(idx)/k) # if idx does not divide evenly ignore last few observations
  for(i in 1:(k-1)){
    holdout[[i]] <- sample(idx[!(idx %in% unlist(holdout))],fold.len)
    kfold[[i]]   <- idx[!(idx %in% holdout[[i]])]
  }
  holdout[[k]] <- idx[!(idx %in% unlist(holdout))]
  kfold[[k]]   <- idx[!(idx %in% holdout[[k]])]
  folds <- list(kfold,holdout)
  names(folds) <- c('trn','val')
  return(folds)
}

# Clip values outside the (1-qlim, qlim) quantile range
qlimiter <- function(x,qlim) {
  ifelse(x < quantile(x,(1-qlim),na.rm=TRUE),quantile(x,(1-qlim),na.rm=TRUE),
         ifelse(x > quantile(x,qlim,na.rm=TRUE),quantile(x,qlim,na.rm=TRUE),x))
}

# Keep columns whose train/test t-test p-value exceeds the threshold
feat.dist.sel <- function(mx,my,pth){
  out <- 0
  for(i in 1:dim(mx)[2]){
    out[i] <- t.test(mx[,i],my[,i])$p.value
  }
  sel <- which(out > pth)
  return(sel)
}

# Replace NAs with the column median
interp.med <- function(x) { x[is.na(x)] <- median(na.omit(x)); x }

# column breakpoints
f1 <- 2; f25 <- 26
tsiis1 <- 27; tsiis2 <- 147
tsoos1 <- 148; tsoos2 <- 207
tsD1 <- 208; tsD2 <- 209
wtint <- 210; wtdly <- 211
trn.obs <- dim(trn)[1]; tst.obs <- dim(tst)[1]
clip.th <- 0.9999
discrete.features   <- c(1,5,8,9,10,13,16,20)
continuous.features <- c(2,3,4,6,7,11,12,14,15,17,18,19,21,22,23,24,25)

# preprocess and clean data
all.iis <- rbind(trn[,1:tsiis2], tst[,1:tsiis2])
all.trn.iis.ts <- trn[,tsiis1:tsiis2]
all.trn.oos.ts <- trn[,tsoos1:tsoos2]
all.trn.oos.ts.cln <- apply(all.trn.oos.ts,2,interp.med)
all.trn.oos.D1 <- trn[,tsD1]
all.trn.oos.D2 <- trn[,tsD2]
all.trn.oos.D1.clip <- qlimiter(all.trn.oos.D1,qlim=clip.th)
all.trn.oos.D2.clip <- qlimiter(all.trn.oos.D2,qlim=clip.th)
all.iis.ts <- all.iis[,tsiis1:tsiis2]
all.tst.iis.ts <- all.iis.ts[(trn.obs+1):(trn.obs+tst.obs),]
ts.sel <- feat.dist.sel(all.trn.iis.ts,all.tst.iis.ts,pth=.05)
all.iis.ts.sel <- all.iis.ts[,ts.sel]
all.iis.ts.cln <- apply(all.iis.ts.sel,2,interp.med)
all.trn.iis.ts.cln <- all.iis.ts.cln[1:trn.obs,]
all.tst.iis.ts.cln <- all.iis.ts.cln[(trn.obs+1):(trn.obs+tst.obs),]

# K FOLD CV
folds <- list(); n.fold <- 5
folds <- generate.folds(seq(dim(trn)[1]), n.fold)
#gbmGrid <- expand.grid(interaction.depth = c(5,9,15), n.trees = seq(400,1000,by=200),
#                       shrinkage = 0.1, n.minobsinnode = 50)
out <- res <- list()
tune.par <- seq(10,200,10)
for(z in 1:length(tune.par)){
  cl <- makeCluster(n.fold)
  registerDoParallel(cl)
  res <- list()
  rmcol <- 0
  n.trees <- tune.par[z]
  interaction.depth <- 1
  n.minobsinnode <- 50
  shrinkage <- 0.1
  res <- foreach(k=1:n.fold, .combine=rbind) %dopar% {
    library(gbm)
    trn.fld <- folds$trn[[k]]
    val.fld <- folds$val[[k]]
    # Naive estimates use zero baseline as predictions
    trn.iis <- all.trn.iis.ts.cln[trn.fld,];     val.iis <- all.trn.iis.ts.cln[val.fld,]
    trn.int.oos <- all.trn.oos.ts.cln[trn.fld,]; val.int.oos <- all.trn.oos.ts.cln[val.fld,]
    trn.day.D1 <- all.trn.oos.D1.clip[trn.fld];  val.day.D1 <- all.trn.oos.D1.clip[val.fld]
    trn.day.D2 <- all.trn.oos.D2.clip[trn.fld];  val.day.D2 <- all.trn.oos.D2.clip[val.fld]
    wt.int.trn <- trn[trn.fld,wtint]; wt.int.val <- trn[val.fld,wtint]
    wt.day.trn <- trn[trn.fld,wtdly]; wt.day.val <- trn[val.fld,wtdly]
    trn.zro <- rep(0,dim(trn.iis)[1]) # zero pred for intraday
    val.zro <- rep(0,dim(val.iis)[1]) # zero pred for intraday
    trn.err.int <- wt.int.trn*abs(trn.int.oos-trn.zro)
    val.err.int <- wt.int.val*abs(val.int.oos-val.zro)
    trn.err.D1 <- wt.day.trn*abs(trn.day.D1-trn.zro)
    val.err.D1 <- wt.day.val*abs(val.day.D1-val.zro)
    trn.err.D2 <- wt.day.trn*abs(trn.day.D2-trn.zro)
    val.err.D2 <- wt.day.val*abs(val.day.D2-val.zro)
    WMAE.trn <- mean(c(as.matrix(trn.err.D1),as.matrix(trn.err.D2),as.matrix(trn.err.int))) # oos trn wmae
    WMAE.val <- mean(c(as.matrix(val.err.D1),as.matrix(val.err.D2),as.matrix(val.err.int))) # oos tst wmae
    WMAE.zro.all <- mean(c(WMAE.trn,WMAE.val))
    # feature selection and cleaning outliers
    all.feat <- all.iis[,f1:f25]
    all.feat[,3]  <- qlimiter(all.feat[,3],.9999)
    all.feat[,11] <- qlimiter(all.feat[,11],.9999)
    trn.feat <- all.feat[trn.fld,]
    val.feat <- all.feat[val.fld,]
    trn.day.1 <- data.frame(trn.feat,all.trn.iis.ts.cln[trn.fld,],all.trn.oos.D1.clip[trn.fld])
    names(trn.day.1)[dim(trn.day.1)[2]] <- 'D1'
    trn.day.2 <- data.frame(trn.feat,all.trn.iis.ts.cln[trn.fld,],all.trn.oos.D2.clip[trn.fld])
    names(trn.day.2)[dim(trn.day.2)[2]] <- 'D2'
    val.day.1 <- data.frame(val.feat,all.trn.iis.ts.cln[val.fld,],all.trn.oos.D1.clip[val.fld])
    names(val.day.1)[dim(val.day.1)[2]] <- 'D1'
    val.day.2 <- data.frame(val.feat,all.trn.iis.ts.cln[val.fld,],all.trn.oos.D2.clip[val.fld])
    names(val.day.2)[dim(val.day.2)[2]] <- 'D2'
    # Build gbm models, here: use same n.trees for both D1, D2
    gbm.trn.mod.D1 <- gbm(D1 ~ ., data=trn.day.1, distribution="laplace", weights=wt.day.trn,
                          n.trees=n.trees, interaction.depth=interaction.depth,
                          shrinkage=shrinkage, n.minobsinnode=n.minobsinnode)
    gbm.trn.mod.D2 <- gbm(D2 ~ ., data=trn.day.2, distribution="laplace", weights=wt.day.trn,
                          n.trees=n.trees, interaction.depth=interaction.depth,
                          shrinkage=shrinkage, n.minobsinnode=n.minobsinnode)
    # TRN fold predictions
    gbm.trn.pred.1 <- predict(gbm.trn.mod.D1, newdata=trn.day.1[,-dim(trn.day.1)[2]], n.trees=n.trees)
    gbm.trn.pred.2 <- predict(gbm.trn.mod.D2, newdata=trn.day.2[,-dim(trn.day.2)[2]], n.trees=n.trees)
    gbm.trn.act.1 <- all.trn.oos.D1[trn.fld]
    gbm.trn.act.2 <- all.trn.oos.D2[trn.fld]
    #par(mfrow=c(2,1))
    #plot(gbm.trn.pred.1~gbm.trn.act.1)
    #plot(gbm.trn.pred.2~gbm.trn.act.2)
    trn.err.day.1 <- wt.day.trn*abs(gbm.trn.act.1-gbm.trn.pred.1)
    trn.err.day.2 <- wt.day.trn*abs(gbm.trn.act.2-gbm.trn.pred.2)
    trn.err.dly <- data.frame(trn.err.day.1,trn.err.day.2)
    trn.err.int <- trn.err.int # from previous median estimates, wt.int.trn*abs(trn.int.oos-trn.med)
    WMAE.trn <- mean(c(as.matrix(trn.err.dly),as.matrix(trn.err.int)))
    # VAL fold predictions
    gbm.val.pred.1 <- predict(gbm.trn.mod.D1, newdata=val.day.1[,-dim(val.day.1)[2]], n.trees=n.trees)
    gbm.val.pred.2 <- predict(gbm.trn.mod.D2, newdata=val.day.2[,-dim(val.day.2)[2]], n.trees=n.trees)
    gbm.val.act.1 <- all.trn.oos.D1[val.fld]
    gbm.val.act.2 <- all.trn.oos.D2[val.fld]
    #par(mfrow=c(2,1))
    #plot(gbm.val.pred.1~gbm.val.act.1)
    #plot(gbm.val.pred.2~gbm.val.act.2)
    val.err.day.1 <- wt.day.val*abs(gbm.val.act.1-gbm.val.pred.1)
    val.err.day.2 <- wt.day.val*abs(gbm.val.act.2-gbm.val.pred.2)
    val.err.dly <- data.frame(val.err.day.1,val.err.day.2)
    val.err.int <- val.err.int # from previous median estimates, wt.int.trn*abs(trn.int.oos-trn.med)
    WMAE.val <- mean(c(as.matrix(val.err.dly),as.matrix(val.err.int)))
    res[[k]] <- data.frame(k,interaction.depth,n.trees,shrinkage,n.minobsinnode,WMAE.zro.all,WMAE.trn,WMAE.val)
  }
  out[[z]] <- res # foreach already rbinds the per-fold data frames
  stopCluster(cl)
  gc()
}

# aggregate and plot data
{
  out.folds <- do.call(rbind,out)
  mean.res <- aggregate(cbind(WMAE.zro.all,WMAE.trn,WMAE.val)~n.trees,data=out.folds,mean)
  plot(mean.res$WMAE.trn~mean.res$n.trees,type='o',pch=19,col='red')
  lines(mean.res$WMAE.val~mean.res$n.trees,type='o',pch=19,col='blue')
  yr <- range(mean.res$WMAE.zro.all,mean.res$WMAE.trn,mean.res$WMAE.val)
  plot(mean.res$WMAE.zro.all~mean.res$n.trees,type='o',col='blue',pch=19,ylim=yr,
       main='5 fold sweep',ylab='5 FOLD CV WMAE')
  text(mean.res$WMAE.zro.all~mean.res$n.trees,cex=.8,col='black',pos=3,label=round(mean.res$WMAE.zro.all,2))
  lines(mean.res$WMAE.trn~mean.res$n.trees,type='o',pch=19,col='green',ylim=yr)
  text(mean.res$WMAE.trn~mean.res$n.trees,cex=.8,col='black',pos=3,label=round(mean.res$WMAE.trn,2))
  lines(mean.res$WMAE.val~mean.res$n.trees,type='o',pch=19,col='red',ylim=yr)
  text(mean.res$WMAE.val~mean.res$n.trees,cex=.8,col='black',pos=3,label=round(mean.res$WMAE.val,2))
  legend("topright",c("WMAE.zro.all","WMAE.trn","WMAE.val"),col=c('blue','green','red'),pch=19)
  dev.new()
  CV.df <- out.folds[,c(3,6:8)]
  CV.df <- transform(CV.df,ID=seq(dim(CV.df)[1]))
  df.m <- melt(CV.df, id.var=c('ID','n.trees'))
  boxplot(value~variable+n.trees,df.m)
}

############################################################
##########  FINAL all TRN model generate/WMAE
############################################################
n.trees <- 900
interaction.depth <- 1
n.minobsinnode <- 50
shrinkage <- 0.1
all.feat <- all.iis[,f1:f25]
# limit useless or sparse feature outliers
all.feat[,3]  <- qlimiter(all.feat[,3],.9999)
all.feat[,11] <- qlimiter(all.feat[,11],.9999)
trn.feat <- all.feat[1:trn.obs,]
trn.day.1 <- data.frame(trn.feat,all.trn.iis.ts.cln,all.trn.oos.D1.clip)
names(trn.day.1)[dim(trn.day.1)[2]] <- 'D1'
trn.day.2 <- data.frame(trn.feat,all.trn.iis.ts.cln,all.trn.oos.D2.clip)
names(trn.day.2)[dim(trn.day.2)[2]] <- 'D2'
# NOTE: the reassignments below override the data frames above, so the final
# daily models are fit on the 25 features only (no intraday series)
trn.day.1 <- data.frame(trn.feat,all.trn.oos.D1.clip)
names(trn.day.1)[dim(trn.day.1)[2]] <- 'D1'
trn.day.2 <- data.frame(trn.feat,all.trn.oos.D2.clip)
names(trn.day.2)[dim(trn.day.2)[2]] <- 'D2'
wt.day.trn <- trn[,wtdly]
gbm.trn.mod.D1 <- gbm(D1 ~ ., data=trn.day.1, distribution="laplace", weights=wt.day.trn,
                      n.trees=n.trees, shrinkage=0.1, cv.folds=5)
gbm.trn.mod.D2 <- gbm(D2 ~ ., data=trn.day.2, distribution="laplace", weights=wt.day.trn,
                      n.trees=n.trees, shrinkage=0.1, cv.folds=5)

############################################################
##########  OOS TST GENERATE
############################################################
tst.feat <- all.feat[(1+trn.obs):dim(all.feat)[1],]
tst.day.1 <- data.frame(tst.feat,all.tst.iis.ts.cln)
tst.day.2 <- data.frame(tst.feat,all.tst.iis.ts.cln)
# Intraday predictions: training medians repeated for every test row
all.oos.int.pred <- sapply(all.trn.oos.ts,median,na.rm=TRUE)
tst.pred.mtx <- t(matrix(rep(all.oos.int.pred,dim(tst)[1]),
                         length(all.oos.int.pred),dim(tst)[1]))
gbm.tst.pred.1 <- predict(gbm.trn.mod.D1, newdata=tst.day.1, n.trees=n.trees)
gbm.tst.pred.2 <- predict(gbm.trn.mod.D2, newdata=tst.day.2, n.trees=n.trees)
tst.submission.mtx <- cbind(tst.pred.mtx,gbm.tst.pred.1,gbm.tst.pred.2)
tstpred <- as.vector(t(tst.submission.mtx))
submission <- read.csv("E:/Kaggle/Winton/sample_submission_2.csv")
submission$Predicted <- tstpred
write.csv(submission,"E:/Kaggle/Winton/submission1.csv",row.names=FALSE)
subtst <- read.csv("E:/Kaggle/Winton/submission1.csv")