Tuesday, January 31, 2012

MINE: Maximal Information-based NonParametric Exploration


There was a lot of buzz in the blogosphere as well as the science community about a new family of algorithms that are able to find non-linear relationships over extremely large fields of data. What makes it particularly useful is that the measure(s) it uses are based upon mutual information rather than standard pearson's correlation type measures, which do not capture non-linear relationships well.

The (java based) software can be downloaded here: http://www.exploredata.net/ In addition, there is the capability to directly run the software from R.

 Fig 1. Typical non-linear relationship exemplified by intermarket relationships.

The algorithm seems promising as it would allow us to possibly mine very large data sets (such as financial intermarket relationships) and find potentially meaningful non-linear relationships. If we were to use the typical pearson's correlation measures, such relationships would show very small R^2 values, and thus be discarded as non significant relationships.

I decided to take it for a spin on an example of a non-linear example, taken from M. Katsanos' book on intermarket trading strategies (p 25. fig 2.3). In figure 1, we can clearly see that the relationship between markets is non-linear, and thus the traditional linear fit returns a low R^2 value of .143 (red line), a loess fit is also shown in blue. After running the same data through MINE, the results returned in a .csv file, were...


MIC           (strength)    MIC-p^2          (nonlinearity)
0.16691002    0.62445            7.129283    -0.3777441


The MIC (Mutual Information Coefficient) of .167 was not much greater than theR^2 measure of .143 above. However, one of the mentions in the paper was that as the signal becomes more obscured by noise, the MIC will degrade comparably.

The next step would be too find some type of fit to minimize the noise component and make updated comparisons.

In order to show a better illustration of how useful it might be, I am attaching a screenshot of the reference material here.



Figure 2. Reproduced from Fig 6. 'www.sciencemag.org/cgi/content/full/334/6062/1518/DC1'

Notice the MIC Score measure outperforms other traditional methods on many non-linear structural relationships.

Here is the full R-Code to repeat the basic experiment.
###############################################
# MINE example from intelligenttradingtech.blogspot.com 1/31/2012

library(quantmod)
library(ggplot2)

getSymbols('^GSPC',src='yahoo',from='1992-01-07',to='2007-12-31')
getSymbols('^N225',src='yahoo',from='1992-01-07',to='2007-12-31')

sym_frame<-merge(GSPC[,6],N225[,6],all=FALSE)
names(sym_frame)<-c('GSPC','N225')

p<-qplot(N225, GSPC, data=data.frame(coredata(sym_frame)),
geom=c('point'), xlab='NIKKEI',ylab='S&P_500',main='S&P500 vs NIKKEI 1992-2007')

fit<-lm(GSPC~ N225, data=data.frame(coredata(sym_frame)))
summary(fit)
fitParam<-coef(fit)

p+geom_abline(intercept=fitParam[1], slope=fitParam[2],colour='red',size=2)+geom_smooth(method='loess',size=2,colour='blue')

### MINE results
library("rJava")
setwd('/home/self/Desktop/MINE/')

write.csv(data.frame(coredata(sym_frame)),file="GSPC_N225.csv",row.names=FALSE)
source("MINE.r")
MINE("GSPC_N225.csv","all.pairs")

##########################################################

The referenced paper is, "Detecting Novel Associations in Large Data Sets"
David N. Reshef, et al.
Science 334, 1518 (2011)


As an aside, I've been hooked on a sitcom series called, "Numb3rs," playing on Amazon Prime. It's about an FBI agent who gets assistance from his genius brother, a professor of Mathematics at a prestigious University. So far, they've discussed markov chains, bayesian statistics, data mining, econometrics, heat maps, and a host of other similar concepts applied to forensics.

11 comments:

  1. Interesting approach. I doubt you've had time to fully explore, but have you found it useful in a financial setting, with MIC closer to 1.0?

    ReplyDelete
  2. tr8dr,
    Thanks for stopping by. I haven't done too much exploration at all, outside of getting up and running. At some point, I plan to run a much larger input field, but suspect the noise issue will be a problem (esp. for financial data), so I suspect some type of filtering will be required to get much higher MIC levels. In the example illustration, it was already necessary to choose something like price data, rather than changes (which would have a much more noisy relationship), although that's much less useful.

    ReplyDelete
  3. Hey there, that's very interesting. Tried running through the R example but the last command MINE(...) returns an error: Error in .jfindClass(as.character(class)) : class not found

    Have you seen that before? Thanks

    ReplyDelete
  4. Actually, found the mistake (did not place the jar file in the same directory). Nevermind

    ReplyDelete
  5. Presumably the MIC score is relative, and not directly comparable to r^2. Shouldn't you be looking at a series of pairwise comparisons and using the MIC to order them? I think that would be more in the spirit of what the statistic was intended for.

    ReplyDelete
  6. Harlan,

    Although it's not directly comparable, both are utilised to measure strength of structural relationships between series. I only wanted to illustrate a replicable example of one such comparison with financial series as a starting point for readers to experiment with. But, yes, I agree, ultimately it is intended for a very large field of data and ranking the results would (hopefully) narrow down the search space of where to start looking for potentially strong relationships.

    Also, for viewers that want to start experimenting with several large datasets, there are a few examples included on the MINE page (baseball and DNA attributes, for example).

    IT

    ReplyDelete
  7. Interesting. However, while browsing for more information I found a paper (
    http://www-stat.stanford.edu/~tibs/reshef/comment.pdf) arguing that MINE has serious problems in noisy environments, which could cause problems for finance applications.
    The authors point to another new distance correlation measure, ‘Brownian distance covariance’ created by SzĂ©kely and Rizzo, which might be interesting to look at.

    ReplyDelete
  8. Thanks Hugin,

    I tried to emphasise the noise issue in the write-up and had seen that paper before. But as I mentioned, it might be possible to simply clean up the signal as a pre-processing method to improve performance.

    I don't recall seeing that other measure though; I'll take a look.

    IT

    ReplyDelete
  9. Noise is probably always a problem and it also depends how you define noise vs signal. IMO, even with the alleged drawbacks MINE/MIC is interesting enough to look at, thank you for bringing it up.

    ReplyDelete
  10. It would be traditional to do this type of comparison on returns, not prices. Also, the R code compares price to price through time and loses all the time-dependence. Don't you think that the relationship of two prices in 1992 might be different from the relationship of two prices in 2012? I would suggest that you are likely to get much better results if you keep the time-dependence, as you would with a linear model on time series data.

    ReplyDelete
  11. Hi Brian,

    Yes, I agree, it is appropriate and traditional to do the comparison on returns. I purposely used raw prices mainly to illustrate the effect of doing a type of classical linear to non-linear (MINE) regression comparison with some type of strong signal to noise dependency in both cases (i.e. illustrating the strength of the non-linear based approach that MINE should excel at).

    That would not be the case in most return based inter-market dependencies. Had I used raw returns, I would expect both to return very poor fits.

    Second, as I pointed out, the example was directly taken from M. Katsanos' book on 'intermarket trading strategies,' as a fairly simple illustration of intermarket relationships.

    Also, the dependency here is not on time-- it is the price relationship only between two distinct market series. If we wanted to model the direct effect of time, we could add it as an additional factor, though that is not common. A better and more useful approach is to simply model one series vs. the n bar lagged version of the other (whether return or price based) or model correlations across time (which we know aren't very stable).

    But again, the sole purpose of the code was only for a simple concrete illustration of running MINE with R. Your comments and suggestions are well taken and appreciated.


    Thank you,
    IT

    ReplyDelete