Friday, October 26, 2012

Book Review: R for Business Analytics, A Ohri


      I've added a recently released book to my list of recommendations (in the Amazon carousel to the right), having reviewed a copy provided to me by Springer. The book is R for Business Analytics, authored by A Ohri. Mr. Ohri gives a brief background of his own journey as a business analytics consultant and describes how R complemented his work at very low cost (the time to learn the software) and with very large benefits. At the outset, he emphasizes that the book is not geared towards statisticians, but towards practicing business analytics professionals, MBA students, and pragmatically oriented R newcomers and professionals alike. There is also a focus on GUI-oriented tools that help users quickly get up to speed and apply business analysis techniques (Rattle, for example, is covered as an alternative to Weka, which has been covered here previously). In addition, he includes numerous interviews with representatives of well-known companies that have either successfully integrated R into their own development flow (including JMP/SAS, Google, and Oracle) or found that large groups of their customers use R to augment their existing suite of tools. The good news is that many of the large companies do not view R as a threat, but as a beneficial complement to their own software capabilities.

     After helping R users navigate the dense forest of GUI choices (in order to get R up and running), Mr. Ohri walks readers step by step (with detailed screen captures) through running R on platforms ranging from the simple to the more advanced (e.g. cloud services such as Amazon EC2) in order to gather, explore, and process data, with detailed illustrations of how to use R's powerful graphing capabilities on the back end.

     The book has something for both beginning R users (who may be experienced in data science but want to start applying R in their field) and experienced R users alike (many, like myself, will find it useful to have broad coverage of the myriad packages and applications available, complemented by quickly accessible tutorial-style illustrations). In summary, the book offers extremely broad coverage of the many R packages that can be used for business data analysis, with a very hands-on approach that helps new users quickly get up and running with R's powerful capabilities. The only potential downside is that covering so many topics comes at the cost of some depth and mathematical rigor (leaving the door open for readers to pursue more specialized R texts).

Wednesday, August 22, 2012

The Kaggle Bug

If you have any interest in data mining and machine learning, you might have already caught the Kaggle bug.

I myself fairly recently got caught up in following the various contests and forums after reading a copy of "Practical Time Series Forecasting" (2nd edition) by Galit Shmueli. What makes the contests great is that they allow any ambitious and creative data scientist or amateur enthusiast to participate and pick up a wealth of new knowledge and tricks from more experienced professionals in the field.

What should make it even more interesting to readers here is that many of the winners of these high-purse contests come from the financial world. Take one of the traders I personally find inspirational, Jaffray Woodriff, manager of the well-known machine-learning-oriented hedge fund Quantitative Investment Management (better known by its acronym, QIM). I recently mentioned to a surprised friend that Mr. Woodriff had also participated in the better-known Netflix prediction contest (having been a member of the third-place team at one point).

In particular, the most recent contest that has many eager followers watching is the $3,000,000 Heritage Health Prize Competition (sponsored by Heritage Provider Network), an open contest to predict the likelihood of patient hospital admission. What particularly inspired this blog post is a very useful blog from one of the leading contestants, Phil Brierley (a.k.a. Sali Mali), who has, interestingly, joined with the marketmaker team, also affiliated with a prediction-related fund. Mr. Brierley has shared tremendously useful insights about practical methods of attacking the problem, all the way from SQL preprocessing and cleaning to intuitive visualization methodologies. I applaud him for generously sharing his insights with the rest of the predictive analytics community. Although he hasn't posted in a while, his journal of thoughts is still highly useful.

Anyone looking for a grubstake could certainly use three million to get started =)

Below are the specific links mentioned…

http://anotherdataminingblog.blogspot.com/
http://www.heritagehealthprize.com/c/hhp
http://www.kaggle.com/

...and a newer one from Stack Exchange
 http://blog.stackoverflow.com/2012/08/stack-exchange-machine-learning-contest/?cb=1

Wednesday, May 30, 2012

The Facebook Doomsday Watch

I've been following the circus of Facebook commentators and bystanders pointing to its horrific, failed IPO launch and seemingly inevitable crash to zero. While my focus here isn't really on fundamentals or basic TA, I do want to share some subjective thoughts on the matter as well as illustrate one catchy graphic I put together.

Fig 1. FB IPO drawdown (with potential trajectory) vs. EBAY historical IPO opening (a.k.a. the U-TURN pattern).

Having lived through and experienced the many ballyhooed IPO juggernauts of the past, I can't help but think back to how overvalued stocks like Ebay and Google 'felt' to me at the outset. We all know that we can't directly compare such small samples in any statistical manner with much conviction, but that qualitative sense in me feels that Facebook is one of those Wall Street darlings we rarely encounter and wish we could go back and buy at a discount. Sure, there were the megaflops (Blackstone, The Globe, etc.) that never quite revived, but then again consider the Lynch method of buying (...are the masses using it?), the massive institutional support available, the number of shorts that are sure to pile on, and, more importantly, the nagging fact that it is consistently one of the highest-viewed websites of all (typically above or next to Google and Baidu -- don't believe me, check Alexa). Ok, but enough of the soapbox on those biased musings -- one quantitative comparison to consider in the chart above is how Ebay fared at the outset, when it was also lambasted as a failure throughout Internet chat-rooms and by various media pundits. What I have graphed is a drawdown for both (with adjusted Ebay quotes) relative to the first day's opening bid (a minimal sketch of this calculation follows the list below). The last points on Facebook are potential drawdown (relative to IPO opening price) trajectories at

28.84 -31.41% (yesterday's close)
26.5 -36.98%
24 -42.93%
23 -45.30%
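
For anyone who wants to reproduce this type of chart, below is a minimal R sketch of the drawdown calculation. It assumes quantmod is installed, measures each close against the first bar's opening price, and uses an illustrative ticker and start date (whether Yahoo still serves that history is an assumption).

library(quantmod)

# Percent drawdown of each close relative to the first day's opening price
dd_vs_open <- function(prices) {
  first_open <- as.numeric(Op(prices)[1])
  100 * (Cl(prices) - first_open) / first_open
}

# Illustrative pull of Facebook quotes from its first trading day onward
fb <- getSymbols("FB", src = "yahoo", from = "2012-05-18", auto.assign = FALSE)
plot(dd_vs_open(fb), main = "FB drawdown vs. first-day open (%)")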

So, I leave you with that as food for thought.  I don't often discuss my thoughts on semi-qualitative opportunities, but then again, we don't get these types of juggernaut long-term opportunities all that often.*  As always, please make your own informed decisions, and I'll try to get back on topic...  one of these days.

* Two other counterpoints (among many excluded) that I'm sure some more astute observers will note:
1) The Ebay IPO U-turn occurred during the mega bull run of the dot-com mania.
2) If the Greece fiasco (or insert any suitable catalyst here) escalates into a fat-tail flight-to-safety avalanche (which, as I've pointed out, have been exceedingly abundant of late), then keep in mind that Facebook and any other equity leaders should be expected to plunge together; hence the emphasis on a LONG-term portfolio component opportunity.


Tuesday, February 28, 2012

Expanding Visualization of published system edges (R)

I happened to be looking over a revised text by a systems author I follow. I will be a bit vague about specifics, as the system itself is based on well-known ideas, but I'll leave the reader to research related systems. The basic message of this post is that I often make an effort to look at system-related features from different viewpoints that are not always explored in the texts.

Fig 1. Bar graph illustration showing a 0.48% average weekly gain given conditional system parameters, versus arbitrary trades giving a 0.2% average return per week over the same 14 yr. period.

For example, the system in question is based upon buying pullbacks in a certain equity series and holding for a week.  In the book, a bar graph illustrates a useful edge of about 0.48%/trade vs. simply buying and holding for 0.2%/trade.  Although the edge is useful and demonstrated well in the bar graph, it can also be useful to look at the system's performance from several different perspectives.

As an example, we might wonder how the system unfolded over time.  To look at this, we can plot a time series representation of the system's equity curve (assuming 100% compounding, no slippage, and no fractional sizing).  The curve is shown against a straw-broom plot of 100 Monte Carlo simulation paths built from the true underlying data, composed of randomly and uniformly selected returns over the same period (a rough sketch of the mechanics follows the figure below).


Figure 2. Plot of edge-based equity path vs. simulated Monte Carlo straw-broom plots of randomly selected series based upon the true underlying data.
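
As a rough illustration of the mechanics only (not the author's actual system or data), here is a minimal R sketch; the weekly returns and the pullback signal are placeholders, and the straw-broom paths are built by uniform IID resampling of the same returns:

# Placeholder data and signal (hypothetical; substitute the real series and rule)
set.seed(42)
weekly_ret      <- rnorm(14 * 52, mean = 0.002, sd = 0.02)  # ~14 yrs of weekly returns
pullback_signal <- runif(length(weekly_ret)) < 0.25         # weeks the pullback rule fires

sys_ret <- weekly_ret[pullback_signal]      # returns of the conditional trades
c(mean(sys_ret), mean(weekly_ret))          # per-trade edge vs. unconditional average

equity_sys <- cumprod(1 + sys_ret)          # equity curve: 100% compounding, no slippage

# Straw-broom: 100 Monte Carlo paths of uniformly resampled (IID) returns
mc_paths <- replicate(100, cumprod(1 + sample(weekly_ret, length(sys_ret), replace = TRUE)))

matplot(mc_paths, type = "l", lty = 1, col = "grey70",
        xlab = "trade number", ylab = "compounded equity",
        main = "System equity vs. straw-broom resampled paths")
lines(equity_sys, col = "red", lwd = 2)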

Looking at the 'unwinding' of the actual system's 14yr. time series path, we can make a few observations.

1) The edge, in terms of terminal wealth only, far outperforms several randomly simulated data paths built from the actual instrument.
2) Unfortunately, the edge also has very wild swings and variation (resulting in a very large drawdown).

Had we blindly selected the system based upon the bar graph alone, it's very possible that we could have entered at the worst possible time (just prior to the drawdown).  This also illustrates an issue I personally have with using simple Monte Carlo analysis (with IID assumptions) as a proxy for the underlying data: the auto-correlation properties have been filtered out, making the system-based edge appear much better in comparison.  I have spent a lot of time thinking about ways to deal with this issue, but that's a discussion for another day.  It's not necessarily a bad result at all; rather, it gives us some features (persistence and a superior edge per trade) that we can use as a springboard for further optimization.  For example, we might think about adding a conditional filter, based upon underlying features that coincided with that period, to mitigate the large drawdown.

Data was plotted using R's ggplot2 package. Although I think the plotting tool is excellent, I find the processing time somewhat slow.



Tuesday, January 31, 2012

MINE: Maximal Information-based NonParametric Exploration


There has been a lot of buzz in the blogosphere as well as in the science community about a new family of algorithms that is able to find non-linear relationships over extremely large fields of data. What makes it particularly useful is that the measure(s) it uses are based upon mutual information rather than standard Pearson correlation-type measures, which do not capture non-linear relationships well.

The (Java-based) software can be downloaded here: http://www.exploredata.net/. In addition, the software can be run directly from R.

 Fig 1. Typical non-linear relationship exemplified by intermarket relationships.

The algorithm seems promising, as it could allow us to mine very large data sets (such as financial intermarket relationships) and find potentially meaningful non-linear relationships. If we were to use the typical Pearson correlation measures, such relationships would show very small R^2 values and thus be discarded as non-significant.

I decided to take it for a spin on a non-linear example taken from M. Katsanos' book on intermarket trading strategies (p. 25, fig 2.3). In figure 1, we can clearly see that the relationship between the markets is non-linear, so the traditional linear fit returns a low R^2 value of .143 (red line); a loess fit is also shown in blue. After running the same data through MINE, the results returned in a .csv file were...


MIC (strength)    MIC-p^2 (nonlinearity)
0.16691002        0.62445                  7.129283    -0.3777441


The MIC (Maximal Information Coefficient) of .167 was not much greater than the R^2 measure of .143 above. However, one of the points made in the paper is that as the signal becomes more obscured by noise, the MIC will degrade comparably.

The next step would be to find some type of fit to minimize the noise component and make updated comparisons.
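
As a rough sketch of what that next step might look like (my own assumption about the approach, shown on placeholder data rather than the intermarket series above): fit a smooth curve, treat the fitted values as the de-noised signal, and write both versions out so MINE can be re-run and the MIC values compared.

# Hypothetical illustration: de-noise y with a loess fit before re-running MINE
set.seed(1)
x <- runif(500)
y <- sin(6 * x) + rnorm(500, sd = 0.5)      # noisy non-linear relationship

y_denoised <- predict(loess(y ~ x, span = 0.3))

# Write both versions out; MINE can then be run on each file and the MIC compared
write.csv(data.frame(x, y), "noisy.csv", row.names = FALSE)
write.csv(data.frame(x, y = y_denoised), "denoised.csv", row.names = FALSE)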

To better illustrate how useful it might be, I am attaching a screenshot of the reference material here.



Figure 2. Reproduced from Fig 6. 'www.sciencemag.org/cgi/content/full/334/6062/1518/DC1'

Notice that the MIC score outperforms other traditional methods on many non-linear structural relationships.

Here is the full R code to repeat the basic experiment.
###############################################
# MINE example from intelligenttradingtech.blogspot.com 1/31/2012

library(quantmod)
library(ggplot2)

# Pull daily quotes for the S&P 500 and Nikkei 225
getSymbols('^GSPC',src='yahoo',from='1992-01-07',to='2007-12-31')
getSymbols('^N225',src='yahoo',from='1992-01-07',to='2007-12-31')

# Merge the adjusted close columns on common dates
sym_frame<-merge(GSPC[,6],N225[,6],all=FALSE)
names(sym_frame)<-c('GSPC','N225')

# Scatter plot of the two series
p<-qplot(N225, GSPC, data=data.frame(coredata(sym_frame)),
geom=c('point'), xlab='NIKKEI',ylab='S&P_500',main='S&P500 vs NIKKEI 1992-2007')

# Linear fit (red) overlaid with a loess fit (blue)
fit<-lm(GSPC~N225, data=data.frame(coredata(sym_frame)))
summary(fit)
fitParam<-coef(fit)

p+geom_abline(intercept=fitParam[1], slope=fitParam[2],colour='red',size=2)+geom_smooth(method='loess',size=2,colour='blue')

### MINE results
library("rJava")
setwd('/home/self/Desktop/MINE/')   # directory containing MINE.r and MINE.jar

# Write the merged data to csv and run MINE on all pairs of columns
write.csv(data.frame(coredata(sym_frame)),file="GSPC_N225.csv",row.names=FALSE)
source("MINE.r")
MINE("GSPC_N225.csv","all.pairs")

##########################################################

The referenced paper is "Detecting Novel Associations in Large Data Sets," David N. Reshef et al., Science 334, 1518 (2011).


As an aside, I've been hooked on a TV series called "Numb3rs," playing on Amazon Prime. It's about an FBI agent who gets assistance from his genius brother, a professor of mathematics at a prestigious university. So far, they've discussed Markov chains, Bayesian statistics, data mining, econometrics, heat maps, and a host of other similar concepts applied to forensics.

Sunday, January 1, 2012

Free Online Stanford Machine Learning Course: Andrew Ng. Post Mortem.

Happy New Year to all the viewers of this blog and just a short reminder that the course will be available again this January.
http://www.ml-class.org/course/auth/welcome

Having audited the course, I would highly recommend it to anyone interested in a very hands-on learning experience covering many of the topics I've posted about (and many other areas, such as how to deal with over- and under-fitting). Kudos to Dr. Ng for a fantastic, engaging, and informative course.

As an added incentive, users will become familiar with many vectorized approaches to programming (via Octave), which are very useful in languages such as Python and R.
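
To give a flavor of what "vectorized" means in practice, here is a toy R example of my own (not from the course), comparing a loop-based and a vectorized computation of simple returns:

prices <- c(100, 101.5, 99.8, 102.3, 103.1)

# Loop version: one element at a time
ret_loop <- numeric(length(prices) - 1)
for (i in 2:length(prices)) {
  ret_loop[i - 1] <- prices[i] / prices[i - 1] - 1
}

# Vectorized version: a single expression over whole vectors
ret_vec <- diff(prices) / head(prices, -1)

all.equal(ret_loop, ret_vec)   # TRUE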