Wednesday, August 22, 2012

The Kaggle Bug

If you have any interest in data mining and machine learning, you might have already caught the Kaggle bug.

I fairly recently got caught up in following the various contests and forums after reading a copy of "Practical Time Series Forecasting," 2nd edition, by Galit Shmueli. What makes the contests great is that they allow any ambitious and creative data scientist or amateur enthusiast to participate and to learn a wealth of new knowledge and tricks from more experienced professionals in the field.

What should make it even more interesting to readers here is that many of the winners of these high-purse contests come from the financial world. Take one of the traders who has personally inspired me, Jaffray Woodriff, manager of the well-known machine-learning-oriented hedge fund Quantitative Investment Management (better known by its acronym, QIM). I recently mentioned to a surprised friend that Mr. Woodriff had also participated in the better-known Netflix prediction contest (having been a member of the third-place team at one point).

In particular, the most recent contest that has many eager followers watching is the $3,000,000 Heritage Provider contest, the Heritage Health Prize Competition, which is an open contest to predict the likelihood of patient hospital admission. What particularly inspired this blog post is a very useful blog from one of the leading contestants, Phil Brierley (a.k.a. Sali Mali, his handle), who, interestingly, has joined the marketmaker team, which is also affiliated with a prediction-related fund. Mr. Brierley has shared tremendously useful insights about practical methods of attacking the problem, all the way from SQL preprocessing and cleaning to intuitive visualization methodologies. I applaud him for so generously sharing his insights with the rest of the predictive analytics community. Although he hasn't posted in a while, his journal of thoughts is still highly useful.

Anyone looking for a grubstake could certainly use three million to get started =)

Below are the specific links mentioned…

http://anotherdataminingblog.blogspot.com/
http://www.heritagehealthprize.com/c/hhp
http://www.kaggle.com/

...and a newer one from Stack Exchange
http://blog.stackoverflow.com/2012/08/stack-exchange-machine-learning-contest/?cb=1