## Friday, October 14, 2011

### Free auditing of Stanford AI and Machine Learning Courses w/Peter Norvig

Just wanted to notify readers of a few great courses being offered free for auditing and/or participation, taught by well-known industry experts, including Peter Norvig, co-author of the classic AI text 'Artificial Intelligence: A Modern Approach,' and Prof. Andrew Ng.

http://www.ai-class.com/
http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2011/10/14/BUFR1LH9JR.DTL

The notice is a bit late, but they are still accepting registrations.

## Thursday, October 6, 2011

### Spatio-Temporal Data Mining: 2

There are many visual methods used to identify patterns in space and time. I've discussed some in prior threads and will briefly show a few others here. One of the most difficult questions I hear from others regarding Markov-type approaches is how to identify the states to be processed.

It is a similar problem to one encountered using simple linear factor analysis. Unfortunately, there is no simple answer; moreover, because data streams are becoming so vast, it becomes almost impossible to enumerate all possible state sets. Visual mining techniques can be incredibly helpful in narrowing down that space, as well as in feature reduction. I often iterate between these types of visualizations and unsupervised classification learners to converge on useful state identifications.
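As a starting point for the state-identification problem, one simple approach is equal-probability (quantile) binning of a continuous series. The sketch below is my own illustration on mock data, not the method used for the plots in this post:

```r
# Sketch: discretize a continuous series into a small set of states
# via equal-probability (quantile) bins -- one simple starting point
# for state identification (mock data, hypothetical state labels).
set.seed(1)
x <- cumsum(rnorm(500))                       # example series
breaks <- quantile(x, probs = seq(0, 1, by = 0.25))
states <- cut(x, breaks = breaks, labels = c("A", "B", "C", "D"),
              include.lowest = TRUE)
table(states)                                 # roughly equal counts per state
```

The quantile breaks guarantee each state is visited about equally often, which keeps downstream transition tables well populated.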

Fig 1. Spatio-Temporal State plot

Figure 1 gives an idea of visualizing states with respect to time. But such knowledge in isolation isn't of much use; we are more interested in looking for Bayesian-type relationships that give the transition probabilities between states linked in time.

Fig 2. Fluctuation Plot

Several visual methods exist to capture these relationships. One common plot, used in language processing and information theory, is the fluctuation plot. The above plot was built using the same state data as the first graph. It is often used to determine conditional relationships between symbols such as alphabet tokens: the size of each box is directly proportional to the weight of the transition probability between the row and column states in tabular data. As an example, think of the letters 'yzy' being more commonly followed by 'g' (as in syzygy) than by any other token; thus, one would expect to quickly spot a larger box at the intersection of the 'yzy' row n-gram and the 'g' column token.

Both plots were produced in R. ggfluctuation() is a plot command from ggplot2. I am currently investigating how much easier and faster it might be to produce such visualizations in tools like Protovis and Processing. I've been especially inspired by reading some of Nathan Yau's excellent visualization work in his book, 'Visualize This.' I included it in the link to the right for interested readers.
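The underlying computation a fluctuation plot visualizes is just a transition table. A minimal base-R sketch on mock symbol data (state labels here are hypothetical; note ggfluctuation() has since been deprecated in newer ggplot2 versions):

```r
# Sketch: build the state-transition table that a fluctuation plot
# visualizes, using base R only (mock data, hypothetical labels).
set.seed(2)
s <- sample(c("A", "B", "C"), 1000, replace = TRUE,
            prob = c(0.5, 0.3, 0.2))
from <- s[-length(s)]                  # state at time t
to   <- s[-1]                          # state at time t+1
tt   <- table(from, to)                # joint counts of transitions
round(100 * prop.table(tt, margin = 1), 1)  # row-conditional transition %
# In a fluctuation plot each cell of tt becomes a box whose area is
# proportional to the count; base symbols() or ggplot2 geom_tile()
# can render the same table.
```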

## Friday, September 23, 2011

### Arc Diagram and spatiotemporal data mining visualization

I won't spend too much time discussing this fascinating topic other than to say it relates very much to prior discussions about pattern discovery via visual data mining (see lexical dispersion plots for example).  I happened across an interesting visualization method called the Arc Diagram, developed by Martin Wattenberg. Working for data visualization groups at IBM and later Google, he developed some interesting visual representations of spatiotemporal data.

Fig 1. Arc Diagram and legend with example of discretized pattern archetype.

The resulting plot generates some fascinating temporal signatures, similar to what one might see in phase-space portraits from chaos theory. They have also frequently been used to look for spatiotemporal signatures in music. One might discern a type of underlying order or visual signature of complexity, as well as recurring patterns, in sequential objects ranging from text-based lyrical information to sheet-music notes.

Figure 1 shows an example of how one might utilize this tool for temporal pattern discovery in time series. A weekly series from SPY has been discretized into alphabet tokens, based upon the bin ranges in the included legend. The small chart in the example decodes an archetypal pattern for the sequence ECDCECCD into a time series representation of the 8-week symbol sequence. The following interactive Java tool from another blogger, Neoformix, was then used to translate the data into an Arc Diagram: http://www.neoformix.com/Projects/DocumentArcDiagrams/index.html . Reading from top to bottom, one can look for recurring and related patterns repeated over time; certain behavior might warrant further investigation.
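The discretization step can be sketched in a few lines of R. The bin breaks below are hypothetical placeholders; the post's actual legend defines its own ranges:

```r
# Sketch: discretize a weekly return series into alphabet tokens,
# in the spirit of the SPY example above. Breaks are hypothetical;
# mock returns stand in for the actual SPY data.
set.seed(3)
rtn <- rnorm(52, mean = 0.001, sd = 0.02)       # mock weekly returns
breaks <- c(-Inf, -0.02, -0.005, 0.005, 0.02, Inf)
tokens <- cut(rtn, breaks = breaks, labels = c("A", "B", "C", "D", "E"))
paste(tokens, collapse = "")   # symbol stream to paste into the Arc Diagram tool
```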

You can copy the following data stream into the tool to toy around and get a feel for the possibilities of visual pattern discovery.* I won't go into much more detail about using it, other than to say it appears to be a very useful tool for temporal pattern discovery.

Please see the following for more ideas on arc diagrams and musical signatures:
http://www.research.ibm.com/visual/papers/arc-diagrams.pdf

http://turbulence.org/Works/song/mono.html

Blog mentioned:
http://www.neoformix.com/

* Not sure how to attach .xls file here, but if anyone wants a copy of the .xls file, you can send me an email and I'll try to get it out to you.  Otherwise, you can simply grab a song lyric off the web to play with the tool.

## Thursday, August 4, 2011

### Aug 4, 2011 "plunge" headlines are in the air tonight

Today's financial headlines are littered with the word 'plunge.'  Considering today's (cl-cl) drop on the S&P500 was just about -5%, I don't know that I would exactly call that a plunge.

Fig 1. Historical ts plot of S&P500 returns <= -5%

The following R code produced a time series plot of historical occasions where this occurred.

###################################################

library(quantmod)

getSymbols("^GSPC",from="1950-01-01",to="2012-01-01")
adj<-GSPC$GSPC.Adjusted
rtn<-(adj/lag(adj,1)-1)[2:length(adj)]
r05<-rtn[rtn<= -.05]
plot(sort(r05),type='o',main='S&P500 1950-present returns <= -5%')

###################################################

Although the frequency of such occurrences is arguably rare, the 1987 drop is much more worthy of the one-day label 'plunge.' One other disturbing observation in the data, however, is the large temporal clustering of occurrences in the recent 2008 region. Now that's behavior to be concerned about (not to mention the revised flash crash data points).

Fig 2. Filtered 1-day cl-cl returns <= -5%, sorted by date

## Thursday, July 28, 2011

### Pattern Recognition: forward Boxplot Trajectories using R

Although the following discussion can apply to the Quantitative Candlestick Pattern Recognition series, it addresses the same issue as any basic conditional-type system -- how and when to exit. The following is one way to visualize and think about it, and is by no means optimal.

Fig 1. Posterior Boxplot Trajectory

Often we attempt to find some set of prior input patterns that lead to profitable posterior outcomes. However, in most of the available examples, we are typically given only heuristics and rules of thumb on where to exit. This might make sense, since no one can accurately predict where to exit. However, with knowledge of past samples, we can get some idea of where a good exit target might be, given prior knowledge of the forward trajectories. I dubbed the name 'boxplot trajectory' here, as I think it's a useful way to visualize a group of many possible outcome trajectories for further analysis.

In this example, a set of daily price-based patterns was analyzed via a proprietary program I wrote in R, which resulted in an input pattern yielding a set of 52 samples that met my conditional criteria. Fig 1 illustrates a way to look at the trajectory outcomes based upon one of the profitable patterns in the conditional criteria.

The bottom graph is simply the plot of the median of each data point in the trajectory. We often try to imagine the best way to exit without foreknowledge of the future (and with somewhat less rule-of-thumb-based criteria).

Fig 2. Trajectory tree.

One approach would be to use some type of exit rule based upon the statistical median of each sequential point's range. Knowing that half of the vertices occur above and half below the median, we should expect to hit at least half of the targets at or above the median. Given that the 3rd point has the highest median, it makes sense to exit there rather than wait for a greater gain further out (which has an even lower median). So if we take as a target the median value of the 3rd point, we achieve an average, fixed target of 1.59% on 27/52 of the total samples. Of the remaining samples, we may now wish to exit on the 11th bar (or earlier, if the same target is hit earlier) at a target of .556%, which is achieved on 13/52 of the remaining samples. This leaves only the last bar, for which we simply use the average return as the weighted return value for that target, in this case -1.74% for the remaining 12/52 samples. Notice we will always have the worst contenders put off until the end.

The expectation yields E(rtn) = (27/52)(.0159) + (13/52)(.0056) + (12/52)(-.017) = .0057, eking out a small average gain of +.57%. Compounded, this gives (1+.0159)^27 * (1+.0056)^13 * (1-.017)^12 ~ 34% return for 52 trades, each less than 3 days in length. The hit rate (as a secondary observation) is 77% in this case. The approach is particularly appealing for a high-frequency strategy with very low commissions. Notice it's by no means comprehensive (and yes, I've only shown in-sample results here), but rather a novel way to think about exit strategies.
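The exit-target arithmetic above can be checked in a few lines of R, using the post's own target figures:

```r
# Check the exit-target arithmetic from the boxplot-trajectory example,
# using the per-target returns and sample splits quoted in the post.
n <- 52
e <- (27/n)*0.0159 + (13/n)*0.0056 + (12/n)*(-0.017)   # per-trade expectation
comp <- (1 + 0.0159)^27 * (1 + 0.0056)^13 * (1 - 0.017)^12  # compounded over 52 trades
hit <- (27 + 13)/n                                      # fraction of winning exits
round(e, 4)         # about 0.0057
round(comp - 1, 2)  # about 0.34
```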
## Tuesday, May 17, 2011

### Simulating Win/Loss streaks with R rle function

The following script allows you to simulate sample runs of Win, Loss, and Breakeven streaks based on a random distribution, using the run length encoding function, rle, in R. Associated probabilities are entered as a vector argument to the sample function. You can view the actual sequence of trials (and consequent streaks) by looking at the trades result. maxrun returns a vector of the maximum streak length for Wins, Losses, and Breakevens in each sample run. And lastly, the prop table gives a table of the proportion of run transition pairs from a losing streak of length n to streaks of all alternate lengths.

Example output (max run length of losses was 8 here):

100*prop.table(tt)
    lt.2
lt.1      1      2     3     4     5     6     7     8
   1 41.758 14.298 5.334 1.662 0.875 0.131 0.000 0.044
   2 14.692  4.897 1.924 0.787 0.394 0.087 0.131 0.000
   3  4.985  2.405 0.525 0.350 0.000 0.000 0.044 0.000
   4  1.662  0.875 0.306 0.087 0.000 0.000 0.000 0.000
   5  0.831  0.219 0.175 0.000 0.000 0.044 0.000 0.000
   6  0.087  0.131 0.044 0.000 0.000 0.000 0.000 0.000
   7  0.087  0.087 0.000 0.000 0.000 0.000 0.000 0.000
   8  0.044  0.000 0.000 0.000 0.000 0.000 0.000 0.000

maxrun
 B  L  W
 3  8 17

-----------------------------------------------------------------------------------------

#generate simulations of win/loss streaks using the rle function
trades<-sample(c("W","L","B"),10000,prob=c(.6,.35,.05),replace=TRUE)
traderuns<-rle(trades)
tr.val<-traderuns$values
tr.len<-traderuns$lengths
maxrun<-tapply(tr.len,tr.val,max)

lt<-tr.len[which(tr.val=='L')]
lt.1<-lt[1:(length(lt)-1)]
lt.2<-lt[2:(length(lt))]

#simple table of losing trade run streak(n) frequencies
table(lt)

#generate joint ensemble table streak(n) vs streak(n+1)
tt<-table(lt.1,lt.2)
#convert to proportions
options(digits=2)
100*prop.table(tt)
maxrun

## Tuesday, May 10, 2011

### High Low Clustering on intraday high frequency sampled data

Nothing unusually exciting in this post, but I happened to be engaged in some particle-based methods recently and made some simple visual observations as I was setting up part of the sampling environment in R. I am also using RKWard and Ubuntu, so I'm gathering everything from the current environment (including graphics).

Fig 1. Parallel plot of half-hour samples of high and low intraday data points vs time (maxima are purple dots, minima are red).

Fig 2. Cumulative count of high/low events per interval (blue = total high and low).

The plot illustrates sampled intraday data at half hour increments.
The highs and lows of each sample interval are overlaid using purple to denote an intraday high and red to denote an intraday low.
Interesting points of observation are--

1) The high and low samples tend to be clustered at open, midday, and close.
2) High and low events do not appear to be uniformly and randomly distributed over time.
This kind of data processing is useful towards generating, exploring, and evaluating pattern based setups.
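A minimal sketch of this kind of processing on mock data: find which intraday interval contains each day's high and low, then tabulate counts across days. The data here is a simulated random walk, not actual market data; interestingly, even a pure random walk tends to place its extremes near the first and last bins:

```r
# Sketch: locate which intraday interval contains the day's high and
# low, then tabulate counts across days (mock random-walk prices).
set.seed(4)
n.days <- 250; n.bins <- 13                  # 13 half-hour bins per session
counts <- matrix(0, nrow = 2, ncol = n.bins,
                 dimnames = list(c("high", "low"), 1:n.bins))
for (d in 1:n.days) {
  px <- cumsum(rnorm(n.bins))                # one simulated path per day
  counts["high", which.max(px)] <- counts["high", which.max(px)] + 1
  counts["low",  which.min(px)] <- counts["low",  which.min(px)] + 1
}
counts   # extremes of a random walk cluster toward the endpoints
```

Comparing such a null-model table against the real intraday counts is one way to judge whether the observed open/midday/close clustering exceeds what randomness alone produces.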

The study is by no means complete or conclusive; I'm just stopping by to show more of the type of data processing and visualization that R is capable of. If anyone has done any more conclusive studies, I'd be interested to hear.

P.S. If anyone notices any odd changes, for some reason Google was having some issues the last few days, and it appears to have reverted to my original (not ready to launch) draft.

## Tuesday, March 8, 2011

### Can one beat a Random Walk-- IMPOSSIBLE (you say?)

Firstly, apologies for the long absence, as I've been busy with a few things. Secondly, apologies for the horrific use of caps in the title (for the grammar monitors). Certainly, you'll gain something useful from today's musing, as it's a pretty profound insight for most (it was for me at the time). I also considered carefully whether or not to divulge this concept, but considering it's often overlooked and already in the public literature (I'll even share a source), I decided to discuss it.

Fig 1. Random Walk and the 75% rule

I've seen the same debate launched over and over on various chat boards concerning the impossibility of theoretically beating a random walk. In this case, I am giving you the code to determine the answer yourself.

The requirements: 1) the generated data must be from an IID Gaussian distribution; 2) the series must be coaxed into a stationary form.

The following script will generate a random series of data and follow the so-called 75% rule, which says:

Pr[ (Price(n) > Price(n-1) and Price(n-1) < Price_median) or (Price(n) < Price(n-1) and Price(n-1) > Price_median) ] = 75%

This very insightful rule (explained both mathematically and in layman's terms in the book 'Statistical Arbitrage,' linked in the Amazon box to the right) shows that, given a stationary, IID random sequence with an underlying Gaussian distribution, the above rule set converges to a correct prediction rate of 75%.

Now, we all know that market data is not Gaussian (nor is it commission/slippage/friction free), and therein lies the rub. But hopefully it gives you some food for thought, as well as a bit of knowledge to retort with when you hear debates about the impossibility of beating a random walk.
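As a quick sanity check, the claimed 75% hit rate can be verified with a short self-contained simulation (my own sketch, separate from the script below; variable names are arbitrary):

```r
# Self-contained check of the 75% rule on an IID gaussian series:
# predict an up-move when the last value is below the median and a
# down-move when it is above, then measure the hit rate.
set.seed(5)
x <- rnorm(100000)             # stationary IID gaussian series
m <- median(x)
prev <- x[-length(x)]          # value at step n-1
nxt  <- x[-1]                  # value at step n
hits <- (prev < m & nxt > prev) | (prev > m & nxt < prev)
mean(hits)                     # converges to about 0.75
```

The intuition: given prev below the median, the independent next draw exceeds it with probability 1 - F(prev), which averages 3/4 over the lower half of the distribution (and symmetrically for the upper half).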

R Code is below.

##################################################
#gen rnd seq for 75% RULE

#generate a stationary IID gaussian series (the increments of a random walk)
rw<-rnorm(100)

m<-median(rw)