Friday, November 19, 2010

Finally! A practical R book on Data Mining: "Data Mining With R, Learning with Case Studies," by Luis Torgo

I've been a bit busy lately with a few big things, however, I wanted to stop by and mention a fantastic book for those who have been following along the R examples.  Anyone who's followed my blog knows that I'm big on practical books with examples.  There are also three main open source tools I've discussed with regards to prototyping trading systems: Weka, Python, and R. Of the three tools mentioned, I've been able to recommend Witten and Frank's book on Data Mining for Weka, and Stephen Marsland's book on Machine Learning as the Python bible for hands on Machine Learning.  Well now, I can thankfully complete the trinity, with Luis Torgo's new book, 'Data Mining with R, Learning with Case Studies.'

Both R novices and experts will find this a great reference for  Data Mining.  The opening chapter has a useful intro to get you started on R (Factors, Vectors, and Data Frames, as well as other useful objects are covered with examples).  Additional chapters cover both classification and regression type prediction schemes.

The most useful chapter to readers here, however, is the chapter on 'Predicting Stock Market Returns.'  Many of the readers who have been looking for example scripts on some of the topics I've covered, will find them here. Not only is gathering and processing data (CSV,  quantmod and yahoo finance, and MySQL) well covered, but various prediction and evaluation schemes (cross validation, sliding and growing windows, PerformanceAnalytics package) are discussed along with access to the author's code.  Many topics I haven't discussed yet are available here as well, including MARS (Multivariate Adaptive Regression Splines), SVMs, and various validation techniques along with handy tabulation of results.  Having read a previous draft, I'm still working into the examples, and welcome any feedback and thoughts I can address.

The book can be accessed via the amazon book showcase on the right and instructions for R code access are available in the book.

Tuesday, August 10, 2010

Conditioning Systems on Regime Variables

Here is a brief and simple example of switching systems based upon regime type (sometimes called gating).

I've brought up the idea of conditioning systems based upon regimes many times in past posts. Some texts call this filtering, although I prefer to use the term conditional gating. The simple idea is to turn on a certain system during certain conditions and either: switching systems, or simply tracking the underlying series during alternate conditions. In this case the gating condition is regime, which in turn is, is High or Low Volatility as measured by the VIX.  Although I'm not divulging the details of the underlying system itself, I've seen enough discussions in public domain to feel that other traders have picked up on the ideas demonstrated here.

 Fig 1. Terminal Wealth vs. VIX threshold

The animation below shows the system results during each step of the conditioning variable, the VIX. Notice the dramatic improvement at the value of 23. Also, notice as mentioned in earlier posts how the optimal switching point of 23 is the most robust value, since even if the OOS results are to the left or right of the optimal switching point, they will be the best local values over a wide range of dependency. The astute observer might have noticed that this system is simply tracking buy&hold during low vix regimes, while switching on system V, during the high regimes. It is evident that the terminal wealth simply tracks buy & hold after a certain value of VIX, since it is always locked on to tracking mode under a certain threshold.

The system is only shown in sample, however, I've found it to be pretty successful OOS as well.

Video 1. Stepping the Equity Curve system through linear VIX range.

Monday, August 2, 2010

Quantitative Candlestick Pattern Recognition (Part 2 -- What's this Natural Language Processing stuff?)

I wanted to briefly add one more thought regarding the temporal nature of probabilities as was alluded to in my correspondence with Adam, as well as the prior closing comments on the Chaos post (structure coalescing and dispersing).

I will borrow from the field of Natural Language Processing and introduce one common visual description of how the states evolve over time using something called a Lexical Dispersion Plot.

Fig 1. Lexical (cluster state vocabulary) Dispersion Plot of Clustered Candlestick States over time

In studies of language, we are often interested in observing how statistical patterns and relationships of sounds, characters, and words, evolve over time.  Natural Language Processing is an entire field that has been dedicated to finding proper tools and vernacular to describe such statistics.  The idea of using a lexical dispersion plot, is to observe how the lexicon itself evolves over time.  To give a simple example, we might take a corpus of common pop culture texts borrowed from some library, and look at the occurrence of the following three word states; "spider", "man", and "spider man".  The first two terms are isolated words, and the third term is called a bigram, which is a joint occurrence of two states in sequential order. 

Now, although I haven't created the proposed lexical dispersion plot for the above scenario, one could reasonably expect for the number of occurrences of the single words, spider and man, to be relatively frequent and uniform from about 1900 to say 1960, while the joint pair (spider,man), might occur relatively sparsely. However, beyond the 60s, we would notice an increase in the joint pair (spider,man) as the popularity of the fictional character began to grow in popularity in the collective pop consciousness.  We might also expect a large frequency of the bigram to occur with recent popularity in the films. However, it's possible that a few hundred years later, that the joint term and character popularity might just wane and eventually die off, even though the two unimodal terms (spider and man) are still frequently observed.

Ok, so what's the point to this? Well, we are commonly taught in statistics that there is a population that exists to describe the ultimate best statistical model of any observational set that lies somewhat beyond the notion of time (much like Plato's ideas of forms existing behind the scenes to describe all nature over all time, for philosophy fans).

But one of the things that disturbed me earlier on is exactly what I described in the prior paragraph on the joint bigram of spider man, which is that sometimes we have to pragmatically shed some of our beliefs about 'ideal' populations and just try to observe statistical phenomena as it occurs temporally.  As mentioned in the Chaos quote, some patterns just spontaneously occur (spider man) for a while, then disappear over time. So, that the notion of a larger population existing behind the scenes (and all the statistical rigor associated with it), might be either overkill or even misleading towards our goal of trying to capture the essence of fleeting patterns. From a statistical viewpoint, I suppose I would lean more on the side of the Bayesian inference camp (constantly updating beliefs online, rather than the frequentist approach).

It's common knowledge in markets that financial time series are not IID (independent and independently distributed) over time. Rather, we accept that there are clusters of regions of behavior that tend to occur frequently together, and likewise, disappear over time (often reappearing again, though not always).  This body of knowledge, specifically related to volatility, is sensibly labeled as heteroscedasticity (differing variance) as opposed to homoscedasticity (constant variance) of observations. We might also notice such behavior being binned and quantified into certain 'regimes' of local stability.

Now, if any of the above meandering made any sense, I will describe how it relates to the Quantitative Candlestick Pattern Recognition article. Recall that using clustering, we were attempting to identify a vocabulary of states that best describe a limited set of features (in the example, six states were identified) that best partition related candlestick symbols by state in an unsupervised manner. However, the dispersion plot in Fig 1. shows that viewed from a perspective of a central population, these states are not uniformly distributed (IID) over time, rather, some tend to occur frequently over relatively long periods of time, while others appear and disappear for reasonable windows of time.  States one and two in the set tend to occur rather frequently, because they are very small moves (dojis and such), which tend to occur often over time. However, some of the larger moves captured in states 3 and 4, tend to persist for some periods, then disappear over other intervals.  The likely explanation, is that larger moves tend to be associated with volatility, which as we know, exhibits heteroscedasticity (clustering together in time). Keep in mind, the dispersion example is not only limited to single symbols over time, but can be extended to any number of n-gram pairs or symbols (such as the two word bigram state for spider man).

With that knowledge in mind, it doesn't always make a whole lot of sense to try to develop and require a central fixed body of pattern statistics and related models over long periods of time, or even require many related statistical tests as neccessary (things like n-fold cross validation over very large time series, bootstrap re-sampling methods with shuffling, and requiring decades of backtesting training data to obtain confidence that we found the best pattern vocabulary to describe data for all time). For instance, in one of the better books on statistics for traders, "Evidence based TA," by Aronson, many of the tests were conducted using t-tests of entire bodies of financial series and rules over a long period of time, while rejecting many potential pockets of temporal success since they were thrown in and bootstrapped with much longer periods of data to draw conclusions about statistical significance of hypotheses related to better than chance success.

This is not to say that common trading statistics should be thrown out; not at all. Instead, it is hopefully to try to look at how the information being evaluated is processed over time (for instance, we may look at long term statistics of trade results, but focus more on short term statistics and modelling of the underlying patterns they are dependent upon).

Additionally, we might be interested in breaking up the pattern information stream into smaller segments and observing and adapting to how the segments of data streams evolve and change over time. The key savior or benefit to us, is that these patterns in the data streams do tend to persist together for quite some time (often reasonably long), before dispersing and moving on to new forms of patterns.  There are several different machine learning concepts on the horizon that work with evolving (such as adding and pruning pattern model parameters) data streams over time and space. I have been spending some time evaluating one of them recently (although, I'm not saying which at the moment) which looks promising.

Tuesday, July 6, 2010

Chaos in the Financial Markets?

Over the years I've had quite a few interested individuals ask me about Chaos and its applications towards trading. Well, as hidden markov models and speech processing were made popular by James Simons and his team at Renaissance Technologies, one could trace much of the popularity of Chaos theory and its financial applications to Norman Packard and Doyne Farmer, two former physicists working in the area of complex systems. Much of their story is discussed in the book, 'the Predictors,' by Thomas Bass. The two were on the forefront of new research in areas of complexity and chaos and had decided to parlay their knowledge into applications towards the financial markets as they founded the company called Prediction Company in Santa Fe, New Mexico. The company was swallowed by UBS systems in 2005.

Fig 1.   2D slice of Lorenz Attractor Phase Space signature A.K.A. The butterfly effect.

We could spend a lot of time discussing various facets of Chaos, as it is a very large field with many different related fields (such as fractals and complexity). But I want to focus on outlining a very simple understanding of why the field seemed so fascinating to traders and quants alike. Most experts in time series run a slew of tests to demonstrate that markets exhibit no predictable order.
However, what makes Chaos so fascinating is that certain time series may seem to pass a battery of common statistical tests for randomness, yet are perfectly deterministic.

Chaos is a field of science that is engaged in studying non-linear dynamical behavior of systems. A popular example might be how different planetary bodies exhibit forces upon one another (see Poincare), or turbulent flow of various particles. Those engaged in studying such systems often like to observe signatures in a domain known as phase space or state space. Rather than look at the time series unfold over time, they are looking for a type of order that underlies the trajectories of the system state dynamics as it unfolds over time. The plot of the trajectory may show structural order that may appear random in the time domain. One of the most popular attractor signatures is the well known Lorenz attractor, which is better known to popular literature as the butterfly effect. Fig 1, displayed earlier, shows a 2d slice of the phase space trajectory, that you can run over at chaos applet. The famous signature displays a fascinating case of underlying order that relates to dynamic atmospheric convection in three dimensions.

A more simple and applicable model that we will look at for illustration purposes, is the well known Logistic Map (also known as quadratic map and Fiegenbaum map). This equation of a non-linear dynamical trajectory was investigated and attributed to Robert May, a biologist studying models of fish populations.

The recursive equation for the Logistic Map is: xnext = r*x(1-x)
Note that this is a feedback system which has some control over the dynamics of the system model by varying the value r. What you'll see if you plot it out vs. the control coefficient, r, is that the series moves from a stable system to one which bifurcates into periodic cycles; and as it approaches the value 4 it starts to behave chaotically. Chaotic behavior is aperiodic, which (like financial series) never repeats exactly; but (unlike financial series) has an underlying deterministic order. In order to see the beauty of chaos and how it exhibits determinism, let's first look at the time series.

Fig 2. Time Series of Logistic Map Equation.

Notice the 1st plot, which shows the 1st differenced time series, shows no signs of periodicity or determinism, nor does the cumulative walk display on the 2nd. However, if we look at the phase space signature of the same series in phase space, we see a completely different picture.

Fig 3. Phase State Plot of Logistic Map

Notice it is clearly deterministic in this figure. I.e. given any point on the x axis, we can easily determine the exact corresponding point one step into the future on the y axis.  This should be fairly obvious, since the equation we started out with xnext=r*x(1-x) = r*x-r*x^2 is a negative parabolic curve. However, many such time series do not have such a simple equation and must be tested in various ways to determine structural chaos. There are other issues to contend with as well, including sensitivity to initial conditions and divergent trajectories due to finite computational precision.

Ok, now that we understand all the hoopla about Chaos and a seemingly random signal having an underlying deterministic signature, what about a common financial time series?

Fig 4. Typical Random Walk (Financial) Time Series.

The Random Walk shows no discernible order, nor periodicity; similar to the logistic equation series. But what about if we observe the phase space trajectory?

Fig 5. Phase Space trajectory of random walk.

No Dice. Notice that the random walk shows zero determinism, hence the gaussian nature.  There are numerous methods to display higher order dimensional metrics as well (correlation lag plots, Lyapunav exponents, etc), and other than something called the compass rose, I've not personally seen much evidence of deterministic chaos in raw financial time series. Incidentally, the phase plot here is equivalent to a lag one scatterplot of returns for those more familiar with finance related statistics.

A last note on the Prediction Company, is that there is an often referenced paper on the mackey-glass equation
(a non linear dynamic model of blood flow) by Meyer and Packard, whereby they used genetic algorithms to find underlying conditional order rule sets for the series. 

I'll end with perhaps the best excerpt from 'the Predictors,' which echoes much of my own focus and discoveries over the last decade...

"One of the fundamental truths about the markets is that the dynamics are nonstationary," Norman explains. "We see no evidence for the existence of an attractor with stable statistical properties. This is what characterizes chaos -- having an attractor with stable statistical properties-- so what we are seeing is not chaos. It is something else. Call it an 'even-stranger-than-strange attractor,' which may not really be an attractor at all.

The market might enter an epoch where some structure coalesces and sits there in a statistically stationary pattern, but then invariably it disappears. You have clouds of structure that coalesce and evaporate, coalesce and evaporate. Prediction Company's job is to find those pieces of structure that have the strongest signal and persist the longest. We want to know when the structure is beginning to emerge or dissolve because, once it begins to dissolve, we want to stop betting on it."

...excerpt from the Predictors, Thomas A. Bass (1999).

Thursday, June 10, 2010

Quantitative Candlestick Pattern Recognition (HMM, Baum Welch, and all that)

Fig 1. Clustering based approach to candlestick Pattern Recognition.

I've been reading a book titled, 'the Quants,' that I'm sure will tantalize many traders with some of the ideas embedded within. Most notably (IMO), the notion that Renaissance's James Simons, hired a battery of cryptographers and speech recognition experts to decipher the code of the markets. Most notable of the hired experts was, leonard Baum, co-developer of the Baum-Welch algorithm; an algorithm used to model hidden markov models. Now while I don't plan to divulge full details of my own work in this area; I do want to give a brief example of some methods to possibly apply with respect to these ideas.

Now most practitioners of classical TA have built up an enormous amount of literature around candlesticks, the japanese symbols used to denote a symbolic formation around open, high, low, and close daily data. The problem as I see it, is that most of the literature that is available only deals with qualitative recognition of patterns, rather than quantitative.

We might want to utilize a more quantitative approach to analyzing the information, as it holds much more potential than single closing price data (i.e. the information in each candle contains four dimensions of information). The question is, how do we do this in a quantitative manner?

One method to recognizing patterns that is well known is the supervised method.  In supervised learning, we feed the learner a correct list of responses to learn and configure itself from; over many iterations, it comes up with the most optimal configuration to minimize errors between the data it learns to identify, and the data we feed as examples.  For instance, we might look at a set of hand-written characters and build a black box to recognize each letter by training via a neural net, support vector machine, or other supervised learning device. However, you probably don't want to spend hours classifying the types of candlesticks by name. Those familiar with candlesticks, might recognize numerous different symbols; shooting star, hammer, doji, etc... Each connotating a unique symbolic harbinger of the future to come. From a quantitative perspective, we might be more interested in understanding the bayesian perspective; I.e. P(upday|hammer)=P(upday,hammer)/P(hammer), for instance.

But how could we learn the corpus of symbols, without the tedious method of identifying each individual symbol by hand? This is a problem that may better be approached by unsupervised learning. In unsupervised learning, we don't need to train a learner; it finds relationships by itself. Typically, the relationships are established as a function of distance between exemplars. Please see the data mining text (Witten/Frank) in my recommended list in order to examine the concepts in more detail.  In this case I am going to train using a very common unsupervised learner called k-means clustering.

 Fig 2. Graph of arbitrary window of QQQQ data

Notice the common time ordered candlestick form of plotting is displayed in Fig 2.  Now using k-means clustering, with a goal of identifying 6 clusters, I tried to automatically learn 6 unique candlestick forms based on H,L,Cl data relative to Open in this example. The idea being that similar candlestick archetypes will tend to cluster by distance.

Fig 3. Candlestick symbols sorted by 6 Clusters

Notice in Figure 3, that we can clearly see the k-means clustering approach automatically recognized large red bodies, green bodies, and even more interestingly, there are a preponderance of hammers that were automatically recognized in cluster number 5.

So given that we have identified a corpus of 6 symbols in our language, of what use might this be? Well, we can take and run a cross tabulation of our symbol states using a program like R.

Fig 4. Cross Tabulation of Clustered States.

One of the things that strikes me right away is that there are an overwhelming number of pairs with state 1s following state 5; Notice the 57% frequency trounces all other dependent states.  Now what is interesting about this? Remember we established that state 5 corresponds to a hammer candlestick? Well, common intuition (at least from my years of reading) expects that a hammer is a turning point that is followed by an up move. Yet, in our table we see it is overwhelmingly followed by state 1, which if you look back, at the sorted by cluster diagram, is a very big red down candlestick. This is completely opposite to what our common body of knowledge and intuition tells us.

In case it might seem unbelievable to fathom, we can resort the data again, this time in original time order, but with clusters identified.

Fig 5.  Visual Inspection of hammer (state5) likely followed by down candle (state 1)

We can go back and resort the data and identify the states via the resorted cluster ID staircase level(5), or use labels to more simply identify the case of hammer(5) and its following symbol. Notice that contrary to common knowledge, our automatic recognition process and tabulated probability matrix, found good corroboration with our visual inspection. In the simple window sample (resized to improve visibility), 4 of the 5 instances of the hammer (state 5) were followed by a big red down candle (state 1). Now one other comment to make is that in case the state 5 is not followed by state 1 (say, we bet on expecting a down move), it has a 14.3% chance of landing in state 6 on the next move, which brings our likelihood of a decent sized down move to 71.4% overall.

We can take these simple quantitative ideas and extend them to MCMC dynamic models, Baum Welch and Viterbi algorithms, and all that sophisticated stuff. Perhaps one day even mimicking the mighty Renaissance itself?  I don't know, but any edge we can add to our arsenal will surely help.

Take some time to read the Quants, if you want a great laymen's view of many related quant approaches.

There may be some bugs in this post, as google just seemed to update their editing platform, and I'm trying to iron out some of the kinks.

Tuesday, May 25, 2010

The Kalman Filter For Financial Time Series

Every now and then I come across a tool that is so bogged down in pages of esoteric mathematical calculations, it becomes difficult to get even a simple grasp of how or why they might be useful. Even worse, you exhaustively search the internet to find a simple picture that might express a thousand equations, but find nothing. The kalman filter is one of those tools. Extremely useful, yet, very difficult to understand conceptually because of the complex mathematical jargon. Below is a simple plot of a kalman filtered version of a random walk (for now, we will use that as an estimate of a financial time series).

Fig 1. Kalman Filter estimates of mean and covariance of Random Walk

The kf is a fantastic example of an adaptive model, more specifically, a dynamic linear model, that is able to adapt to an ever changing environment. Unlike a simple moving average or FIR that has a fixed set of windowing parameters, the kalman filter constantly updates the information to produce adaptive filtering on the fly. Although there are a few TA based adaptive filters, such as Kaufman Adaptive Moving Average and variations of the exponential moving average; neither captures the optimal estimation of the series in the way that the KF does. In the plot in Fig 1. We have a blue line which represents the estimated dynamic 'average' of the underlying time series, where the red line represents the time series itself, and lastly, the dotted lines represent a scaled covariance estimate of the time series against the estimated average. Notice that unlike many other filters, the estimated average is a very good measure of the 'true' moving center of the time series.

Without diving into too much math, the following is the well known 'state space equation' of the kf:
xt=A*xt-1 + w
zt=H*xt + v

Although these equations are often expressed in state space or matrix representation, making them somewhat complicated to the layman, if you are familiar with simple linear regression it might make more sense.
Let's define the variables:
xt is the hidden variable that is estimated, in this case it represents the best estimate of the dynamic mean or dynamic center of the time series
A is the state transition matrix, or I often think of it as similar to the autoregressive coefficient in an AR model; think of it as Beta in a linear regression here.
w is the noise of the model.

So, we can think of the equation of x=Ax-1 + w as being very similar to the basic linear regression model, which it is. The main difference being that the kf constantly updates the estimates at each iteration in an online fashion. Those familiar with control systems might understand it as a feedback mechanism, that adjusts for error. Since we can not actually 'see' the true dynamic center in the future, only estimate it, we think of x as a 'hidden' variable.

The other equation is linked directly to the first.
zt is the measured noisy state variable that has a probabilistic relationship to x.
xt we recognize as the estimate of the dynamic center of the time series.
v is the noise of the model.

Again, it is a linear model, but this time the equation contains something we can observe: zt is the value of the time series we are trying to capture and model with respect to xt. More specifically, it is an estimate of the covariance, or co-movement between the observed variable, the time series value, and the estimate of the dynamic variable x. You can also think of the scaled envelope it creates as similar to a standard deviation band that predicts the future variance of the signal with respect to x.

Those familiar with hidden markov models, might recognize the concept of hidden and observed state variables displayed here.

Basically, we start out estimating our guess of the the average and covariance of the hidden series based upon measurements of the observable series, which in this case are simply the normal parameters N(mean, std) used to generate the random walk. From there, the linear matrix equations are used to estimate the values of cov x and x, using linear matrix operations. The key is that once an estimate is made, the value of the covariance of x is then checked against the actual observable time series value, y, and a parameter called K is adjusted to update the prior estimates. Each time K is updated, the value of the estimate of x is updated via:
xt_new_est=xt_est + K*(zt - H*x_est). The value of K generally converges to a stable value, when the underlying series is truly gaussian (as seen in fig 1. during the start of the series, it learns). After a few iterations, the optimal value of K is pretty stable, so the model has learned or adapted to the underlying series.

Some advantages to the kalman filter are that is is predictive and adaptive, as it looks forward with an estimate of the covariance and mean of the time series one step into the future and unlike a Neural Network, it does NOT require stationary data.
Those working on the Neural Network tutorials, hopefully see a big advantage here.

It has a very close to smooth representation of the series, while not requiring peeking into the future.

Disadvantages are that the filter model assumes linear dependencies, and is based upon noise terms that are gaussian generated. As we know, financial markets are not exactly gaussian, since they tend to have fat tails more often than we would expect, non-normal higher moments, and the series exhibit heteroskedasticity clustering. Another more advanced filter that addresses these issues is the particle filter, which uses sampling methods to generate the underlying distribution parameters.

Here are some references which may further help in understanding of the kalman filter.
In addition, there is a kalman smoother in the R package, DLM.

If you are interested in a Python based approach, I highly recommend the following book...Machine Learning An Algorithmic Perspective

Not only is there a fantastic writeup on hidden markov models and kalman filters, but there is real code you can replicate. It is one of the best practical books on Machine Learning I have come across-- period.

Wednesday, May 12, 2010

Is it possible to get a causal smoothed filter ?

Although I haven't been all that much of a fan of moving average based methods, I've observed some discussions and made some attempts to determine if it's possible to get an actual smoothed filter with a causal model. Anyone who's worked on financial time series filters knows that the bane of filtering is getting a smooth response with very low delay. Ironically, one would think that you need a very small moving average length to accomplish a causal filter with decent lag properties; often a sacrifice is made between choosing a large parameter to obtain decent smoothing at the cost of lag.

I put together the following FIR based filter using QQQQ daily data for about 1 year worth of data. It is completely causal and described by .. gasp.. 250 coefficients.

Does it appear smooth? You decide.

Fig 1. FIR 250 tap feed forward filter

Fig 2. 250 weight impulse response determining coefficients

The impulse response is approximately a sinc function, which is the discrete inverse transform for an ideal 'brick wall' low pass filter.

I haven't actually verified much out of sample at the moment, so it's quite possible that the model may not fare as well; it remains to be investigated. However, thought I would share this work to give some ideas about potential of causal filtering methods.

Wednesday, April 28, 2010

Wavelet Spectrogram Non-Stationary Financial Time Series analysis using R (TTR/Quantmod/dPlR) with USDEUR

I've been doing some research lately regarding types of spectral imaging and decomposition techniques that apply to non-stationary signals. As mentioned earlier, one of the major problems with the simple fourier analysis is that the basis functions extend to infinity in both directions and the signals are assumed to be stationary. Although, I won't expand too much right now, one of the advantages of wavelets is that they use local small windowed basis functions, allowing them to capture not only non-stationary signals, but signals that are aperiodic: two large advantages over fourier based methods when dealing with financial time series.

I put together a few small examples to understand how to visually understand a spectrogram.

Fig 1. Simple 58 day cycle captured with 11 octaves and 2048 (2^11) data points

As in earlier tutorial based posts, we use a simple 58 day cycle to show the basic time series sine based waveform. Now the plot on the bottom is known as a spectrogram. The type of wavelet operation for this spectrogram is known as a continuous wave Morlet transform. The package is dpLR (The Dendrochronology Program Library) put together by Andy Bunn . The package was designed to analyze tree rings. Notice that there are a multitude of tools utilizing this type of technology, ranging from MRIs, to climatology, to speech processing. It is, IMO, the modern day version of dft type spectral tools (however, for non-stationary and aperiodic signals). Now looking on the spectrogram plot, please keep in mind the units are Days not Years (I need to see how to alter that, hopefully Dr. Bunn is listening=).

The time scales represents linear time, or a window of 2048 days that was sampled. We could have used any time series, but it needs to be length=2^N; if not, there is a function to pad the rest of the data with zeros to make up that length. The vertical scale is a log scale that shows what are called 'octaves'. Borrowing from musical vernacular, we can think of them of scales which double in magnitude for every prior scale and represent localized frequency energy information at such scales. The colors represent the heat or power of the signal in regions of interest. Due to some issues with this transform, we ignore uncertain information outside of the dark parabolic region (cone of influence). It is clear that the highest power is the dark red region right at around 58 days. What is important here is not so much to understand the exact value of the cycle, but the persistence in the dominant cycle (s). We notice the cycle persists throughout the entire spectrogram Time Series length (much as we would expect from the 2D time series plot).

What happens if we use different frequencies that change over time? Here we notice a clear advantage over fourier based methods. A fourier based decomposition would be able to locate the dominant tones, however, because it uses infinite bases, the reconstructed signal would not capture the isolation of different frequencies.

Fig 2. Composite Stationary Time Series comprised of 3 dominant tones

Notice, that we can clearly see the regions of dominant tones by following the chart and looking for the most concentrated power (red) regions, which are around 48, 253, and 532 day cycles. We also notice that the power density can be viewed in terms of time context, our eyes simply follow along in time and observe strong regions of signal energy concentration.

Ok, but what about if the signal itself is non-stationary?

Fig 3. Composite signal added to exponential curve to make signal non-stationary

Notice, that even though we now have a non-stationary signal, the regions of underlying cyclic component stability are still detectable by eye!

Lastly, a financial time series of USDEUR was captured via TTR/Quantmod packages.

Fig 4. USDEUR time series spectrogram

Notice even with the non-stationary financial signal, there is a very clear dominant cycle pattern that is persistent at roughly 255 days (anyone familiar with trading recognizes that as the approximate number of trading days per year).

Keep in mind that there are also aliases (and spreading) present in sampling methods which may look like periodic signals, but are merely digital artifacts of the underlying sampled signal. We also see the very short term noise present in the bottom lower scales.

Another interesting application of this is that it may not only be used as a modern tool to augment non-stationary decomposition, but for those familiar with pattern based techniques, it (and the periodogram counterpart) is often used in pattern recognition and markov type modeling.

That's all for now. Hopefully, you have gained some appreciation for wavelet based spectral techniques vs. Fourier spectral based analysis.

I have been debating whether to break up the post, but because I was added to the R bloggers thread, I wanted the post to be complete for local readers.

That's it for now.

Saturday, April 3, 2010

Why isn't my 2X Ultra ETF keeping pace with the market and what is path asymmetry (R ex)? Part 2

I created an example to show how the theory from part 1 might be applied using S&P500 as a proxy for performance. Just in case anyone viewing is not familiar with terminal wealth, it is the final (usually compounded) ending value (hence, terminal) of the account.

Fig 1. Example of S&P 500 and using GBM monte carlo simulations for terminal wealth

A monte carlo simulation of GBM, using historical daily%change parameters(mean,std), was run for 10,000 iterations of a time series length=1000. The length was chosen to approximate slices of about 3yrs for summary statistics of terminal wealth (a good approximation for market timing). I also used the long term historical mu and std of the series, although it might be a bit biased towards longer horizons. Possibly, I could generate more of a 3yr sampling distribution of N(u,std), for more relevance, but for now we'll assume the long run parameters are a good approximation.

Graphical summary statistics using boxplots and density estimates are shown for the monte carlo simulations. What strikes me at first glance, is that the -2x instrument performs absolutely horrible in most cases, adding to the common knowledge that markets have upwards drift. If you are ever stuck holding a position, just hope it isn't short (we've all experienced the deer in the headlight phenomenon at one time or another); statistically, it is not the best side to be stuck on for any long period.

Another more interesting observation, however, is that the simple 1X underlying instrument mode is to the right of all the density estimates. In addition, you are clearly taking on wider variance/risk, by using the positive (and neg) leveraged 2x instrument. In essence, you are seeing some of kelly principles at work here. By taking on 2X risk, while you have a chance of larger gains, statistically, you are not likely to do too much better than 1x, while taking on far greater risk on the negative side.

Lastly, there are two sample slices shown of the actual results, using arbitrary periods of performance. It is clear, that during periods of long trends, we have much better growth in the 2X instrument, unfortunately, we don't know when those trends will occur, and secondly, according to the monte carlo sims, they are not that likely to occur.

The most recent performance, displayed, is a perfect example of a series where both 2X instruments performed worse than the underlying, as explained in part 1.

Below is a summary of the three series, ser(1X), ser2pos(+2X), ser2neg(-2X)
> summary(ser)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3613 1.0800 1.3290 1.3870 1.6250 4.7460
> summary(ser2pos)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1178 1.0630 1.6100 1.9180 2.4070 20.5700
> summary(ser2neg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0337 0.2859 0.4279 0.5173 0.6483 5.6480

Notice the Median of +2X is nowhere near 2 times the Median of the underlying. Although 2X has some fantastic outliers, you shouldn't expect them statistically.
It's sort of like tossing a coin with compounding the full amount, whereby, you get a fantastic result for the winning outcome, unfortunately, there is a 75% probability of going bankrupt (maybe I'll cover that one another time).

One final comment is that the monte carlo sims used GBM, whereas a more likely jump diffusion process would create much fatter tails, meaning even more neg tail risk against the potentially nice looking 2X instrument potential gains.

Wednesday, March 31, 2010

Why isn't my 2X Ultra ETF keeping pace with the market and what is path asymmetry (R ex)?

I've been reading a few articles lately, lambasting ultra ETFs for not keeping up with markets and ascribing the problem to weird unexplainable reasons such as portfolio derivative re-balancing and negative drift. I thought it would be nice to revisit the concept of path asymmetry. Although there are many different definitions of price asymmetry (econometrics for example), in this case I'm simply referring to the asymmetrical nature of percentage price movements vs dollar movements and their final cumulative outcome given any arbitrary path.

Fig 1. Example of ultra 2X ETFs and path asymmetry

Many people seem to find it incomprehensible (if not reprehensible) that an underlying series may move a certain direction, yet, both the ultra short and ultra long series both finish below the underlying over the long run. What exactly is path asymmetry? Some traders might be familiar with the notion that if you lose some percentage of your account, like 50%, that you need more than 50% to make up for the loss. That is an example of path asymmetry (I should note someone also mentioned it's an example of Seigel's paradox).

Let's look at a very simple example of how this might affect a stock and it's 2x counterparts. Suppose a stock moves from 100dollars to 80 and back to 100 again-- break-even. The move from 100 to 80 on a percentage basis, was a 20% loss. However, to recoup that amount, we need to solve for 80*(1+x)=100; the answer is 25%, not 20%. This means even though the dollar amount is identical for both moves (20dollars down and up), the %amount is not. That is an example of path asymmetry. How does this affect the 2X ultra Leveraged ETFs? Well, since each ETF is designed to track twice the daily move of the underlying, the the +2x ETF will move 40% down, then it will move 50% back up, for a net dollar ending value of 90 dollars. The -2x ETF will move up 2x or 40% to 140, and then retrace -50% leaving it at only 70 dollars. Notice in both cases, each ultra ETF ends up below the underlying price. It is the simple mechanics of path dependency and asymmetry that account for this, even with perfect 2x leveraging. It is important to take into account path dependencies when dealing with any leveraged product, including hedging.

Now keep in mind, there is additional drag on these products, due to fund expenses, which does add merit to the original question. More on this is explained succinctly in this article by Alpha's Tristan Yates and Lye Kok .

Wednesday, March 24, 2010

Modified Donchian Band Trend Follower using R, Quantmod, TTR -Part 2: Parameter Sweep Sensitivity over long run

Here is a small update to the Donchian Channel type system I displayed in the last post.

Fig 1. Sensitivity of Net Combined L/S Gain to parameter n.

Using the S&P500 index as a proxy for the market, a simulation was run over the lifetime of the index. Notice the system excels in both the very short run, and much longer periods. The short system did very poorly overall and did not perform nowhere near the long side in any of the overall periods (except maybe very short term). A possible explanation is that short side systems do not do very well in the long run due to upward drift of markets. In addition, short side runs do not have the inherent compounding power of long sides as they are asymmetrical. The most you gain on a short run is double your original value, where the long side is unlimited (one way around this limitation is using inverse ETFs). I believe many common simulators err in the effects and method of this computation.

Fig 2. Some long term results of strategy with parameter n=140

The above figure shows the results of choosing a parameter near the optimal region. In light of commissions and limited short strategy performance over longer periods, it might pay to use the long only portion of the strategy. Another observation is to possibly step aside during highly volatile regions in order to capture the beneficial areas of the long strategy. Some of the methods to approach this type of regime switching have been mentioned in earlier posts.

One last comment to think about when hearing detractors regarding 'curve fitting' and optimization, is that as evidenced in the above simulation, you will often find the the local optimal parameter value turns out to be the most robust, as it will perform best over a wide range of sensitivity to parametrization.

Friday, March 12, 2010

Modified Donchian Band Trend Follower using R, Quantmod, TTR

I've been toying around with the examples given on the FOSS trading site for some of the great work they've put together in the Quantmod and TTR packages. Those viewers who are looking for a nice (and free) backtesting suite to possibly complement some of your other results or work in say, Weka, should familiarize yourselves with R. Not only can it serve as a canvas to simulate ideas and concepts, but can process the backend results towards more trader oriented metrics, than using something like Weka as a standalone tool. As you gain more proficiency in data mining and machine learning concepts in Weka, you can also make the move to integrate the tools inside of R, as R contains the majority of machine learning schemes inside of various packages.

Fig 1. Modified Donchian Channel System Simulation

As an example of how to use some of their tools (along with traditional R packages) for fast prototyping, I put together an example of a modified Donchian Channel trend following system along with how you might simulate it using R. The typical Donchian Channel Bands are used as breakout entry and exit signals. I.e. once an n period high has been breached you go long, then exit when the n period low has been breached -- visa versa for short. In this example, however, we simply enter long on the average line break and stay long as long as it is above. A short signal is entered on the average line break to the downside. Unlike a price/moving average type system, there wasn't a lot of choppiness causing false starts around the average line, which is a plus.

I am still trying to familiarize myself more with the tools, and am still at the point where I like how simple and fast the static vector computations work (similar to numpy in Python), but I am wondering how fast more sophisticated entry/exits requiring loops will work. I still expect to work on some of these types of scenarios, as I am really enjoying the capabilities of R along with some of these trading oriented packages.

Although the system (using QQQQ as an example) is in no way optimized nor analyzed for robustness, it returned a respectable 60% versus a buy and hold loss over the past roughly two years (showing a simple example of trend type trading).

Here is the complete code for you to replicate (I used the R version
Note: if some of it looks familiar to the FOSS RSI example, it is exactly because I used that example as a starting point, so there will be some overlap in comments and actions.

# We will need the quantmod package for charting and pulling
# data and the TTR package to calculate Donchian Bands.
# You can install packages via: install.packages("packageName")
# install.packages(c("quantmod","TTR"))
# See Foss Trading Blog for RSI template


end<- "2010-03-08"

# Pull tckr index data from Yahoo! Finance
getSymbols(tckr, from=start, to=end)<-QQQQ[,6]

#Plotting Donchian Channel

par(mfrow=c(2,2), oma=c(2,2,2,2))

# max, avg, min <- red, blue, green
plot(,ylim=c(ymin,ymax),pch=15,main="donchian bands max/avg/min")

# Create the long (up) and short (dn) signals
sigup <-ifelse( > dc[,2],1,0)
sigdn <-ifelse( < dc[,2],-1,0)

# Lag signals to align with days in market,
# not days signals were generated
sigup <- lag(sigup,1) # Note k=1 implies a move *forward*
sigdn <- lag(sigdn,1) # Note k=1 implies a move *forward*

# Replace missing signals with no position
# (generally just at beginning of series)
sigup[] <- 0
sigdn[] <- 0

# Combine both signals into one vector
sig <- sigup + sigdn

# Calculate Close-to-Close returns
ret <- ROC(tckr_obj[,6])
ret[1] <- 0

# Calculate equity curves
eq_up <- cumprod(1+ret*sigup)
eq_dn <- cumprod(1+ret*sigdn)
eq_all <- cumprod(1+ret*sig)

plot(eq_all,ylab="Combined",col="blue",main="combined L/S equity")
title("Modified Donchian Band Trend Following System (", outer = TRUE)


P.S. As always, please use your own due diligence in all work borrowed from this site. There are some areas that I believe are not quite correct in the simulation framework, needless to say, you have a complete script to start your own examples and backtesting.

Wednesday, February 24, 2010

FFT (Fast Fourier Transform) of time series -- promises and pitfalls towards trading

Fig 1. FFT transformed time series (EBAY) reconstructed with first three and twenty harmonics, respectively.

I see quite a few traders interested in advanced signal processing techniques. It is often instructive to see why they may or may not be useful. The concept behind fourier analysis is that any periodic signal can be broken down into a taylor series or sum of suitably scaled sine and cosine waveforms (even a square wave!). The key requirement is that the signals are periodic, which means that they repeat forwards and backwards to plus and minus infinity. Anyone who deals with financial series knows they are aperiodic (meaning they do not repeat indefinitely). The FFT, or fast fourier transform is an algorithm that essentially uses convolution techniques to efficiently find the magnitude and location of the tones that make up the signal of interest. We can often play with the FFT spectrum, by adding and removing successive tones (which is akin to selectively filtering particular tones that make up the signal), in order to obtain a smoothed version of the underlying signal.

In the posted example, I showed the effect of reconstructing the transformed waveform using only the first three tones (and cutting off all others), where we see a low passed version of the signal. The second example includes the first 20 tones, which begins to match the signal more closely, but is a smoothed representation of the signal, which is often a nice representation to isolate smoothed signal component from high frequency noise. Notice the terms tones and harmonics are practically synonymous for purposes of this discussion (a harmonic is more specifically a multiple of the fundamental tone); both represent the spectral frequency components that sum up to make the total waveform. The major problem that I wanted to illustrate with this simple example (among many), is the problem of 'wraparound effects.' As I mentioned earlier, one of the requirements for properly applying a fourier transform is that the signal is periodic or repeating, since the basis functions (sines and cosines) that are convolved are infinitely repeating functions.

With that requirement, the reconstructed waveform tries its best to match the beginning and endpoints for periodic repetition. The result is severe problems at the endpoints (left and right), which are often the points we are most concerned about. So it often pays to be cautious when hearing about applications of higher level signal processing techniques. There are several other requirements and limitations to applying FFT techniques, among them: requirement of 2^n samples, fsample must be greater than or equal to twice the max bandwidth of sampled signal (nyquist criterion), limited spectral tone bin resolution; ignoring any of these issues can cause severe reconstruction errors.

Monday, February 22, 2010

Time Series Calendar Heat Maps Using R

I came across an interesting blog that showcased Charting time series as calendar heat maps in R . It is based upon a great algorithm created by Paul Bleicher,CMO of Humedica. I'll let you link to the other blog to see more details on the background and original source code.

I made a very small modification to allow %daily changes, rather than price values.

calendarHeat($Date[1:length($Date)-1], stock.dailychange, varname="SPY daily % changes(CL-CL)")

Fig 1. Calendar Heat Map for SPY time series 2005-Present

What's interesting is you can see how unusual events tend to Cluster (heteroscedasticity) , and a preponderance of low change days (as would be expected in close to Gaussian distributions). Using the regions of clustering might help warn of impeding catastrophic regimes (as seen in late 08), similar to using VIX as a proxy. In addition, the 10,000 foot bird's eye view, might allow you to spot pockets of order for further evaluation.

Saturday, February 20, 2010

Genetic Algorithm Systematic Trading Development -- Part 3 (Python/VBA)

As mentioned in prior posts, it is not possible to use the standard Weka GUI to instantiate a Genetic Algorithm, other than for feature selection. Part of the reason is that there is no generic algorithm to instantiate a fitness function. The same flexibility that allows an infinite possible range of fitnesses also requires custom scripting. Although it is possible to write a custom class for Weka/JAVA, I chose to utilize Python for this example, along with an older VBA tool I developed for the back-end results summary. Hopefully, you'll see that there are many tools that may be utilized to prototype various systems and augment the development process.

The essential GA uses a 17 bit string length to encode the following rule:

{if ma(m) binop ma(n) then buy}

The first 8 bits are used to encode the 1st ma value. Note there are 2^n = 2^8 = 256 potential decimal values that can be used for the parameter argument. The 9th bit is a 2 bit encoded value of the > or < binary operator as discussed in prior posts. The last 8 bits are used for the 2nd moving average parameter value. A simple fitness of the net dollar return was used for this example (Note Sharpe ratio, and other fitness metrics could have been used). The input series is SPY, using the range from 1993-2005 daily to optimize.

The python script was essentially set up to run 40 generations of a population of size 20 using elitism and tournament selection. Although this is by no means optimal (it is quite small), it was set up using these values for illustrative purposes. When you watch the video, what you'll see is the initial population in binary encoded strings each time a generation is passed. In addition, the decoded moving average rule is shown for each selection change. Although the video has been truncated for brevity, you should notice that the fitness number is improving each generation. The final solution was designed to halt after a fitness did not improve over five generations. In addition, you can see the final encoded result and a plot of the fitness convergence.

Video 1. Optimization of MA parameters using Python GA

Fig 1. Final Fitness result output to console

In fig 1. we see that the final rule converged to {if ma(220) > ma(221)) then Buy.
In addition, the final binary string is shown along with the final fitness value.
We can decode the binary string with relative ease.
[110110111110111100] is the 17 bit string representing the optimal fitness.
ma1 is 1st 8 bits = 11011011 = 219 decimal a +1 offset was used (so as not to have 0 day moving average) to get a resulting parameter argument of 220.
The next bit is = 1 corresponding to >
The final 8 bits represent the 221 argument by similar reasoning as the first.
So the resulting rule with parameters is:
if ma(220) > ma(221) then Buy
fitness = net$gain = $316.12

fig 2. Fitness Convergence

In fig 2. We see how the fitness continued to climb over successive generations until early convergence caused a halt at the fitness value that did not change over the prior 5 generations.

In order to verify the results, we will also show how other tools may be used. In this case, I used an older VBA simulator that I wrote a few years back.

Video 2. Summary of optimized parameters using VBA/Excel

Fig 3. Summary of Back Test Results

Above is a capture of the summary statistics using the back test program. Total net profit is slightly higher than the python results. This is due to the fact that the python simulation truncated the series length of the moving average data, so as to avoid zero front padded values, while the excel program did not. However, they are still in close agreement. It's often useful to use several different programs to force yourself to double check results.

Now, as an astute commenter already pointed out... this method is indeed curve fitting. What we found was the best possible pair of parameters(or at least one of the best; there are superior parameters, but I didn't run the example generation set too long) for our particular rule set we set out to investigate. Or as I mentioned in the first thread, we zeroed in on the region of the distribution curve with the most profitable candidates. Now, for those of you not familiar with curve fitting, it is not a happy concept amongst developers. In fact, it suffers from almost the same egregious problems as cherry picking examples, as I mentioned earlier on.

That being said, however, it is not done in vain. Our goal here is to quantitatively augment common development (the part where you create and verify) tools beyond mere guessing, intuition, and cherry picking. Firstly, it is possible that this particular rule set will not fare as well out of sample, which is true. However, in the same sense that we can not just take one cherry picked example for granted, we must also evaluate how things actually do perform out of sample. I say this because I've used similar techniques that looked very good, and did indeed perform very well out of sample for several periods out into the future. By honing in on the best candidates, we help to narrow down the set of candidates that are worthy of out of sample investigation. There are other additional techniques (some mentioned earlier, such as ensemble methods, different objective/fitness functions, and even different optimization criteria) that can be used to enhance this method, and in addition, verify robustness out of sample.

edit: Just for giggles, I decided to actually run the Out of Sample performance on this optimized in sample trained rule. The following chart illustrates how it performed 'out of sample' for the years 2005-today(2010).

Fig 4. Out of Sample Test Performance on optimized training rule parameters.

Not all that shabby for that curve fitted simple system during the worst meltdown in recent history, eh (much easier on the gut)?

To be frank, I have run so many evaluations on simple SMA systems, that I would say that they are not the most superior parameters to optimize around. Obviously, however, it really depends on what your objective is. There are some long term studies that have shown using the fitness objective of reduced volatility as the goal is quite beneficial with this simple rule set (you can verify that this simple system had far less volatility over the down periods, than the actual market-- in and out of sample) . It is up to you to find those parameters that are worthy of optimizing further. See commentary on A Quantitative Approach to Tactical Asset Allocation for a related example.

As always, please do your own due diligence before making any trading decisions.

And please continue to give your feedback on what you like or don't like and areas you want to explore.
If you are new to Python and would like to order a fantastic textbook, I highly recommend the following (applications geared a bit towards science and engineering): A Primer on Scientific Programming with Python

In addition, users who are interested in learning a bit more about VBA with a Financial Oriented slant will find great practical examples in the text: Financial Modeling, 3rd Edition

Wednesday, February 17, 2010

Genetic Algorithm Systematic Trading Development-- Part 2

We started by discussing the goal of a genetic algorithm, which is to optimally find the candidate pool of rules that are superior to other potential rules. In our example of moving averages, we are seeking the values of parameters of the rule :
if ma(m) binop ma(n) then action.
*Note: binop is short for binary operator; in this case the binary operator is > or <.

The GA (genetic algorithm) works by encoding the rule set into a string of binary valued variables. For instance if we wanted to encode the moving average parameter
to 4 real decimal values, we could simply use a 2 bit string, where 00 = 0 decimal, 01= 1 decimal, 10 = 2 decimal and 11 = 3 decimal. We can encode up to 2^n values per bits contained in each string. If we wanted to encode 512 values, we would need a 9 bit string to encode this value (2^9=512).

Also, we can encode values other than decimal values as binary bits, for instance,
action = buy or sell, can be represented by 0 or 1. Greater or Less than (> or <) can be represented by 1 or 0, as well. In the end we will have a chromosome or total string that represents the rule we are trying to optimize. So the rule: if {ma(m) binop ma(n) then action} could be encoded by binary values, as each chromosome is represented by 4bits- 2bits- 4- bits- 2 bits, where each element of the rule string corresponds to the encoded values discussed above. Note that the encoded blocks would be comparable to genes.

Fig 1. Examples of Boolean Encoding

Once we encode our rule set into a boolen representation or string, we then want to generate a population of strings to select from. Typically, we start out by assigning random values to the parameters. For instance, we may start a population of 100 strings; each string representing a set of rules with different parameters.
One string could be if ma(10)<ma(50) then buy, another might be if ma(20)>ma(200) then sell. Once a population has initially been created, we need to create diversity and additionally successfully improve fitness in the offspring over successive generations.

The concept of fitness is perhaps one of the most elegant and flexible options that makes the GA such a powerful optimizer. In the decision tree learners and Neural Network learners we discussed, there are only one or two simple goals to train on (decision tree for instance trains towards goal of reducing information entropy, neural net trains on reducing fitted variance errors). The GA can use any goal you can imagine, which gives it unlimited flexibility compared to others. You could use
total gain as a goal, or sharpe ratio, or profit factor as goals. You could even combine goals. The fitness or goal is what you are trying to optimize. Keep in mind that a genetic algorithm, like any other learner does not guarantee you will find the absolute best, it may get caught in local maxima of the fitness landscape.
However, you can get more sophisticated and add other sub methods to try to avoid this.

Fig 2. Example of population of rules to be processed.

Once our initial population of parameter based rules has been created (randomly), we then want to think about how we achieve the goal of optimizing or finding the set of rule parameters that best optimizes our fitness. Note that each time we create a new population of offspring, we call this a new generation run or epoch. The first set of offspring or parents, commonly attempts to select some sample of the fittest members to be passed along to the next population. We could use a greedy method that just sorts or ranks the members and selects only from the best 50% to be passed to the next generation (known as truncation selection) or an alternate method is to use something called roulette selection. In roulette selection, we sample members of the original population based upon their normalized fitness. So if the best fitness was 20% of the value of the sum of all the finesses, we would copy over that string or rule with a 20% probability into the next generation. The same would be applied to the other fitness/string combinations. Ultimately, we end up with more of the offspring selected from the most fit candidates in the prior generation. Next, we want to assure some diversity in the offspring. Crossover operation achieves this by crossing over or swapping genes from one candidate and another. This is performed over the entire population to ensure diversity. Lastly, we use mutation to randomly select some number of string elements and flip them. It adds a bit more random diversity to the offspring, so that possibly some candidate may show up unexpectedly that has great performance (think unusual height in basketball for instance).
More bells and whistles can be added to improve performance. Tournament selection is another method that improves offspring selection by running a tournament between string candidates. The winning candidate gets passed along to the next generation.

Fig 3. Selection process.

We essentially run the optimization and diversity routines through each generation, and the best candidates get passed down through to the next generation until our number of generations has run out, or we specify an early stopping criterion.

In the case of our rule set, we expect it to converge to the best set of parameters
(moving average arguments, and binary greater or equal than operator), based upon the fitness goal we assign to it.

Next. Genetic Algorithm Systematic Trading Development-- Part 3

Monday, February 15, 2010

Genetic Algorithm Systematic Trading Development -- Part 1

I want to start with a brief introduction to what I consider one of the most powerful learning methodologies to come out of Artificial Intelligence in the last several decades-- the Genetic Algorithm. Although it was originally developed to model evolutionary biology in the late 50s, most give credit to John Holland for his detailed contributions to the development of the field. A professor of adaptive systems at the University of Michigan, he wrote a text, titled "Adaptation in Natural and Artificial Systems" in 1975, that is considered a landmark book in the field.

Although GAs are designed to borrow from our Genetic Biology, I want to quickly describe why they are powerful with respect to trading systems development. As a trader, you might often develop systems using creative ideas borrowed from Technical Analysis books you may have read. One of the problems with earlier TA books in general, IMO, is that they often have "cherry picked" examples of parameter sets, without much explanation as to how they arrived at the parameters, nor how well they might perform against other parameters. In statistics, we are often interested in gathering many different samples to build a distribution profile of the outcomes as an estimate of the true population of all possible candidate solutions. We often look at these distributions of models to gather a quantitative deduction about whether our particular system (with the parameters we selected) has performed better than any other potential system in the universe of all possible candidate solutions.

If the system performed better than some designated percentage of the sample distribution area of 100% (often set at 1% or 5% in common literature), then we can say that the result compared to the universe of candidates is "statistically significant". Using different parameters for the same set of systematic rules will give different sample outcomes that make up that distribution. For instance, using moving average crossovers, we might end up selecting one pair of moving average values to determine entry and exit with a resulting profit of .1%, while another combination over the same period yielded 2.3%. Ultimately we want to find the set of pairs that performs the best, or at least significantly better than Buy and Hold, otherwise there's typically not much incentive to trade in and out as commission costs and other negative effects make it prohibitive. We could try to run various parameters by guessing or enumerating over the search space of potential solutions, but at a certain point, the number of combinations becomes unwieldy and is not computationally efficient. The first step might be to evaluate the parameters of our system and look for those parameters that yield statistically significant results, the next might be to compare that candidate to buy and hold or other potential system candidates using a t-test of the separate distributions.

Let's take an example of a potential set of rules to illustrate this idea. Suppose we sat down one day and decided upon a rule that said to buy if the m period moving average was greater or less than the n period moving average. First, we need to decide upon what range of values to use for the averages. If we discretize the range of values to integer values, i.e. 1 to 512 steps each, we would have 512X512x2 (where 2 represents greater or less than)= 542,288 different parameters to enumerate through (or try). Although that doesn't seem too large of a number of combinations to try with today's computational power, as we begin to make the rules more complex, the number of combinations will begin to run into the millions. It's just not feasible to try all of them, so we want to find some method to reduce the number of potential candidates, while at the same time finding the best possible results. What we are trying to do is find an 'optimal' algorithm to converge to the best solutions quickly. There are numerous algorithms employed in the field of machine learning, under the category of optimization algorithms that exist to achieve this goal. The genetic algorithm is one such optimization algorithm that borrows directly from our own evolutionary genetic system to find the best potential candidate, without having to literally try out every single possible combination.

Fig1. Example of searching for statistically superior parameters.

Above, we see an example distribution of possible candidate solutions in the population of potential parameter pairs with the x-axis representing binned ranges of potential gain for the system, and y representing the frequency of parameter pair outcomes corresponding to that gain. Our Genetic Algorithm will help us to find those solutions that are statistically significant compared to potential solutions.

Next: Genetic Algorithm Systematic Trading Development -- Part 2

Thursday, February 11, 2010

Artificial Immune Systems and Financial Applications?

One of the buzzwords that seems to be common these days is AIS or Artificial Immune Systems. It is a biologically inspired classification type system that essentially tries to replicate some of our own natural immune system algorithms. Our bodies have various defense mechanisms for recognizing foreign invaders. One such defense mechanism is utilizing pathogen-associated molecular patterns built into our genome so that we can identify foreign pathogens and respond to them. This concept of being able to distinguish between self and non-self is essentially one of the main themes that is borrowed from our natural biological defense.

The idea is that we can learn to recognize objects that are not normal and respond to them. A good example where this has been put to use is in SPAM detection. We want to be able to recognize good (known) mail from the (unknown) bad, and utilize AIS to avoid parasitic SPAM.

The most common AIS system that has been adopted is the Negative Selection Algorithm. It essentially works by training a classifier to generate negative examples randomly and if the negative example (imagine a coordinate on a 2D grid) happens to be located (using some distance metric, such as euclidean or mahanalobis distance) near enough to a known good example, then it is rejected. Essentially, we are randomly generating negative examples for the classifier to identify.

Fig 1. Basic concept of AIS type classification

Although it has been used in some cases for applications like credit detection, one of the things you'll notice in AI based algorithms is that there are many other algorithms that might do the same job or better. In the case of trading systems, we might want to identify abnormal time series behavior and make a decision based upon this information, however, it is might be simpler to use statistical control methods to better ascertain the information.

In conclusion, although the idea sounds promising, I haven't seen many practical superior examples that would benefit a trading system that cannot be employed by existing AI algorithms.

artificial immmune systems

There is a wealth of good reading in the linked site, as well as a Weka plugin that can be used to access this algorithm for readers following the Weka tutorials.

Wednesday, February 10, 2010

Using J48 Decision Tree Classifier to Dynamically Allocate Next Day Position in Stocks or Bonds

The prior introduction using a simple model to determine next weeks change based on the S&P 500 index and VIX did not look very promising, although hopefully it served to familiarize yourself with how classification is used in augmenting trading decisions. Wouldn't it be nice if we had something that performed a little better?

Well, let's look at an application of using a decision tree type classification in order to predict whether to invest in stocks or bonds one day ahead of time.
We will use a very simple input model stimulus in order to arrive at a decision.
The following will be used as input attributes.
1) VIX 1 day change
2) TLT 1 day change
3) SPY 1 day change
4) VIX 5 day momentum
5) TLT 5 day momentum
6) SPY 5 day momentum

The VIX is used as a volatility proxy to measure fear, which leads (presumably) to flights to safer instruments (bonds).

The TLT is the iShares Barclays 20+ Year Treas Bond ETF used to track treasury bonds with an average duration of 20 years.

The SPY is an ETF that tracks the general market index: S&P500.

The remaining 5 day momentum attributes are simply nominal attributes of UP or DN used to generally ascertain the momentum of the index over the last 5 days. In addition to the input attributes, we append one output attribute which is the superior instrument to invest in the following day-- SPY or TLT (stocks or bonds). This is what we are trying to predict and decide upon. The training and testing data sample is from the period 7/31/2002 up until present.

By entering the information into Weka (via .csv, see prior tutorials), we will choose the J.48 decision tree learner and use 90%/10% training/test split in order to develop a model tree that will predict which class of instrument to invest in based upon the prior days input stimulus.

Fig 1. Resulting Model Decision Tree

The decision tree can be read from the top down as making a decision based upon certain conditions. I.e. If we traverse the far left branch for example, it would give us the following rule:
IF 5 day SPY momentum is DN and 1 day TLT change is <=.91% and 5 day TLT momentum is UP and 5 day VIX momentum is UP then
invest the next day in SPY.

We can traverse each branch similarly to obtain an all encompassing set of rules to make a decision on what to invest in the following day.
Although the tree looks a bit daunting, if you can program the rule set into your favorite language, it is a simple matter for the algorithm to take that model and process it forward.

Finally, we want to see if the prediction scheme was any better or worse than guessing.

Fig 2. 90/10 split train/validation results of J.48 Model Tree

The results are pretty good. Using a very simple and intuitive model, we were able to select the better instrument to buy with a 59% success rate on the 10% out of sample validation set. The same type of methodology can be used to select between trading systems with a little ingenuity.

Fig 3. Equity Curve comparison of Learner System to investment classes on out of sample data

Finally, we take a look at the equity curve of investing in
1) The results of the classifier system we modeled
2) Investing in SPY or TLT alone (Stocks or Bonds)
3) Investing half in each

Notice the terminal wealth results from our system only slightly beat
all of the other systems. It's a good example of how you might have a good hit rate and only moderate improvement in net results, since hit rate does not account for magnitude. In addition, the costs associated with commission and slippage from trading many times in an out would likely overcome the systematic edge. Later on as we discuss Genetic Algorithms, we will see there are many other ways to optimize.

As always, please do your own due diligence and thoroughly verify any results you may use to make decisions in your own trading.