Many assets exhibit bull or bear trends that persist for long periods of time. This presents an interesting problem for anyone trying to predict the future return of an asset: a lack of diversity in your training set. In the machine learning field this problem is known as unbalanced classes, or class imbalance.
The basic issue is that many classification methods work best when your training data is split roughly evenly across the classes. However, bull and bear markets produce historical training sets in which one class dominates. For example, in a bull market our historical dataset will contain a majority of time periods in which a stock moves up. Likewise, bear markets produce a majority of down periods.
Why do unbalanced classes mess up machine learning models? Because you can’t know the darkness without knowing the light. Consider the extreme case where your training set is 100% one class, e.g. all up days/weeks/months. A trivial strategy of buying all the time is then our best bet on that data, and any machine learning algo would be hard pressed to beat its accuracy. Worse, the model has never seen a single example of what a down period looks like.
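To make the point concrete, here is a minimal sketch of that trivial "always predict the majority" baseline. The 70/30 label history is made up purely for illustration, and `majority_baseline` is a hypothetical helper name, not something from a library.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical label history: 1 = up period, 0 = down period.
bull_market = [1] * 70 + [0] * 30
label, accuracy = majority_baseline(bull_market)
print(label, accuracy)  # 1 0.7
```

Any classifier trained on this history has to clear 70% accuracy before it is doing better than blindly buying, which is why raw accuracy is a misleading yardstick on unbalanced data.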
How do we combat this problem? Admittedly, I grapple with it myself on a daily basis. Like many posts on this blog, this one was initially motivated by my own search for a satisfactory solution. A popular solution seems to be up-sampling rare classes and down-sampling ubiquitous classes until a 50/50 balance is achieved. That seems to require some acrobatics when you try to transfer your model back to the real world, though, as you’ve significantly altered the training set from what your autonomous trading minion will encounter in real life. Restricting yourself to one-class learning is another method you see in the literature, and one I’ve resorted to in practice. One-class classification can also be useful for quantifying normal vs. anomalous behavior.
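For reference, random up- and down-sampling can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: `rebalance` is a hypothetical helper, the samples are stand-in integers, and rows are drawn uniformly at random, which ignores any time ordering in the data.

```python
import random
from collections import Counter

def rebalance(samples, labels, mode="up", seed=0):
    """Resample so that every class ends up with the same number of rows.

    mode="up":   grow rare classes by drawing extra rows with replacement.
    mode="down": shrink common classes by keeping a random subset.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    sizes = [len(rows) for rows in by_class.values()]
    target = max(sizes) if mode == "up" else min(sizes)
    out = []
    for y, rows in by_class.items():
        if len(rows) >= target:
            picked = rng.sample(rows, target)  # subset without replacement
        else:
            picked = rows + rng.choices(rows, k=target - len(rows))
        out.extend((x, y) for x in picked)
    rng.shuffle(out)
    return out

# Hypothetical data: 70 up periods (1) and 30 down periods (0).
samples = list(range(100))
labels = [1] * 70 + [0] * 30
up = rebalance(samples, labels, mode="up")
down = rebalance(samples, labels, mode="down")
print(sorted(Counter(y for _, y in up).items()))    # [(0, 70), (1, 70)]
print(sorted(Counter(y for _, y in down).items()))  # [(0, 30), (1, 30)]
```

If you go this route, it probably makes sense to resample only the training split and leave the evaluation data at its natural class mix, so your test results still reflect what the model will see live.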
Another interesting method from this source is to apply a higher confidence threshold to the majority class. So if your training set is mostly up days, only predict further up movement when your classifier is at least 75% sure it’s an up day, rather than using the usual 50% threshold. The authors also make the good point that this can be especially useful when the costs of misclassification differ between classes. In trading I would venture that this is probably the case: a wrong up prediction could be worse than a wrong down prediction. This might be due to the stylized fact that down moves in the stock market are quicker and more severe than up moves. Wrongly predicting a single down day means you miss out on a small increase in price. Wrongly predicting a single up day could wipe out weeks/months of profit. Thus, applying a higher “hurdle” to your up predictions could make sense in a bull market, especially in its later stages.
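Here is a rough sketch of the hurdle idea, assuming your classifier outputs a reasonably calibrated probability of an up move. The function names are hypothetical, and falling back to the minority class when the hurdle is not cleared is my own choice (staying flat would be another). The second helper uses the standard cost-sensitive decision rule: if a wrong up call costs `cost_wrong_up`, a wrong down call costs `cost_wrong_down`, and correct calls cost nothing, then predicting up minimizes expected cost only when P(up) exceeds `cost_wrong_up / (cost_wrong_up + cost_wrong_down)`.

```python
def predict_with_hurdle(p_up, majority="up", hurdle=0.75):
    """Turn P(up) into a label, demanding extra confidence for the majority class."""
    if majority == "up":
        return "up" if p_up >= hurdle else "down"
    # Majority is "down": P(down) = 1 - p_up must clear the hurdle instead.
    return "down" if (1.0 - p_up) >= hurdle else "up"

def cost_threshold(cost_wrong_up, cost_wrong_down):
    """P(up) above which predicting up has the lower expected cost."""
    return cost_wrong_up / (cost_wrong_up + cost_wrong_down)

print(predict_with_hurdle(0.60))  # down
print(predict_with_hurdle(0.80))  # up

# Hypothetical costs: a wrong up call hurts three times as much as a wrong down call.
print(cost_threshold(3, 1))  # 0.75
```

Seen this way, the 75% hurdle is what you would get if you believed a bad up prediction was three times as costly as a bad down prediction. The actual ratio is something you would have to estimate from your own strategy.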
Here are some other papers I’m studying at the moment. Realizing now that we have it easy in finance compared to some unbalanced datasets…
A Survey of Recent Trends in One Class Classification [uwaterloo.ca]
Man vs. Computer: Benchmarking Machine Learning Algorithms for Traffic Sign Recognition [ini.ruhr-uni-bochum.de]