Last week we presented some evidence that the total daily trading volume for SPY could be predicted from the first minute’s trading volume. We accomplished this using an Archimedean copula, a mathematical construct for modeling multivariate data. Copulas can be hard to interpret intuitively, so here we present an alternative way to estimate trading volume using k-means and logistic regression.
We were initially motivated to investigate trading volume when we noticed that high volume days tend to be associated with higher volatility. Looking deeper, we saw a relationship between volume and price: high volume days are associated with falling prices and low volume days with rising prices.
It would be great if we could classify a trading day as high or low volume. Once we have each day sorted into a class, we could use logistic regression to predict whether today is likely to be high or low volume based on the first minute of trading.
An easy way to cluster a 2-dimensional dataset like this is k-means clustering. With two clusters, the algorithm divides our data cleanly into two groups, which defines high volume as anything over 100,000,000 shares in a day.
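from scipy.cluster.vq import kmeans, vq

# df holds the 2-dimensional volume dataset, one row per trading day
data = df.values
# Fit two centroids; d1 is the mean distance from each point to its nearest centroid
centroids, d1 = kmeans(data, 2)
# Label each day with the index of its nearest centroid; d2 holds the per-day distances
hilo, d2 = vq(data, centroids)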
The array hilo contains the class for each trading day. In our run, high volume is denoted by 0 and low by 1; k-means numbers its clusters arbitrarily, so check the centroids to confirm which label is which. We can use this binary data to fit a logistic regression which shows the probability of today being a low volume day given the first minute’s trading volume.
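For readers who want to try the regression step themselves, here is a minimal sketch of a one-variable logistic fit. It is not the code used for this post: the solver is a hand-rolled Newton iteration in NumPy, the names fit_logistic, first_minute and is_low are hypothetical, and the data is synthetic, generated only so that the example runs on its own.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton's method."""
    X = np.column_stack([np.ones(len(x)), x])    # intercept column plus the feature
    beta = np.zeros(2)                           # [b0, b1], starting from zero
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))    # current predicted probabilities
        grad = X.T @ (y - p)                     # gradient of the log-likelihood
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)    # Newton update
    return beta

# Synthetic stand-in data (NOT the SPY data from this post).
# x plays the role of first-minute volume in millions of shares,
# y plays the role of hilo with 1 = low volume day.
rng = np.random.default_rng(0)
first_minute = rng.uniform(0.5, 2.0, size=500)
true_p_low = 1.0 / (1.0 + np.exp(-(11.0 - 10.0 * first_minute)))
is_low = (rng.random(500) < true_p_low).astype(float)

b0, b1 = fit_logistic(first_minute, is_low)
print(f"intercept={b0:.2f}, slope={b1:.2f}")
```

Expressing volume in millions of shares rather than raw share counts keeps the Newton steps well conditioned. With real data, x would be each day's first-minute volume and y the hilo labels from the clustering step, after confirming that 1 really marks the low volume cluster.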
This accomplishes a similar objective to our first article but with the benefit of being easier to interpret intuitively. The breakeven point, where a low volume day and a high volume day are equally likely, occurs at around 1.1 million shares in the first minute, and the probability of a low volume day falls rapidly beyond that. If more than 1.25 million shares of SPY trade in the first minute, we have a high probability of witnessing a high volume day.
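The breakeven point can be read straight off the fitted coefficients. If the model is P(low) = 1 / (1 + exp(-(b0 + b1 * x))), then P(low) = 0.5 exactly where b0 + b1 * x = 0, that is at x = -b0 / b1. The sketch below shows the arithmetic. The coefficient values are hypothetical, picked only so that the breakeven lands at 1.1 (in millions of shares); they are not the fitted values behind this post, and the function names are ours.

```python
import math

def prob_low_volume(b0, b1, x):
    """Logistic model: probability of a low volume day at first-minute volume x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def volume_at_probability(b0, b1, p=0.5):
    """Invert the model: the first-minute volume at which P(low volume) equals p."""
    logit = math.log(p / (1.0 - p))   # log-odds; zero when p = 0.5
    return (logit - b0) / b1

# Hypothetical coefficients, with volume measured in millions of shares
b0, b1 = 11.0, -10.0

breakeven = volume_at_probability(b0, b1)       # equals -b0 / b1
p_at_125 = prob_low_volume(b0, b1, 1.25)        # P(low) at 1.25 million shares
print(breakeven, p_at_125)
```

The same inverse works for any other probability level, for example the first-minute volume at which a low volume day becomes only 10% likely.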
In the future we plan to publish further investigation into how this relationship evolves over the day and how we can put this information to work as either a stand-alone strategy or part of an ensemble indicator.
Check out a good tutorial from JustGlowing on k-means with Python.