Book Release: Intro to Social Data for Traders

To mark the release of our first book, Intro to Social Data for Traders, we wanted to provide a small preview of the introductory chapter. This book is intended for traders and investors who want to gain an edge on the competition by incorporating social data into their study of the market. The book assumes you are new to data manipulation in Python, one of the most popular languages for analytics:

Orwell was three decades early

In Nineteen Eighty-Four, George Orwell envisioned a dystopian world in which every aspect of an individual’s life was recorded for the purpose of command and control. It should have been called Two-Thousand Fourteen. In the real 1984, you could buy a newspaper and only the clerk would know it. You could read a book and, unless you announced it to someone verbally, be the only person who knew it.

Today you can buy a subscription to an online newspaper and have your name, address, and many bits of sensitive information put into a database and “monetized,” or sold to the highest bidder. Every time you read an article on your tablet or phone your identity and habits are broadcast to an increasingly diverse group of entities. An entire industry has evolved to predict that you are more likely to buy a plane ticket because you just searched for “Fiji” 5 times and visited some travel sites in the past 3 weeks.

Correlations are popping up in strange places: Alibaba discovered recently that heavier customers spent more money. How did they find out? The bigger a customer’s bra size (based on past purchases) the more they spent on average. This is just the tip of the iceberg; most usage data isn’t properly monetized. Some data gets thrown away after a few months. This won’t be the case for long: money is being mobilized to take advantage of the data deluge. One of the world’s biggest private equity funds is even forming an internal data group to monetize its portfolio companies’ customer data streams.

Luckily our current world is more open than Orwell’s vision. Data is not just available to marketers and governments. The open nature of the internet means that many of these sources of data are available for little to no cost. The data presented in this introduction, for instance, was collected from free resources that can be accessed both manually (via web browser) and programmatically (via Python). This allows a great deal of speed and flexibility in developing trading strategies from social data, as you can rapidly explore new ideas without spending a lot of time or money in the process.

Traders and investors have always endeavored to incorporate the latest and most accurate information into their analyses. Incorporating social data is the natural evolution of the desire to have access to the best information. This book will teach you quick and easy ways to access these new data streams. I will now present several quick examples showing the utility of social data and then describe the structure of the rest of the book.

A new Fear Index

The week of September 14th 2008 was anything but ordinary. A small weekly decline of 1.5% in the S&P 500 belied much deeper problems. This had been a wild week on the stock market: on Monday September 15th Lehman Brothers, a global investment bank founded in 1850, went belly-up in spectacular fashion. Wall Street as a whole had overinvested in sub-prime mortgage debt, and the Federal Reserve had stepped in to support markets. Only six months earlier another investment bank, Bear Stearns, had required a coordinated bailout between the public and private sectors. In that context, the small decline in the stock market may have seemed appropriate.

Beneath the surface, however, serious concerns were percolating within the collective consciousness of the market. Compared to the week before, Google search volumes for the term “stock market” were over 3 times higher. This reflected a deep uncertainty about the future trajectory of the stock market. Individual traders and investors were concerned about how to interpret the news that another firm had gone bankrupt, and they sought information from the internet to inform their decision-making process for the week ahead.

They were right to be concerned. Over the next three weeks the S&P 500 declined almost 30%. During that time people continued to seek out information from the internet for guidance on how they should respond to the market turmoil. As a result, the volume of searches for the term “stock market” peaked at over five times what it had been before Lehman Brothers’ bankruptcy. Instead of turning to professional advisors or colleagues, individual investors were asking Google for answers, and you could watch search volumes spike with each gyration in the stock market.

Then the rush of searches began to fade just as quickly as it had begun: by the time the stock market was making new lows on November 20th, search volume had fallen over 70% from its peak. Fear was fading, but if you had only looked at stock prices you would have thought that the market was in collective panic. A month later, the S&P 500 was 20% higher. Search volumes declined before prices and volatility began to stabilize. Just as the rise of searches was a reflection of an underlying panic, the decline represented an easing of fear. It represented an opportunity to profit as others panicked.

Unlike other market anomalies, the correlation of volatility with search volume hasn’t been eroded by the passage of time. In 2014 alone, the search volume for the term “stock market” was 75% correlated with the Chicago Board Options Exchange (CBOE) S&P 500 Volatility Index (VIX) – the traditional “Fear Index”. Moreover, since 2008 changes in search volumes have presaged changes in the VIX with startling regularity, as the following figure makes clear.

Learn to correlate search volumes with market movements in MKTSTK's eBook "Hacking Social Media for Trading" (see free signup below)
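
The lead-lag relationship described above can be checked with a few lines of pandas. The sketch below is only an illustration, not the book's code: the function name lead_lag_correlation, the one-week lag, and every number in it are made up for demonstration, and the toy VIX series is deliberately constructed to follow the search series by one week. With real inputs you would pass weekly search volumes and weekly VIX closes instead.

```python
import pandas as pd


def lead_lag_correlation(search: pd.Series, vix: pd.Series, lag_weeks: int = 1) -> float:
    """Correlate changes in search volume with changes in the VIX `lag_weeks` later."""
    search_chg = search.pct_change()
    vix_chg = vix.pct_change()
    # Shift the search changes forward in time so that the change observed in
    # week t - lag_weeks sits on the same row as the VIX change in week t.
    aligned = pd.concat(
        {"search": search_chg.shift(lag_weeks), "vix": vix_chg}, axis=1
    ).dropna()
    return float(aligned["search"].corr(aligned["vix"]))


# Illustrative numbers only (NOT real Google Trends or VIX data).
weeks = pd.date_range("2014-01-04", periods=8, freq="W-SAT")
search = pd.Series([50.0, 55.0, 80.0, 60.0, 58.0, 90.0, 70.0, 65.0], index=weeks)

# Build a toy VIX whose weekly change is half of last week's search change,
# so that, by construction, searches lead the VIX by one week.
lagged_chg = search.pct_change().shift(1).fillna(0.0)
vix = 20.0 * (1.0 + 0.5 * lagged_chg).cumprod()

same_week = lead_lag_correlation(search, vix, lag_weeks=0)
next_week = lead_lag_correlation(search, vix, lag_weeks=1)
print(f"same-week correlation: {same_week:.2f}")
print(f"one-week-lead correlation: {next_week:.2f}")
```

Comparing the two numbers is the point of the exercise: if search volume merely moved together with the VIX, the same-week figure would be the larger one, whereas a leading indicator shows up as a stronger correlation at a positive lag.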

Advertisers and businesses were the first to realize the value in using social media data to shape their strategies. Government spooks might have been next on board. Financial markets have been slower to adapt. After reading this book you will join the small minority of investors and traders who possess the know-how to access these new data sources. Failing to incorporate all the relevant data into your analysis leaves you open to being blindsided by an unexpected turn of events. Looking at social data transforms that risk into an opportunity.

Monitoring the Rise of Infoterrorism

On April 23rd 2013 hackers gained control of the Associated Press Twitter account and broadcast the following message to tens of thousands of followers:


Within minutes the S&P 500 index had declined by 1%, erasing about $136 billion in market value. Prior to the tweet, hackers had made repeated attempts to steal the passwords of AP reporters. This time they hit the jackpot. The AP alerted Twitter and the account was suspended. The AP opened another account and stated that the @AP announcement was bogus.

Markets quickly recovered, even finishing higher on the day. Anyone who was aware that the tweet was a hoax stood to make a quick profit as markets reversed their initial decline. Those who sold based on false information had no recourse but to buy their positions back at higher prices. However, markets had just caught their first glimpse of a dangerous new form of terrorism: one that manipulates and disseminates false information. This new breed of infoterrorists increasingly utilizes social media as a transmission vector for their false data. Watching every source of information manually is inefficient, if not impossible, which is why we must employ automated tools to monitor the social data stream.

Luckily for us, Twitter and other social networks provide programmatic access to their public data feeds, which allows us to automate much of the process of watching. This book shows you how to access the stream of tweets programmatically. This information will allow you to build a computer program to watch the tweet stream for you.
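
To make the idea of an automated watcher concrete, here is a minimal sketch of the filtering step. It makes no calls to any real Twitter API: the messages argument stands in for whatever client delivers the stream, and the field names ("user", "text"), the keyword list, and the sample messages are all hypothetical, invented for this illustration.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical watch list; a real one would be tuned to your own positions.
ALERT_KEYWORDS = ["explosion", "bankruptcy", "halted"]


def watch_stream(
    messages: Iterable[Dict[str, str]], keywords: List[str]
) -> Iterator[Dict[str, object]]:
    """Yield an alert for every message whose text contains a watched keyword."""
    lowered = [k.lower() for k in keywords]
    for msg in messages:
        text = msg["text"].lower()
        hits = [k for k in lowered if k in text]
        if hits:
            yield {"user": msg["user"], "text": msg["text"], "matched": hits}


# Invented sample messages standing in for a live feed.
sample_messages = [
    {"user": "trader_one", "text": "Quiet open this morning"},
    {"user": "newswire_example", "text": "Trading HALTED in XYZ after unusual activity"},
    {"user": "trader_two", "text": "Earnings season starts next week"},
]

alerts = list(watch_stream(sample_messages, ALERT_KEYWORDS))
for alert in alerts:
    print(alert["user"], "->", alert["matched"])
```

Because watch_stream is a generator over any iterable, the same function works on a saved list of messages for backtesting and on a live feed that never ends. Note that a keyword match says nothing about whether a message is true; as the AP episode shows, an alert is a prompt to verify, not a signal to trade.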

Who is this Book’s Intended Audience?

This book is intended for traders and investors who want to gain an edge on their competition by incorporating social datasets and search analytics into their study of the market. While the “Setting up your hacking environment” chapter contains an intro to Python, this is not an intro to programming per se, and it would be ideal if you had some experience with a programming language. We assume you know basic programming concepts like loops, functions, variables and types.

However, I am assuming you are new to data manipulation in Python, and I have written a focused introduction to get you up and running as quickly as possible (relatively speaking; learning new things is always a challenge). The majority of the book’s examples are written in Python, one of the most popular languages for data manipulation. This book introduces you to the capabilities of the Python analytics stack including NumPy, Pandas, PyPlot and more. I also included two examples written in R, another popular open-source language for data analysis.

I have attempted to cover a large breadth of topics, providing you with the seeds of knowledge necessary to access social data sources. Coding examples will teach you how to build the interfaces between your computer and these novel datasets. You will learn how to extract historical search volumes and programmatically access the Twitter stream, just to name a few examples.

Given the breadth of topics covered, do not expect each chapter to be encyclopedic regarding every aspect of a social data source. Instead we are looking to cater to traders who want the highest level of abstraction. In other words, this is a book written for traders by a trader: you will learn efficiently because your time is a non-renewable resource.

To benefit fully from the knowledge this book has to offer you must be interested in the infinite possibilities that arise from including social datasets in your study of the market. This book will not provide cookie-cutter trading strategies. You must be willing to learn new topics, to download a myriad of open-source software to your computer and possibly learn a bit about system administration. While this book assumes very little programming knowledge, it is hard to write for every demographic, so you should be willing to supplement the information in this book with external programming tutorials and web searches.

After you learn the basics in this book, it is my hope that you will be interested in extending the functions provided to suit your own individual needs. It is likely that after you master the examples below, you will have more questions than ever. While I have written some examples demonstrating the use of Natural Language Processing (NLP), this is not an in-depth guide to NLP in Python. These guides are not hard to find on the internet and for a list of recommended books on NLP check out:

For specific coding questions and errors, StackExchange is an invaluable resource, as many problems you encounter will already have been addressed by users of this site. Also, before coding something big from scratch, it’s always worth searching Google for a Python package that could save you days of coding. We make use of many existing modules to make our lives easier, and you should too.

Book Structure

Social data comes from an increasing number of sources. This book will give you the tools necessary to access these valuable datasets. In the first section, we introduce the reader to the various data sources available:

  • Google Trends: a database of historical search volume for a variety of search terms going back to 2005
  • Estimize: a platform for crowd-sourcing financial predictions such as the next non-farm payrolls number. Users can also submit earnings predictions for individual stocks. In a coding example we show evidence that Estimize’s crowdsourced estimates are generally more accurate than Wall Street forecasts.
  • StockTwits: a social network for sharing trading ideas in stocks and futures. Its message structure is similar to Twitter’s in many ways. Users make references to financial instruments via “cash-tags”, which are similar to Twitter hashtags. StockTwits provides public feeds for individual stocks, allowing you to monitor the feed programmatically.
  • Twitter: short message-based social network used by multiple types of information sources. As we saw before, news organizations use Twitter to broadcast breaking news to a large audience. Additionally, Twitter has become a conduit for monitoring wars and protest movements in real-time. Even politicians use the network to communicate with the public.
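
As a small taste of what working with these feeds looks like, the sketch below pulls cash-tags out of a message with a regular expression. It assumes the common "$SYMBOL" convention (a dollar sign followed by letters, optionally with one "." or "_" suffix, as in futures symbols); the pattern, the function name extract_cashtags, and the sample message are illustrative assumptions, not code from the book or a rule published by StockTwits.

```python
import re
from typing import List

# Assumed cash-tag shape: "$" + letters, optionally one "." or "_" plus more letters.
# The lookbehind avoids matching a "$" glued to the end of another word.
CASHTAG_RE = re.compile(r"(?<!\w)\$([A-Za-z]+(?:[._][A-Za-z]+)?)")


def extract_cashtags(message: str) -> List[str]:
    """Return the unique cash-tag symbols in a message, upper-cased, in order of appearance."""
    tags = [match.upper() for match in CASHTAG_RE.findall(message)]
    return list(dict.fromkeys(tags))  # drop duplicates while keeping order


# Invented message for demonstration.
sample = "Long $TSLA into earnings, hedging with $ES_F. Paid $5.00 in fees, still like $tsla"
tags = extract_cashtags(sample)
print(tags)
```

Dollar amounts such as "$5.00" are skipped because the pattern requires a letter straight after the dollar sign, and the repeated lower-case "$tsla" collapses into the first "TSLA" entry.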

The next chapter provides a guide for setting up your coding environment. This teaches you how to install Python and R as well as the individual coding packages used in the book’s examples.

In the last section, we present a series of chapters containing the code to get you started collecting and analyzing these new sources of data. These applications include:

To read the rest of the book, you can buy it now on Amazon: 

4 replies »

  1. My initial question in reading the following would be: how can any of this be anything but a trailing indicator that is already reflected in the price? What’s the value of knowing that people are talking about x, since they likely are because something has happened?

    I would love to see value in this approach but I’m still not convinced it’s not all noise.

    • The important thing about the weekly search data we present here is that we are comparing last week’s search volume to variables like volume, volatility, etc. this week. Perhaps we should have made the point a little clearer. Google Trends returns weekly data by default (time-stamped on a Saturday, i.e. the average search volume for Monday through Saturday is reported on Saturday), so we wanted to be conservative in our reporting. When we merge the time series for price data and search data, we use the value reported on Saturday to correlate with price data for the next trading week. Thus our results use search data to forecast trading statistics; they are not just coincident correlations.

      For instance, our posts about trading TSLA with Google Trends and correlating search volume with stock prices use this methodology.


  2. This is a GREAT book!! Well worth the money. It contains exceptional domain knowledge that will save you hours of searching the internet looking for code. Even if you only know a tiny bit of Python, the examples are very easy to replicate.

    A few minor things to note when using the book:

    * Some of the indentations (vital in Python) are hard to follow in the two-column layout, so be careful when typing. Cutting and pasting may not work. So the book is actually helpful in that it forces you to think through the code!

    * The StockTwits API no longer offers the ['user']['follower'] key-value pair. Follower info is now part of the partner API, so that code doesn’t work.

    Still — a fantastic text! You can have a very nice trading machine going in no time! Thank you MKTSTK for sharing this knowledge and the code!

