Suppose we have an unstructured block of text. We want to reduce the information contained in the raw words and phrases into a single score indicating whether the text is positive or negative.
For example, suppose we have a tweet and we want to determine whether it is bullish, bearish, or neutral. Semantic Orientation (SO) is the use of natural language processing (NLP) techniques to programmatically assign a value to an arbitrary chunk of text that shows how it is related to two opposite reference terms such as bullish/bearish or dovish/hawkish.
One method for assigning SO to a phrase is called Pointwise Mutual Information and Information Retrieval (PMI-IR). This was first proposed by Peter D. Turney in "Mining the Web for Synonyms" in 2002, and it can be applied to a range of other NLP tasks as well. The basic formula for assigning SO to a word is:
SO( words, pos, neg ) = ln ( A ( words, pos, neg ) * B ( pos, neg ) )
A ( words, pos, neg ) = number_of_results ( words NEAR pos ) / number_of_results ( words NEAR neg )
B ( pos, neg ) = number_of_results ( neg ) / number_of_results ( pos )
- words is any unstructured text string
- pos is the positive pole, e.g. Bullish
- neg is the negative pole, e.g. Bearish
- x NEAR y is an operator which returns true if x is within a certain number of words of y
- number_of_results is a function that returns the number of documents where its argument is true
This function varies between positive and negative values: the more positive the SO, the stronger the text's orientation towards the positive pole, and the more negative the SO, the stronger its orientation towards the negative pole.
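Given the four hit counts, the formula itself is a one-liner. Here is a minimal sketch (the function and parameter names are illustrative, not from the original article), assuming the counts have already been retrieved:

```python
import math

def semantic_orientation(hits_near_pos, hits_near_neg, hits_pos, hits_neg):
    """SO = ln(A * B), where A = hits(words NEAR pos) / hits(words NEAR neg)
    and B = hits(neg) / hits(pos)."""
    a = hits_near_pos / hits_near_neg
    b = hits_neg / hits_pos
    return math.log(a * b)
```

For example, if a phrase co-occurs with the positive pole twice as often as with the negative pole, and the poles themselves are equally common, the score is ln(2) ≈ 0.69.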
The problem with PMI-IR is that you need a large database of text in order to calculate the number of results for each query. One option would be to roll your own, so to speak. You could download a large number of tweets, news articles, and blog posts, and then calculate the number of results from this database. As this takes quite a bit of work (or money — you could buy a corpus from Reuters or some other provider), it would be nice if there were a quick and dirty option.
Luckily, the solution comes from a 2011 paper from David Lucca of the Federal Reserve Bank of New York, where he proposed using Google searches to calculate the number of hits. To do this, we make use of an undocumented search operator, AROUND(…), which takes the place of the NEAR operator above. For example “AAPL” AROUND(10) “bullish” returns all results where AAPL is within 10 words of bullish.
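Assembling these query strings is simple enough to sketch directly. The quoting and AROUND(...) syntax below follow the example in the text; the function name is illustrative:

```python
def around_query(words, pole, distance=10):
    """Build a Google query matching pages where `words` appears
    within `distance` words of `pole`."""
    return '"%s" AROUND(%d) "%s"' % (words, distance, pole)
```

So `around_query("AAPL", "bullish")` produces the query `"AAPL" AROUND(10) "bullish"` from the example above.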
In this case we run into an API issue: while there are Google search APIs in existence, none of them return a proper number of hits for a given search term. For the purposes of computing the Google Semantic Orientation scores (GSOs), we aren't interested in the search results themselves, just the number of results that appears at the top of every search page (e.g. "About 935,000,000 results"). We could compute the GSOs manually, by performing repeated searches and writing down the results. But that would be slow and, more importantly, riddled with data entry errors. This holds back research in the area, but we can jerry-rig a solution with some simple tools in Python.
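Extracting the hit count from a results page comes down to matching the "About N results" string mentioned above. Here is a hedged sketch; note that Google's page markup changes often, so this exact pattern is an assumption rather than a stable interface:

```python
import re

def parse_hit_count(html):
    """Extract the hit count from Google's "About N results" line.
    Returns None if the pattern is not found (Google's markup is
    not stable, so callers should handle that case)."""
    match = re.search(r'About ([\d,]+) results', html)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))
```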
We will construct a new class to compute GSOs. First we need to make sure we import the required packages…
Want to get the full code? This is a preview from our new book Intro to Social Data for Traders, which contains the code for the GSO class and much more, including how to correlate Google search data with financial markets, how to monitor the StockTwits and Twitter streams, and how to use new financial prediction networks like Estimize.
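For readers who want to experiment before picking up the book, here is a minimal sketch of what such a class might look like. It matches the interface used in the examples below (GSO(up, dn, ref_text) and a .SO() method) but it is not the book's implementation, and scraping Google this way is fragile: requests may be rate-limited or blocked, and the page markup may change:

```python
import math
import re
import urllib.parse
import urllib.request

class GSO(object):
    """Illustrative Google Semantic Orientation scorer (not the book's code)."""

    def __init__(self, up, dn, ref_text=""):
        self.up = up            # positive pole, e.g. "bullish"
        self.dn = dn            # negative pole, e.g. "bearish"
        self.ref_text = ref_text  # extra term to filter unrelated results

    def _hits(self, query):
        """Fetch a results page and parse the 'About N results' count.
        Assumes Google's current markup; returns 0 if no count is found."""
        if self.ref_text:
            query = '%s "%s"' % (query, self.ref_text)
        url = "https://www.google.com/search?q=" + urllib.parse.quote(query)
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        html = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
        m = re.search(r"About ([\d,]+) results", html)
        return int(m.group(1).replace(",", "")) if m else 0

    def SO(self, words, distance=10):
        """SO = ln(A * B), per the formula above."""
        a = self._hits('"%s" AROUND(%d) "%s"' % (words, distance, self.up)) \
            / self._hits('"%s" AROUND(%d) "%s"' % (words, distance, self.dn))
        b = self._hits('"%s"' % self.dn) / self._hits('"%s"' % self.up)
        return math.log(a * b)
```

In practice you would also want to cache hit counts and sleep between requests, since each SO() call issues four searches.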
…Now we can create an instance of the GSO class and estimate the semantic orientation of some different words. To get started, save the code for the class in a file called "gso.py". Now open a Python shell in the same directory and type:
from gso import *
This will compile your class into a .pyc file. Alternatively you could type execfile("gso.py"), which just runs the code within gso.py, exactly as if you had copied and pasted it into the Python shell. Either way, now you can create a GSO instance with:
stocks = GSO(up="bullish", dn="bearish", ref_text="stock market")
In this case we have defined an instance with "bullish" and "bearish" as its opposite poles. We have given the ref_text variable a value of "stock market" in order to filter out search results that don't relate to the stock market.
Now we can compute some GSOs for various stock symbols as our target search terms:
stocks.SO("RUSS") # output: -1.41
stocks.SO("TSLA") # output: -1.75
You can also make the target search term an arbitrary block of text, such as:
stocks.SO("considerable time") # output: 0.43