Recently we were presented with the opportunity to explore a big dataset in the Social Finance space. The dataset was substantial: an order of magnitude larger than the core memory of our computers. As a result, we couldn’t employ many of the usual methods for working with Python’s Pandas on data that fits in memory. In fact, even if your data fits in core memory, if it is big enough you will still run into trouble using the traditional indexing and storage tools that Pandas provides.
For example, suppose you have a dataframe that is 50-100% of your core memory in size. A familiar way to save this object for later is to store the dataframe in an HDF5 file:
import pandas as pd
df = pd.DataFrame(…)
x = pd.HDFStore('store.h5', 'w')
x['dframe'] = df
This works well with smaller datasets, but will most likely blow up (or at least run incredibly slowly) if your data is large enough, even when it fits in memory. We haven’t investigated the cause completely, but it appears that Pandas/PyTables does enough copying to temporary buffers during the HDF storing process to exhaust core memory along the way.
Likewise, trying to read a large dataset from an already created HDF store will cause a MemoryError in Python:
x = pd.HDFStore('store.h5', 'r')
df = x['dframe']  # throws MemoryError
The solution to both these problems is out-of-core processing, which is a fancy term for breaking your work into bite-sized chunks that fit comfortably in core memory and processing one chunk at a time.
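To make the idea concrete, here is a minimal sketch of the pattern, with a toy generator standing in for the real data source (in practice each chunk would come from something like pd.read_csv(…, chunksize=…)); the chunk count and sizes are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stream of chunks standing in for a dataset too big for RAM.
def chunks(n_chunks=10, rows_per_chunk=1000):
    for _ in range(n_chunks):
        yield pd.DataFrame({"value": np.arange(rows_per_chunk)})

# Process one bite-sized chunk at a time; only a single chunk is ever
# resident in core memory.
total = 0
for chunk in chunks():
    total += chunk["value"].sum()

print(total)  # a running aggregate, computed without loading everything at once
```

The key point is that the aggregate (here, a running sum) is updated incrementally, so memory usage stays bounded by the chunk size rather than the dataset size.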
Thus, for storing a big file, we should break it into chunks and append each chunk to the HDF store sequentially:
x = pd.HDFStore('store.h5', mode = 'a')
for chunk in chunks:
    x.append('dframe', chunk, data_columns = True, index = True)
Note the changes here:
- We opened/created the HDF store in append mode (mode = 'a'), allowing us to write to the store sequentially
- We iterate through the chunks, and
- We use the HDF store’s append method to append each chunk to the stored dataframe
Note the data_columns = True parameter: this tells the HDF store to allow queries on the dataframe’s columns. Without it, the dataframe stored in the HDF file would not support queries against column values. This contrasts with the usual in-memory storage syntax, x['dframe'] = df, which automagically creates all the indices necessary for searching. Alternatively, you could pass data_columns a list of column names, data_columns = […], which cuts down on the overhead of creating the search indices; setting it to True simply indexes every column in the appended dataframe.
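As a sketch of the list form (file name and columns here are hypothetical), indexing only the column you intend to query keeps the index overhead down:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(100), "b": np.random.randn(100)})

# Index only column 'a' for querying; 'b' is stored but not queryable.
with pd.HDFStore("store_dc.h5", mode="w") as store:
    store.append("dframe", df, data_columns=["a"])

with pd.HDFStore("store_dc.h5", mode="r") as store:
    subset = store.select("dframe", "a < 10")  # query runs against 'a'

print(len(subset))  # 10
```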
There are a couple of options for reading large data files in Pandas. One straightforward method is to read a chunk at a time, process it, then read the next chunk:
x = pd.HDFStore('store.h5', 'r')
for df in x.select('dframe', chunksize = 10):
    # process this chunk
This will process the dataframe named 'dframe' in the HDF file store.h5 ten rows at a time. You can tune this number to the structure of your dataset.
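A self-contained sketch of this chunked-read pattern (with a hypothetical file name and a toy dataframe small enough to build in memory):

```python
import numpy as np
import pandas as pd

# Build a small store to read back in chunks.
df = pd.DataFrame({"value": np.arange(100)})
with pd.HDFStore("store_chunks.h5", mode="w") as store:
    store.append("dframe", df, data_columns=True)

# Read back 10 rows at a time; only one chunk lives in memory per iteration.
total = 0
with pd.HDFStore("store_chunks.h5", mode="r") as store:
    for chunk in store.select("dframe", chunksize=10):
        total += chunk["value"].sum()

print(total)  # same answer as summing the whole column at once
```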
Another way is to pass a PyTables query to the HDF file:
df = x.select('dframe', '(column1 == 1.0) & (column2 > 2.5) & (column3 < 10) & …')
If your query is specific enough, this will likely return a dataframe that fits easily in memory. You can process that subset, then move on to the next one via another query in a select statement.
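For instance, a sketch of walking a store one query-defined subset at a time, here partitioned by a hypothetical year column (file name and data are made up):

```python
import numpy as np
import pandas as pd

# A store partitioned by a 'year' column; we pull one year at a time.
df = pd.DataFrame({"year": np.repeat([2019, 2020, 2021], 4),
                   "value": np.arange(12)})
with pd.HDFStore("store_years.h5", mode="w") as store:
    store.append("dframe", df, data_columns=["year"])

results = {}
with pd.HDFStore("store_years.h5", mode="r") as store:
    for year in (2019, 2020, 2021):
        # Each select returns only the matching subset, which fits in memory.
        subset = store.select("dframe", f"year == {year}")
        results[year] = subset["value"].sum()

print(results)
```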
During our research we also found a library called Blaze, which seems to have some of this out-of-memory functionality built in. That said, we discovered it after coding our own solution, so it is unlikely we will use Blaze in any production applications.
Once you have created your HDF store, you might find that you want to compress it for storage. With Pandas, reading a compressed file can often be faster than reading an uncompressed one, simply because the time spent on disk I/O outweighs the time spent decompressing. PyTables provides a handy command-line tool for compressing an HDF file called ptrepack:
sudo ptrepack -v --chunkshape=auto --propindexes --complevel=9 --complib=zlib nocompression.h5 compressed.h5
This example compresses nocompression.h5 with zlib into a new file called compressed.h5. The savings can be substantial, depending on the indices you create for searching.
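Alternatively, Pandas can apply the same zlib compression at write time via the HDFStore complevel and complib parameters, avoiding a separate repacking step. A sketch, with hypothetical file names and deliberately compressible toy data:

```python
import os

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.zeros(100_000)})  # highly compressible data

# Write the same frame twice: once plain, once zlib-compressed at level 9.
with pd.HDFStore("plain.h5", mode="w") as store:
    store.append("dframe", df)
with pd.HDFStore("small.h5", mode="w", complevel=9, complib="zlib") as store:
    store.append("dframe", df)

# The compressed store should be noticeably smaller on disk.
print(os.path.getsize("plain.h5"), os.path.getsize("small.h5"))
```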
Want to learn how to mine social data sources like Google Trends, StockTwits, Twitter, and Estimize? Make sure to download our book, Intro to Social Data for Traders.