How to save & load large pandas dataframes
Update (April 2018): Use feather format.
I have recently started using Pandas for many projects, but one feature which I felt was missing was a native file format for the data. This becomes especially important as the data grows.
Numpy has a simple data format which is just a header plus a memory dump, which is great as it allows you to memory-map the file instead of reading it all into RAM. Pandas does not have the same thing.
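To make the memory-mapping point concrete, here is a minimal sketch of the plain numpy round trip. The file name and the small example array are made up for illustration; `np.save` and `np.load(..., mmap_mode='r')` are the standard numpy calls.

```python
import os
import tempfile

import numpy as np

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "example.npy")

# A small array standing in for a large dataset
arr = np.arange(12, dtype=np.float64).reshape(3, 4)
np.save(path, arr)

# mmap_mode='r' maps the file read-only; pages are read lazily on access
mapped = np.load(path, mmap_mode="r")
print(type(mapped).__name__, mapped.shape, mapped.dtype)
print(float(mapped[1, 2]))
```

With a real dataset the array would be far larger than this, and only the parts you actually index would be read from disk.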
After looking at the code a little bit, I realized it's pretty easy to fake it though:
The data in a Pandas DataFrame is held in a numpy array (a single one when all columns share the same dtype).
You can save that array using the numpy format.
The numpy code does not care about the file beyond the header: it just maps the rest of the data into memory.
In particular, it does not care if there is something in the file after the data. Thus, you can save the Pandas extra-data after the numpy array on disk.
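The claim that numpy ignores trailing bytes is easy to check. The sketch below (file name and payload are made up for illustration) appends some arbitrary bytes after a saved array and loads it back both normally and memory-mapped.

```python
import os
import tempfile

import numpy as np

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "trailing.npy")

arr = np.arange(6, dtype=np.int64).reshape(2, 3)
with open(path, "wb") as f:
    np.save(f, arr)                       # header + raw array data
    f.write(b"extra bytes after the array")  # anything can follow

# Both loading paths read exactly the array described by the header
plain = np.load(path)
mapped = np.load(path, mmap_mode="r")
print(np.array_equal(plain, arr), np.array_equal(mapped, arr))
```

This holds for arrays with a plain numeric dtype; object arrays are pickled by numpy and cannot be memory-mapped, so they are outside the scope of this trick.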
I wrote this up. Here is the writing code:
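    import pickle
    import numpy as np

    # Save the raw values in the standard .npy format (binary mode)
    with open(fname, 'wb') as f:
        np.save(f, data.values)
    # Pickle the index and columns and append them after the array data
    meta = data.index, data.columns
    s = pickle.dumps(meta)
    with open(fname, 'ab') as f:
        f.seek(0, 2)
        f.write(s)

We save the array to disk with the numpy machinery, then seek to the end and write out the metadata.
Here is the corresponding loading code:

    import pickle
    import numpy as np
    import pandas as pd

    # Memory-map the array; numpy ignores whatever follows the array data
    values = np.load(fname, mmap_mode='r')
    with open(fname, 'rb') as f:
        # Skip the magic string, the header and the array bytes
        np.lib.format.read_magic(f)
        np.lib.format.read_array_header_1_0(f)
        f.seek(values.nbytes, 1)
        # What remains is the pickled (index, columns) pair
        meta = pickle.loads(f.read())
    frame = pd.DataFrame(values, index=meta[0], columns=meta[1])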
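Putting the two halves together, a Python 3 round trip might look like the sketch below. The function names `save_frame` and `load_frame`, the file name, and the example frame are made up for illustration, and this is not the gist's version. It assumes a frame with a single numeric dtype and a version 1.0 `.npy` header, which is what `np.save` writes for ordinary arrays.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd


def save_frame(fname, frame):
    """Write frame.values as .npy, then append the pickled (index, columns)."""
    with open(fname, "wb") as f:
        np.save(f, frame.values)
        pickle.dump((frame.index, frame.columns), f)


def load_frame(fname):
    """Memory-map the values and rebuild the DataFrame around them."""
    values = np.load(fname, mmap_mode="r")
    with open(fname, "rb") as f:
        # Skip the magic string and header; this sketch only handles 1.0
        version = np.lib.format.read_magic(f)
        if version != (1, 0):
            raise ValueError("unsupported .npy version: %r" % (version,))
        np.lib.format.read_array_header_1_0(f)
        # Skip the array bytes; the rest of the file is the metadata
        f.seek(values.nbytes, 1)
        index, columns = pickle.load(f)
    return pd.DataFrame(values, index=index, columns=columns)


# Small example frame standing in for a large one
frame = pd.DataFrame(
    np.arange(6, dtype=np.float64).reshape(3, 2),
    index=["a", "b", "c"],
    columns=["x", "y"],
)
path = os.path.join(tempfile.mkdtemp(), "data.pdy")  # hypothetical name
save_frame(path, frame)
restored = load_frame(path)
print(restored.equals(frame))
```

Whether the restored frame keeps referring to the memory-mapped buffer or copies it depends on the pandas version, so treat the memory-mapping benefit as something to verify on your own setup.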
Check out this gist for a better version of these, which also supports pandas.Series.
As an added bonus, you can load the saved data as a numpy array directly if you do not care for the metadata:
    save_pandas('data.pdy', data)
    raw = np.load('data.pdy')
Note: I'm not sure this code covers all Pandas cases. It fits in my use case and I hope it can be useful for you, but feel free to point out shortcomings (or improvements) in the comments.