Developing h5py has been one of the most rewarding programming experiences I’ve had. H5py is a Python library that lets you read and write HDF5 files, a format used to store all kinds of numerical data, from bathymetry information to NASA images of Earth.
It’s also provided a number of lessons about the differences between how I perceive the software and how users actually interact with it.
One of the objects that h5py provides is a Dataset object. It represents an HDF5 dataset, which is like a big multidimensional array that lives on disk. One of the most useful aspects of h5py is that you can perform slicing, or partial I/O, on datasets; in other words, you can read just the parts you want instead of trying to fit the whole thing in memory. If you have a dataset which is 1000 x 1000 in shape, and you want a little 10 x 10 square somewhere in the middle, you could do, for example:
data_out = mydataset[300:310,400:410]
H5py will communicate your selection to the HDF5 machinery and only read the 10 x 10 slice off disk. Experienced Python/NumPy users will recognize this as the standard NumPy slicing syntax, which h5py borrows for this purpose. A large amount of code in the h5py “high level” interface is dedicated to supporting this feature.
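To make that concrete, here is a minimal, self-contained sketch of the whole round trip. The file name "example.hdf5" and the dataset name "mydataset" are placeholders I made up for the illustration, and the data is just a counter so the result is easy to check; treat it as one way to try this out rather than a recommended layout.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file and dataset names, used only for this illustration.
    path = os.path.join(tmpdir, "example.hdf5")

    # Write a 1000 x 1000 dataset whose values count up from 0.
    with h5py.File(path, "w") as f:
        values = np.arange(1000 * 1000, dtype="i4").reshape(1000, 1000)
        f.create_dataset("mydataset", data=values)

    # Reopen the file and read only a 10 x 10 region of the dataset.
    with h5py.File(path, "r") as f:
        mydataset = f["mydataset"]
        data_out = mydataset[300:310, 400:410]

# The result is an ordinary in-memory NumPy array.
print(data_out.shape)
```

Once the slice has been read, data_out is a plain NumPy array and no longer depends on the file being open.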
For historical reasons, Dataset objects in h5py also have a little wart/property named “value”, which when accessed simply reads the entire dataset from disk and dumps it into an array.
Guess which one I generally see people using.
At first I was a little irritated (especially when I got some bug reports relating to poor performance using the “value” property!), but I’m beginning to understand how some of the decisions I made encouraged people to interact with h5py like this. First, some of the earliest example code I released used “.value” to read in data. Other people developed their own code based on the h5py documentation (including the “.value” example), and posted their code online. Second, it’s not obvious to someone sitting at the IPython prompt that Dataset objects support slicing. It is obvious that they have a “.value” property. Third, using .value generally works just fine. Until, that is, you put it into a loop, or try to use it on a 200GB dataset.
Fourth, and this is the reason .value was added in the first place back in the distant past, nobody knew how to read data from a scalar dataset! Do you? Create a NumPy scalar array (shape “()”) and try to get something out of it without using item(). It turns out the right way in the NumPy slicing syntax is (and this still looks bizarre to me):
data_out = mydataset[()]
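If you want to see the empty-tuple index in action, the sketch below tries it on both a NumPy scalar array and a scalar HDF5 dataset. The names "scalar.hdf5" and "answer" are invented for the example, and 42 is just a stand-in value.

```python
import os
import tempfile

import h5py
import numpy as np

# A NumPy scalar array has shape (), and indexing with () pulls out the value.
arr = np.array(42)
numpy_value = arr[()]

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file and dataset names, used only for this illustration.
    path = os.path.join(tmpdir, "scalar.hdf5")

    with h5py.File(path, "w") as f:
        f.create_dataset("answer", data=42)

    with h5py.File(path, "r") as f:
        mydataset = f["answer"]
        dataset_shape = mydataset.shape  # () for a scalar dataset
        data_out = mydataset[()]         # same empty-tuple index as NumPy

print(arr.shape, dataset_shape)
print(numpy_value, data_out)
```

The same index also works on non-scalar datasets, where it reads everything, which is what makes it a drop-in replacement for the old property.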
So get the word out! You can slice up Dataset objects just like arrays! Free yourself from .value and the chains of the past!
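As a parting sketch, here is the loop problem mentioned earlier in miniature. I use mydataset[()] as a stand-in for the whole-dataset read that .value performed; the file name, dataset name, and the small 200 x 50 shape are all made up so the example runs quickly.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file and dataset names, used only for this illustration.
    path = os.path.join(tmpdir, "loop.hdf5")

    with h5py.File(path, "w") as f:
        values = np.arange(200 * 50, dtype="f8").reshape(200, 50)
        f.create_dataset("mydataset", data=values)

    with h5py.File(path, "r") as f:
        mydataset = f["mydataset"]
        n_rows = mydataset.shape[0]

        # Wasteful: every iteration reads the entire dataset, then keeps one row.
        slow_sums = [mydataset[()][i].sum() for i in range(n_rows)]

        # Better: every iteration asks for just the row it needs.
        fast_sums = [mydataset[i, :].sum() for i in range(n_rows)]

print(slow_sums == fast_sums)
```

Both loops compute the same row sums; the difference is that the first one reads the full dataset 200 times, which is what hurts once the dataset is large.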