Python and HDF5: Unlocking Scientific Data

Introduction

Gain hands-on experience with HDF5 for storing scientific data in Python. This practical guide quickly gets you up to speed on the details, best practices, and pitfalls of using HDF5 to archive and share numerical datasets ranging in size from gigabytes to terabytes. 

Through real-world examples and practical exercises, you’ll explore topics such as scientific datasets, hierarchically organized groups, user-defined metadata, and interoperable files. Examples are applicable for users of both Python 2 and Python 3. If you’re familiar with the basics of Python data analysis, this is an ideal introduction to HDF5. 

+ Get set up with HDF5 tools and create your first HDF5 file
+ Work with datasets by learning the HDF5 Dataset object
+ Understand advanced features like dataset chunking and compression
+ Learn how to work with HDF5’s hierarchical structure, using groups
+ Create self-describing files by adding metadata with HDF5 attributes
+ Take advantage of HDF5’s type system to create interoperable files
+ Express relationships among data with references, named types, and dimension scales
+ Discover how Python mechanisms for writing parallel code interact with HDF5

Over the past several years, Python has emerged as a credible alternative to scientific analysis environments like IDL or MATLAB. Stable core packages now exist for handling numerical arrays (NumPy), analysis (SciPy), and plotting (matplotlib). A huge selection of more specialized software is also available, reducing the amount of work necessary to write scientific code while also increasing the quality of results.

As Python is increasingly used to handle large numerical datasets, more emphasis has been placed on the use of standard formats for data storage and communication. HDF5, the most recent version of the “Hierarchical Data Format” originally developed at the National Center for Supercomputing Applications (NCSA), has rapidly emerged as the mechanism of choice for storing scientific data in Python. At the same time, many researchers who use (or are interested in using) HDF5 have been drawn to Python for its ease of use and rapid development capabilities.

This book provides an introduction to using HDF5 from Python, and is designed to be useful to anyone with a basic background in Python data analysis. Only familiarity with Python and NumPy is assumed. Special emphasis is placed on the native HDF5 feature set, rather than higher-level abstractions on the Python side, to make the book as useful as possible for creating portable files.

Finally, this book is intended to support users of both Python 2 and Python 3. While the examples are written for Python 2, any differences that may trip you up are noted in the text.

Organization

Chapter 1 Introduction
+ Python and HDF5
+ What Exactly Is HDF5?
Chapter 2 Getting Started
+ HDF5 Basics
+ Setting Up
+ The HDF5 Tools
+ Your First HDF5 File
Chapter 3 Working with Datasets
+ Dataset Basics
+ Reading and Writing Data
+ Resizing Datasets
Chapter 4 How Chunking and Compression Can Help You
+ Contiguous Storage
+ Chunked Storage
+ Setting the Chunk Shape
+ Performance Example: Resizable Datasets
+ Filters and Compression
+ Other Filters
+ Third-Party Filters
Chapter 5 Groups, Links, and Iteration: The “H” in HDF5
+ The Root Group and Subgroups
+ Group Basics
+ Working with Links
+ Iteration and Containership
+ Multilevel Iteration with the Visitor Pattern
+ Copying Objects
+ Object Comparison and Hashing
Chapter 6 Storing Metadata with Attributes
+ Attribute Basics
+ Real-World Example: Accelerator Particle Database
Chapter 7 More About Types
+ The HDF5 Type System
+ Integers and Floats
+ Fixed-Length Strings
+ Variable-Length Strings
+ Compound Types
+ Complex Numbers
+ Enumerated Types
+ Booleans
+ The array Type
+ Opaque Types
+ Dates and Times
Chapter 8 Organizing Data with References, Types, and Dimension Scales
+ Object References
+ Region References
+ Named Types
+ Dimension Scales
Chapter 9 Concurrency: Parallel HDF5, Threading, and Multiprocessing
+ Python Parallel Basics
+ Threading
+ Multiprocessing
+ MPI and Parallel HDF5
Chapter 10 Next Steps
+ Asking for Help
+ Contributing