What Is NetCDF, and When Should You Use It?

NetCDF is a data format for science and engineering data. It is also a set of free and open software libraries and tools that support the format.

NetCDF works better than a database for most science and engineering data, which usually does not fit well into the relational database model.

NetCDF has been used for decades by the Earth science community to store and share weather and climate data. It is also used by other science and engineering communities.

Unique Needs for Science Data

Science data needs are different from the data needs of commercial entities, like Google or Amazon.

Why Not Use Databases?

Databases are all about tables of data. Each table is a two-dimensional array: one row per record, one column per field. For example, a table of customer data at Amazon will have your customer ID number, which allows them to look up your record, and then fields like "first name", "last name", "street address 1", "street address 2", etc.

These are all the fields that you filled out on various web forms when you signed up for Amazon services.

Sometimes tables will have sub-tables. There may be multiple addresses associated with each customer, so instead of storing your street address in the customer table, Amazon probably has an address table, which holds your customer ID, an address ID, and a record for every address you ever entered.
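
To make the table-and-ID idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are invented for illustration; they are assumptions, not a description of any real company's schema.

```python
import sqlite3

# In-memory database; all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        first_name  TEXT,
        last_name   TEXT
    );
    CREATE TABLE address (
        address_id  INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        street_1    TEXT,
        street_2    TEXT
    );
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada', 'Lovelace')")
conn.executemany(
    "INSERT INTO address VALUES (?, ?, ?, ?)",
    [(10, 1, "12 Example St", ""), (11, 1, "34 Sample Ave", "Apt 5")],
)

# The customer ID is the link that locates every address for one customer.
addresses = conn.execute(
    "SELECT street_1 FROM address WHERE customer_id = ? ORDER BY address_id",
    (1,),
).fetchall()
```

The point of the sketch is that each piece of data is found by following an ID from one table to another, which suits small records with many relationships.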

Multiple Dimensions

Science data is not like customer data. With customer data there is a lot of emphasis on the relationship between pieces of data. These relationships are packaged into tables and ID fields, and using the ID fields and the tables, you can locate any particular datum.

Science data typically involves more than two dimensions. An atmospheric scientist will be calculating data in four dimensions (latitude, longitude, vertical level, and time). Storing 4D data in tables is laborious and would result in slow read access.

NetCDF has much better support for multiple dimensions than databases. With netCDF, the array in the C/Fortran/Java program can be handed directly to the netCDF API, which will store it without breaking it into tables.
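
The difference between an array layout and a table layout can be sketched in a few lines of standard-library Python. This does not use the netCDF API and is not a description of the netCDF file format; the grid sizes and the flat_index helper are hypothetical, chosen only to show why a multidimensional array needs no per-value keys while a table does.

```python
from array import array

# Hypothetical, deliberately tiny grid: time, level, latitude, longitude.
NTIME, NLEV, NLAT, NLON = 2, 3, 4, 5

def flat_index(t, k, j, i):
    """Row-major (C order) offset of element (t, k, j, i)."""
    return ((t * NLEV + k) * NLAT + j) * NLON + i

# Array layout: one contiguous block of 32-bit floats, no keys stored.
n_values = NTIME * NLEV * NLAT * NLON
grid = array("f", [0.0] * n_values)
grid[flat_index(1, 2, 3, 4)] = 273.15

# Table layout: every value has to carry its four index columns with it.
rows = [
    (t, k, j, i, grid[flat_index(t, k, j, i)])
    for t in range(NTIME)
    for k in range(NLEV)
    for j in range(NLAT)
    for i in range(NLON)
]

# Bytes used by the values alone in the contiguous layout.
array_bytes = grid.itemsize * len(grid)
```

In the array layout the position of a value encodes its coordinates, so the four index columns of the table version are pure overhead, and they grow with every value stored.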

Reading ALL the Data

Few commercial databases see the same read access patterns that science data does. No one ever sends a web request to Amazon asking for all data from a million products at once. Instead, users ask about one or two products in a vast database of billions of products. The database is optimized to find and return that one record quickly.

But it is not good at returning millions of records. Tiny delays, like those caused by copying data around in memory, are magnified by the quantity of records needed.

Science users frequently read all the data, not just tiny sub-samples.

Sharing Data

Scientists want to share data. Science is a global enterprise, and data needs to be shared around the world. NetCDF is a well-defined and accepted global standard.

Archival and Stewardship

Science data is for the long term. If Amazon goes out of business, no one will really mind that their giant product database is lost.

But if NOAA goes out of business, scientists present and future will still need to be able to read all the science data produced by the organization. So the data must be able to outlive the organization that produces it.

Since netCDF is a free and open source project, users can always maintain access to their data.

NetCDF also provides good mechanisms for annotating data with metadata. 

Much More Data

Science data also differs in sheer quantity. Instead of storing an address, which is a couple of hundred bytes of data at most, the scientist is going to store billions of bytes of data.
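
A quick back-of-the-envelope calculation shows how fast a gridded variable reaches billions of bytes. The grid dimensions below are assumptions picked for illustration (a quarter-degree global grid, 37 vertical levels, 24 hourly time steps, 32-bit values), not figures from any particular model.

```python
# Hypothetical grid dimensions, for illustration only.
nlon, nlat, nlev, ntime = 1440, 721, 37, 24
bytes_per_value = 4  # one 32-bit float per grid point

n_values = nlon * nlat * nlev * ntime
total_bytes = n_values * bytes_per_value

print(f"{n_values:,} values, {total_bytes / 1e9:.2f} GB")
```

Under these assumptions, a single variable for a single day comes to roughly 3.7 GB, compared with a few hundred bytes for an address record.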

Commercial database systems are a poor fit for the quantity of data produced by today's models and instruments. Terabytes of data are the new normal.

