Posts

The Development Process for NetCDF

What Management Style Is Used for the NetCDF Project? NetCDF is a very successful free software package. It's used by NASA, NOAA, the ESA, and the IPCC, to list just a few. It's a global standard for weather and climate data, and is also used in other sciences. As such, the netCDF library is a critical piece of infrastructure for many very important software systems. It has performed reliably and well for more than 20 years, yet it has not remained static. Features added over the last 20 years include NetCDF-Java and a whole ecosystem of Java data tools, NetCDF-4 and HDF5 integration, remote data access with OPeNDAP, and two large-file-capable binary formats, 64-bit offset and CDF5. NetCDF continues to grow and evolve with the hardware and software that make up science data processing systems. It is as relevant for cutting-edge computer science today as it was when introduced. You may well wonder about the development process used for this very successful project.

Parallel Access to NetCDF Data

Parallel access to netCDF data files can speed up read/write times from parallel code - that is, code that is written with the MPI library to run on multiple (often many) cores. How Much Does it Help? The amount of improvement you can get from parallel access depends heavily on your hardware. Running tests on a high-powered, multi-core Linux box, I can see improvements of 4X or even greater. On HPC systems with a parallel file system, you should be able to do better. But parallel IO performance (IMHO) does not scale as well as processor performance. On systems with a large number of cores, you will max out the hardware channels to your storage before you are using all the cores for IO. Using a subset of the cores for IO is a good solution, but annoying to program. This functionality is provided by the PIO library (see below). Building NetCDF with Parallel Access In order to use parallel access with netCDF, you must build netCDF correctly. The following must be true: All l

The Different Binary Formats of NetCDF

Under the covers, the netCDF library uses several different binary formats for data. It's important for advanced users to understand the different formats, and to use the correct one. The Classic Format Originally there was just one format for netCDF files. It is described in detail in the netCDF documentation, but it's not necessary to understand the low-level details. Once we introduced the 64-bit offset format, we needed a name for this original format, so we call it the "classic" format. The classic format can be read and written by any installation of the netCDF library, even if all the bells and whistles are turned off at install. The classic format has limits which make it hard to use on data files larger than 2 GB. The 64-bit Offset Format In order to address the 2 GB limits of the classic format, the 64-bit offset format was contributed by a user, and introduced in netCDF version 3.6.0. The 64-bit offset format changes the classic format very
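As a rough illustration of how the formats mentioned in these posts differ on disk, the sketch below guesses a file's format from its leading bytes. The function name `detect_netcdf_format` and the returned labels are hypothetical, not part of any netCDF API. It assumes the HDF5 signature sits at byte 0, which is the usual case for netCDF-4 files but not guaranteed for every HDF5 file.

```python
# Leading "magic" bytes of each on-disk format.
# The classic family starts with b"CDF" followed by a one-byte version number.
_CDF_VERSIONS = {1: "classic", 2: "64-bit offset", 5: "CDF5"}

# netCDF-4 files are HDF5 files, which begin with this 8-byte signature.
_HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def detect_netcdf_format(path):
    """Guess the binary format of a file from its first 8 bytes.

    Hypothetical helper for illustration only. It does not validate the
    rest of the file, and it only looks for the HDF5 signature at offset 0.
    """
    with open(path, "rb") as f:
        head = f.read(8)

    if head.startswith(_HDF5_SIGNATURE):
        return "netCDF-4 (HDF5)"
    if head[:3] == b"CDF" and len(head) >= 4:
        # Indexing bytes in Python 3 yields an int, so head[3] is the version.
        return _CDF_VERSIONS.get(head[3], "unknown")
    return "unknown"
```

Calling `detect_netcdf_format("some_file.nc")` on a file you already have returns one of the labels above. In practice, `ncdump -k` reports the format kind and is the more reliable choice.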

What is NetCDF, and When Should You Use It

NetCDF is a data format for science and engineering data. It's also a set of free and open software libraries and tools which support the format. NetCDF works better than a database for most science and engineering data, which usually does not fit well into the relational database model. NetCDF has been accepted for decades by the Earth science community to store and share weather and climate data. It is also used by other science and engineering communities. Unique Needs for Science Data Science data needs are different from the data needs of commercial entities, like Google or Amazon. Why Not Use Databases? Databases are all about tables of data. Each table is a two-dimensional array of fields. For example, a table of customer data at Amazon will have your customer ID number, which allows them to look up your record, and then fields like "first name", "last name", "street address 1", "street address 2", e

Building NetCDF for HPC

Building netCDF from scratch on a High Performance Computing (HPC) platform is a challenge. There are a lot of other libraries involved. This diagram shows all the possible third-party libraries in a netCDF build (the yellow boxes): Why Build from Source? Building the netCDF library from source is not necessary for most users. On most HPC systems, netCDF will already be installed somewhere. Contact your sysadmins and ask how to use it. Sometimes you need to build from source, because: You need the latest version, and the sysadmins haven't installed it. You need some combination of tools and versions that is not already installed. You are the sysadmin, and you have to build the libraries so that all your users don't have to. You want to have full control and understanding of the build process. Building with Autotools The libraries we need to build all have standard "autotools" build systems. This means that the developers use autoconf