Orchestrating a climate modeling data pipeline with Andre R. Erler

Presentation on Saturday from 3:10 PM to 3:20 PM in Room 1160.

In order to run high-resolution regional climate models, it is necessary to interpolate and pre-process large amounts of data from a global climate model at the boundaries of the regional model. Several C and Fortran tools are available in the scientific community to achieve different aspects of this task, but communication between these tools is limited to the filesystem (the program/tool reads input from a file and writes output to a file). In a High Performance Computing (HPC) environment, filesystem access is a bottleneck and temporary files should be avoided.

In this talk I will show how a Python driver module and an in-memory filesystem (RAM disk) can be used to orchestrate the data flow between various tools without creating temporary files on disk and fully automate the entire process. Except for the first input and the last output step, all file I/O is redirected to the RAM disk. The process can also be parallelized in the Python driver module by distributing different input files to different processes using Python multiprocessing. The use of this technique leads to a speed-up of 800% compared to traditional methods, and requires no human intervention.
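The data flow described above can be sketched roughly as follows. This is a minimal illustration, not the PyWPS implementation: it assumes a Linux-style RAM-backed directory at /dev/shm (falling back to the ordinary temp directory if that is unavailable), and it uses a small Python one-liner as a stand-in for the external C/Fortran tools. The names `run_tool`, `preprocess_one` and `preprocess_all` are hypothetical.

```python
import multiprocessing
import os
import shutil
import subprocess
import sys
import tempfile
from functools import partial

# Assumption: /dev/shm is a writable RAM-backed tmpfs (common on Linux).
# Otherwise fall back to the regular temp directory so the sketch still runs.
_SHM = "/dev/shm"
RAM_DISK = _SHM if os.path.isdir(_SHM) and os.access(_SHM, os.W_OK) else tempfile.gettempdir()

# Stand-in for an external tool that can only communicate through files:
# it reads argv[1], appends a tag line (argv[3]) and writes the result to argv[2].
FAKE_TOOL = (
    "import sys; "
    "open(sys.argv[2], 'w').write(open(sys.argv[1]).read() + sys.argv[3] + '\\n')"
)


def run_tool(src, dst, tag):
    """Run one file-to-file tool as a separate process."""
    subprocess.run([sys.executable, "-c", FAKE_TOOL, src, dst, tag], check=True)


def preprocess_one(input_path, output_dir):
    """Push one input file through two stages, keeping intermediates on the RAM disk."""
    scratch = tempfile.mkdtemp(prefix="pipeline_", dir=RAM_DISK)
    try:
        stage1 = os.path.join(scratch, "stage1.tmp")
        stage2 = os.path.join(scratch, "stage2.tmp")
        run_tool(input_path, stage1, "interpolated")  # only read from real disk
        run_tool(stage1, stage2, "formatted")         # RAM disk to RAM disk
        final = os.path.join(output_dir, os.path.basename(input_path) + ".out")
        shutil.copy(stage2, final)                    # only write to real disk
        return final
    finally:
        # Always free the RAM disk, even if a tool failed.
        shutil.rmtree(scratch, ignore_errors=True)


def preprocess_all(input_paths, output_dir, processes=2):
    """Distribute input files over worker processes."""
    with multiprocessing.Pool(processes) as pool:
        return pool.map(partial(preprocess_one, output_dir=output_dir), input_paths)
```

In a script, `preprocess_all(list_of_inputs, "output")` would be called under an `if __name__ == "__main__":` guard. Each worker gets its own scratch directory, so parallel processes never collide on intermediate file names.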

Different input datasets are supported, and new datasets can be added easily thanks to the object-oriented implementation: at every stage of the pre-processing pipeline, a dataset method can be overridden so that a different tool is used depending on the input dataset. This would not have been possible in a simple scripting language that might otherwise be used to automate such a process.
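One way to picture this per-dataset customization is a base class whose methods each describe one pipeline stage, with subclasses replacing only the stages that differ. The sketch below is an assumption about the general pattern, not the actual PyWPS class hierarchy; the class names, tool names and command-line flags are all hypothetical.

```python
class Dataset:
    """Base dataset: each pipeline stage is a method returning the command to run."""

    def extract(self, path):
        # Hypothetical default extraction tool.
        return ["default_extract", path]

    def interpolate(self, path):
        # Hypothetical default interpolation tool.
        return ["default_interp", path]

    def pipeline(self, path):
        """Assemble the full sequence of commands for one input file."""
        return [self.extract(path), self.interpolate(path)]


class HypotheticalGCM(Dataset):
    """A dataset that needs a different extraction tool; everything else is inherited."""

    def extract(self, path):
        return ["special_extract", "--calendar", "noleap", path]


# The driver only ever calls pipeline(); the dataset object decides which tools run.
commands = HypotheticalGCM().pipeline("input.nc")
```

Supporting a new dataset then amounts to adding a subclass that overrides the stages where it differs, leaving the driver code unchanged.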

This module, called PyWPS, is part of the WRF Tools package, a set of Python modules and shell scripts designed to facilitate the operation of a regional climate model (the Weather Research and Forecasting model, WRF) in an HPC environment. The package is capable of running the model autonomously over extended periods of time, including automatic crash handling and restarts, automatic pre- and post-processing, and archiving.

In the presentation I will first provide some context on regional climate modeling and its computational challenges, before detailing the main design features of the Python WRF pre-processing system (PyWPS).

The package is available on GitHub: https://github.com/aerler/WRF-Tools

Andre R. Erler Bio

Andre is a young researcher and climate modeler; he runs regional and global climate models at the SciNet High Performance Computing facility and analyses their output. He uses Python and its scientific software stack for data handling (or "data plumbing"), analysis and visualization, and develops tools for these tasks. Andre is also interested in machine learning and the use of data science techniques in and outside of climate science, and is somewhat concerned about the state of software development in science.

He cares deeply about open source software, open science, the environment and sustainable global development.