See this page to learn about some recent improvements to BoxLib:
- Tiling for manycore optimization
- Calling the HPGMG linear solver
- In situ and in transit data analysis using sidecars
To get a copy of the latest version of the BoxLib repository using git, please visit our Downloads page.
The BoxLib User's Guide is available in the BoxLib git repository in BoxLib/Docs/ (type 'make' in that directory to build it). This document contains step-by-step instructions for running simulations in parallel with multiple levels of refinement, with accompanying tutorial applications in BoxLib/Tutorials/.
We have created tutorials in the BoxLib release that give examples of how to use the extensive functionality that is available. Among other things, the tutorials describe:
- How to run a simulation using hybrid MPI/OpenMP
- How to customize boundary condition routines
- How to define a fixed multilevel grid structure
- How to run with adaptive mesh refinement
- How to use BoxLib linear solvers to solve a general Helmholtz operator
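The linear solvers mentioned in the last tutorial handle Helmholtz-type operators of the general form (alpha*phi - beta*Laplacian(phi)) = rhs. As a minimal illustration of what such an operator looks like when discretized, here is a hypothetical 1D Jacobi iteration on a uniform grid; this is a sketch of the operator form only, not the BoxLib solver API.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: solve (alpha*phi - beta*phi'') = rhs with Jacobi
// iteration on a uniform 1D grid, homogeneous Dirichlet boundaries.
// Illustrates the Helmholtz-type operator form; NOT the BoxLib API.
std::vector<double> jacobi_helmholtz(const std::vector<double>& rhs,
                                     double alpha, double beta,
                                     double dx, int iters) {
    const int n = static_cast<int>(rhs.size());
    std::vector<double> phi(n, 0.0), next(n, 0.0);
    const double diag = alpha + 2.0 * beta / (dx * dx);  // diagonal of the stencil
    for (int it = 0; it < iters; ++it) {
        for (int i = 0; i < n; ++i) {
            double left  = (i > 0)     ? phi[i - 1] : 0.0;  // Dirichlet phi = 0
            double right = (i < n - 1) ? phi[i + 1] : 0.0;
            // Jacobi update: move off-diagonal terms to the right-hand side.
            next[i] = (rhs[i] + beta * (left + right) / (dx * dx)) / diag;
        }
        phi.swap(next);
    }
    return phi;
}
```

Setting beta to zero reduces the operator to a simple scaling by alpha, which is a convenient sanity check on the discretization.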
Summary of Key Features
BoxLib is the block-structured AMR framework that is the basis for many of CCSE's codes.
- Support for block-structured AMR with optional subcycling in time
- Support for cell-centered, face-centered and node-centered data
- Support for hyperbolic, parabolic and elliptic solves on hierarchical grid structure
- C++ and Fortran90 versions
- Supports hybrid programming model with MPI and OpenMP
- Basis of mature applications in combustion, astrophysics, cosmology, and porous media
- Demonstrated scaling to over 200,000 processors
- Freely available to interested users
BoxLib contains all the functionality needed to write a parallel, block-structured AMR application. The fundamental parallel abstraction is the MultiFab, which holds the data on the union of grids at a level. A MultiFab is composed of FABs; each FAB is an array of data on a single grid. During each MultiFab operation the FABs composing that MultiFab are distributed among the cores, and MultiFabs at each level of refinement are distributed independently.

The software supports two data distribution schemes, as well as a dynamic switching scheme that decides which approach to use based on the number of grids at a level and the number of processors. The first scheme is based on a heuristic knapsack algorithm; the second is based on a Morton-ordering space-filling curve.

MultiFab operations are performed with an owner-computes rule: each processor operates independently on its local data. For operations that require data owned by other processors, the MultiFab operations are preceded by a data exchange between processors. Each processor contains the metadata needed to fully specify the geometry and processor assignments of the MultiFabs. At a minimum, this requires storing an array of boxes specifying the index space region for each AMR level of refinement. The metadata can thus be used to dynamically evaluate the communication patterns needed to share data among processors, enabling us to optimize communication within the algorithm. One advantage of computing with fewer, larger grids in the hybrid OpenMP-MPI approach (see below) is that the size of the metadata is substantially reduced.
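The knapsack scheme can be sketched as a greedy bin-packing pass: assign the heaviest grids first, each to the currently least-loaded rank. The code below is a hypothetical, self-contained illustration (the weights stand in for per-grid work estimates such as box volumes); it is not the actual BoxLib implementation, and the Morton-ordering alternative is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical sketch of a knapsack-style distribution: sort grids by
// descending weight, then assign each to the least-loaded rank.
// NOT the actual BoxLib code.
std::vector<int> knapsack_distribute(const std::vector<long>& weights,
                                     int nranks) {
    // Sort grid indices by descending weight.
    std::vector<int> order(weights.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return weights[a] > weights[b]; });

    // Min-heap of (current load, rank) so the lightest rank is on top.
    typedef std::pair<long, int> Load;
    std::priority_queue<Load, std::vector<Load>, std::greater<Load> > ranks;
    for (int r = 0; r < nranks; ++r) ranks.push(Load(0, r));

    std::vector<int> owner(weights.size());
    for (std::size_t k = 0; k < order.size(); ++k) {
        int g = order[k];
        Load top = ranks.top();  // least-loaded rank
        ranks.pop();
        owner[g] = top.second;
        ranks.push(Load(top.first + weights[g], top.second));
    }
    return owner;
}
```

With weights {8, 7, 3, 2} on two ranks this assignment balances both ranks at a load of 10, which is the kind of outcome the heuristic aims for.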
The basic parallelization strategy uses a hierarchical programming approach for multicore architectures based on both MPI and OpenMP. In the pure-MPI instantiation, at least one grid at each level is distributed to each core, and each core communicates with every other core using only MPI. In the hybrid approach, where on each socket there are n cores which all access the same memory, we can instead have one larger grid per socket, with the work associated with that grid distributed among the n cores using OpenMP.
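In the hybrid case, the per-grid work loop is what gets shared among a socket's cores. A minimal sketch of that inner piece, with the MPI layer omitted and the grid flattened to a single array (all names here are illustrative, not BoxLib's):

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the hybrid strategy's threaded piece: one large
// grid per MPI rank (one rank per socket), with the cell loop for that grid
// divided among the socket's cores by OpenMP. If compiled without OpenMP,
// the pragma is ignored and the loop runs serially with the same result.
void scale_grid(std::vector<double>& grid, double factor) {
#pragma omp parallel for
    for (long i = 0; i < (long)grid.size(); ++i) {
        grid[i] *= factor;  // each thread handles its own chunk of cells
    }
}
```

Because each cell update is independent, the loop needs no synchronization beyond the implicit barrier at its end, which is why one large grid per socket parallelizes cleanly this way.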
Data for checkpoints and analysis are written in a self-describing format that consists of a directory for each time step written. Checkpoint directories contain all necessary data to restart the calculation from that time step. Plotfile directories contain data for postprocessing, visualization, and analytics, which can be read using AmrVis, a customized visualization package developed at LBNL for visualizing data on AMR grids, or VisIt. Within each checkpoint or plotfile directory is an ASCII header file and subdirectories for each AMR level. The header describes the AMR hierarchy, including number of levels, the grid boxes at each level, the problem size, refinement ratio between levels, step time, etc. Within each level directory are the MultiFab files for each AMR level. Checkpoint and plotfile directories are written at user-specified intervals.
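To make the "self-describing" aspect concrete, here is a hypothetical sketch of the kind of ASCII header such a directory might contain: level count, refinement ratio, step time, and per-level grid counts. The field names and layout are purely illustrative and are not the actual BoxLib header format.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch of a self-describing ASCII header for a plotfile
// directory. Illustrative layout only; NOT the real BoxLib format.
std::string write_header(int nlevels, int ref_ratio, double time,
                         const std::vector<int>& boxes_per_level) {
    std::ostringstream h;
    h << "nlevels " << nlevels << "\n"
      << "ref_ratio " << ref_ratio << "\n"
      << "time " << time << "\n";
    for (int l = 0; l < nlevels; ++l)
        h << "level " << l << " nboxes " << boxes_per_level[l] << "\n";
    return h.str();
}
```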
Restarting a calculation can present difficult issues for reading data efficiently. In the worst case, every processor needs data from every file. If multiple processors try to read from the same file at the same time, performance can suffer, with extreme cases causing file system thrashing. Since the number of files is generally not equal to the number of processors, and each processor may need data from multiple files, input during restart is coordinated so that the data are read efficiently. Each data file is opened by only one processor at a time. The IOProcessor creates a database mapping files to processors, coordinates the read queues, and interleaves reading its own data. Each processor reads all the data it needs from the file it currently has open, and the code tries to keep the number of input streams equal to the number of files at all times. Checkpoint and plotfiles are portable to machines with a different byte ordering and precision from the machine that wrote them; byte-order and precision translations are done automatically, if required, when the data are read.
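The byte-order translation amounts to reversing the bytes of each value when the stored endianness differs from the host's. A minimal sketch of that operation for a double (not the actual BoxLib I/O code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of byte-order translation: reverse the bytes of a
// value read from a file whose endianness differs from the host's.
// NOT the actual BoxLib I/O code.
double swap_bytes(double v) {
    uint8_t b[sizeof(double)];
    std::memcpy(b, &v, sizeof(double));
    for (std::size_t i = 0; i < sizeof(double) / 2; ++i) {
        uint8_t t = b[i];
        b[i] = b[sizeof(double) - 1 - i];
        b[sizeof(double) - 1 - i] = t;
    }
    std::memcpy(&v, b, sizeof(double));
    return v;
}
```

Since the swap is a pure byte-level reversal, applying it twice returns the original value exactly, which makes the translation safe to test in isolation.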
Here we present weak scaling results for several of our codes on the Cray XT5 Jaguarpf at OLCF. Jaguarpf has two hex-core sockets on each node. We assign one MPI process per node and spawn a single thread on each of the 12 cores. Results are shown for our compressible astrophysics code, CASTRO; the low Mach number code, MAESTRO; and our low Mach number combustion code, LMC. In the MAESTRO and CASTRO tests, we simulate a full spherical star on a 3D grid with one refined level (2 total levels). LMC is tested on a 3D methane flame with detailed chemistry using two refined levels. MAESTRO and LMC scale well to 50K-100K cores, whereas CASTRO scales well to over 200K cores. The overall scaling behavior for MAESTRO and LMC is not as close to ideal as that of CASTRO due to the communication-intensive linear solves performed at each time step. However, these low Mach number codes are able to take a much larger time step than explicit compressible formulations in the low Mach number regime.
The plotfile format generated by BoxLib can be read by VisIt, AmrVis, and yt.