MMT: A Monitoring and Management Tool for Grand Challenge Supercomputer Simulations

Background: Current cosmological N-body simulations are routinely run with tens to hundreds of billions of particles, tracking the growth of structure within large cosmic volumes. These simulations provide the foundation for synthetic multi-wavelength galaxy surveys, which form the theoretical framework for interpreting observational data from next-generation telescopes such as the SKA, Euclid, the ELTs, etc. By coupling simulations to state-of-the-art galaxy formation models to predict galaxy population properties as a function of cosmic time, it is relatively straightforward to generate synthetic observables (e.g. stellar and gas masses, spectral energy distributions) for millions to billions of galaxies over an arbitrarily long baseline in redshift. This process can be automated into a pipeline, but as we move to larger datasets, the pipeline requires more frequent user intervention to ensure quality control - partly because of the fragility of HPC facilities when running cutting-edge simulations, and partly because of the complexity of the process itself. For this reason, we propose the development of a monitoring and management tool (MMT) for grand challenge astrophysical simulations.

Significance: The MMT is designed to provide a simple and efficient remote user interface to track large-scale simulation projects as they run on HPC facilities. Conceptually, it splits into two parts: (1) given a user-defined set of codes in a pipeline, generating and validating the parameter sets, submission scripts, and launch scripts for each stage of the pipeline; and (2) tracking both code performance (e.g. load balancing, I/O) and facility performance during the different stages of the pipeline, ensuring quality control of the data products as well as optimal usage of the HPC facility. Initially, optimal usage (e.g. distribution of cores and memory per node) can be determined by the user through appropriate benchmarking, but future versions of the MMT could determine what is optimal using algorithms that dynamically characterise distributed workloads. The MMT will be accessed via a secure web portal. The importance of the MMT lies in ensuring that projects requiring significant HPC resources - both compute and storage - fully exploit those resources.
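To make part (1) concrete, the following is a minimal sketch of how a pipeline stage might be described, validated, and turned into a batch submission script. The class name, parameter keys, and SLURM-style template are illustrative assumptions for this proposal, not a settled MMT design.

```python
"""Sketch of MMT part (1): validate a stage's parameter set and
generate a submission script. All names here are illustrative."""

from dataclasses import dataclass


@dataclass
class PipelineStage:
    name: str         # e.g. "nbody", "galaxy_formation"
    executable: str   # path to the simulation code
    params: dict      # user-supplied parameter set
    # resource parameters every stage must supply
    required: tuple = ("nodes", "cores_per_node", "walltime")

    def validate(self):
        """Raise if any required resource parameter is missing."""
        missing = [k for k in self.required if k not in self.params]
        if missing:
            raise ValueError(f"{self.name}: missing parameters {missing}")

    def submission_script(self):
        """Render a SLURM-style batch script for this stage."""
        self.validate()
        p = self.params
        return "\n".join([
            "#!/bin/bash",
            f"#SBATCH --job-name={self.name}",
            f"#SBATCH --nodes={p['nodes']}",
            f"#SBATCH --ntasks-per-node={p['cores_per_node']}",
            f"#SBATCH --time={p['walltime']}",
            f"srun {self.executable} {p.get('param_file', '')}".rstrip(),
        ])


# Hypothetical N-body stage of the pipeline
stage = PipelineStage(
    name="nbody",
    executable="/apps/gadget4",
    params={"nodes": 64, "cores_per_node": 128,
            "walltime": "24:00:00", "param_file": "run.param"},
)
print(stage.submission_script())
```

In a full pipeline, a list of such stages would be validated up front and each stage's script submitted (and monitored) in turn, so a mis-specified downstream stage is caught before any compute time is spent.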

Innovation and Impact: The MMT as we envisage it will be applied to our cosmological N-body simulations and galaxy formation models, but it is at its core a generic tool for managing and monitoring any pipeline of multiple distinct codes run on an HPC facility. Importantly, the flexibility and extensibility of the tool means that it can be developed to be increasingly sophisticated in how it monitors the codes and the facility (e.g. dynamically redistributing workloads as runs progress to ensure optimal load balancing). From our perspective, the MMT will make it significantly easier for us to optimise our usage of HPC facilities and increase the efficiency with which we monitor and manage the very large datasets involved. An easy-to-use, standardised MMT that can be applied to such a pipeline fills a gap in the scientific HPC community, and we anticipate making the MMT available to the wider community.
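As a simple illustration of the kind of monitoring metric the tool could report, a common measure of load imbalance is the maximum per-rank time divided by the mean per-rank time (1.0 is perfectly balanced; larger values mean cores sit idle). The function and log values below are assumptions for illustration, not part of any existing code.

```python
"""Illustrative load-imbalance metric the MMT might compute from
per-rank timing data; the metric choice and sample values are assumed."""


def load_imbalance(rank_times):
    """Return max(rank time) / mean(rank time) across MPI ranks."""
    mean = sum(rank_times) / len(rank_times)
    return max(rank_times) / mean


# e.g. wall-clock seconds spent in the force computation per rank
times = [102.4, 98.7, 110.2, 95.1]
print(round(load_imbalance(times), 3))  # → 1.085
```

Tracking such a metric per pipeline stage over time is one way the MMT could flag runs whose core/memory distribution needs re-benchmarking.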

Client

Contact Person: A/Prof Chris Power
Telephone: 08 6488 7630
Email: [email protected]
Preferred method of contact: In person, but e-mail is fine
Location: Ground Floor of 7 Fairway

Client Unavailability

A/Prof Power will be unavailable August 21-25 and October 2-6; Dr Elahi can cover during these periods.

IP Exploitation Model

The client wishes to use a Creative Commons CC BY-NC model for the IP embodied in the project.