Experiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems
Ashay Rane and Dan Stanzione, Ph.D.
{ashay.rane, dstanzi}@asu.edu
Fulton High Performance Computing Initiative, Arizona State University
Abstract
The hybrid method of parallelization (using MPI for inter-node communication and OpenMP for intra-node communication) seems a natural fit for the way most clusters are built today. It is generally expected to help programs run faster due to factors like the availability of greater bandwidth for intra-node communication. However, optimizing hybrid applications for maximum speedup is difficult, primarily because OpenMP constructs provide inadequate transparency and because the resulting speedup depends on the combination in which MPI and OpenMP are used. In this paper we describe some of our experiences in trying to optimize applications built using MPI and OpenMP. More specifically, we discuss the different techniques that could be helpful to other researchers working on hybrid applications. To demonstrate the usefulness of these optimizations, we provide results from optimizing a few typical scientific applications. Using these optimizations, one hybrid code ran up to 34% faster than pure-MPI code.
1 Introduction
MPI (the Message Passing Interface) and OpenMP are the de facto standards for writing parallel programs. MPI provides an explicit messaging model with no assumption of shared memory, and as such is the standard for use on distributed memory systems. OpenMP provides a threading model, with implicit communication and the assumption of shared memory, and therefore is often used on shared memory systems. Hybrid programs are those that use both MPI and OpenMP: MPI for communication between nodes and OpenMP for communication within a single node. For the last several years, most clusters have been composed of a collection of multi-core nodes, with shared memory at the node level and distributed memory between nodes. This appears to fit well with the hybrid model. The other, more compelling reason for using both MPI and OpenMP is that communication by means of shared memory is known to support much greater bandwidth than communication using messages [1]. There have been efforts [2] to make MPI leverage the shared memory architecture on a node. However, in our experience, hand-tuning using OpenMP gives greater benefits than relying on the techniques in the MPI distribution alone. Similarly, efforts like those extending OpenMP to clusters [3] failed to give performance equivalent to that of hand-tuned MPI+OpenMP code...
