Abstract
With the advent of the multi-core CPU, today's commodity PC clusters are effectively collections of interconnected parallel computers, each with multiple multi-core CPUs and a large shared memory (RAM), linked by a high-speed network. Each computer, referred to as a compute node, is a powerful parallel computer in its own right. Each compute node can further be equipped with acceleration devices such as the GPGPU (general-purpose graphics processing unit) to speed up computationally intensive portions of the simulator. Reservoir simulation methods that can exploit this heterogeneous hardware can solve very large-scale reservoir models and run significantly faster than conventional simulators. Since typical PC clusters are essentially distributed shared-memory computers, mixed-paradigm (distributed plus shared memory) parallelism such as MPI-OMP should work well for both computational efficiency and memory use. In this work, we compare and contrast the single-paradigm programming models, MPI or OMP, with the mixed-paradigm MPI-OMP programming model for a class of solver methods suited to the different modes of parallelism. The results show that the distributed-memory (MPI-only) model has superior multi-compute-node scalability, whereas the shared-memory (OMP-only) model has superior parallel performance on a single compute node. The mixed MPI-OMP model, however, uses memory more efficiently on the multi-core CPU architecture than the MPI-only model.
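The sketch below illustrates the mixed MPI-OMP paradigm described above: MPI ranks own distributed subdomains, while OpenMP threads share each node's memory so per-node data is stored once rather than once per core. It is a minimal, hypothetical example (the subdomain size and the dot-product operation are placeholders), not the paper's solver code.

// Minimal hybrid MPI+OpenMP sketch (illustrative only, not the paper's solver).
// Each MPI rank owns one subdomain; OpenMP threads share that rank's memory.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    int provided;
    // Request thread support so OpenMP threads may coexist with MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int ncells = 100000;                 // hypothetical subdomain size
    std::vector<double> x(ncells, 1.0);

    // Shared-memory (OMP) parallelism inside the distributed-memory (MPI) subdomain.
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < ncells; ++i)
        local_sum += x[i] * x[i];

    // Distributed-memory reduction across compute nodes, e.g. for a dot product
    // inside an iterative linear solver.
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("ranks=%d threads/rank=%d dot=%g\n",
                    nranks, omp_get_max_threads(), global_sum);
    MPI_Finalize();
    return 0;
}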
To exploit the fine-grained shared-memory parallelism available on the GPGPU architecture, algorithms must be suited to single-instruction, multiple-data (SIMD) parallelism, since recursive operations are serialized. Additionally, solver methods and data storage need to be reworked to coalesce memory access and to avoid shared-memory bank conflicts. Wherever possible, the cost of data transfer between the CPU and GPGPU over the PCI Express bus needs to be hidden via asynchronous communication. Recently published data comparing parallel performance for 11 parallel scientific applications indicate that the typical speed-ups obtained on GPUs over CPUs may be only a small fraction of their quoted peak-performance ratios (Table 1). Often, the speed-up achieved on an individual routine or code segment, which may have generated initial optimism, is significantly better than the overall application speed-up. Of course, what can be achieved depends strongly on the algorithms and methods of each application.
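As a concrete illustration of the coalesced-access and transfer-hiding points above, the following CUDA sketch (illustrative only; the kernel, array names, and sizes are not from the paper) launches a simple vector update in which consecutive threads access consecutive elements, and stages the PCI Express transfers with cudaMemcpyAsync on a stream so they can overlap other CPU work.

// Hedged sketch of two GPU considerations mentioned above (not the paper's code):
// (1) coalesced memory access: consecutive threads read consecutive addresses;
// (2) hiding PCIe transfer cost with asynchronous copies on a CUDA stream.
#include <cuda_runtime.h>
#include <cstdio>

// y = a*x + y with a coalesced access pattern: thread i touches element i.
__global__ void axpy(int n, double a, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(double);

    // Pinned (page-locked) host memory is required for truly asynchronous copies.
    double *hx, *hy, *dx, *dy;
    cudaMallocHost((void**)&hx, bytes);
    cudaMallocHost((void**)&hy, bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0; hy[i] = 2.0; }
    cudaMalloc((void**)&dx, bytes);
    cudaMalloc((void**)&dy, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Host-to-device copies and the kernel are queued on the same stream,
    // so the CPU is free to do other work (e.g., assemble the next subdomain).
    cudaMemcpyAsync(dx, hx, bytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(dy, hy, bytes, cudaMemcpyHostToDevice, stream);
    axpy<<<(n + 255) / 256, 256, 0, stream>>>(n, 3.0, dx, dy);
    cudaMemcpyAsync(hy, dy, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);   // wait only when the result is needed
    std::printf("y[0] = %g\n", hy[0]);

    cudaFree(dx); cudaFree(dy);
    cudaFreeHost(hx); cudaFreeHost(hy);
    cudaStreamDestroy(stream);
    return 0;
}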
We applied multi-paradigm parallelism to accelerate compositional reservoir simulation on a GPU-equipped PC cluster. On a dual-CPU, dual-GPU compute node, the parallelized solver running on the dual-GPU Fermi M2090Q achieved up to a 19-fold speed-up over the serial CPU (1-core) results and up to a 3.7-fold speed-up over the parallel dual-CPU X5675 results in the mixed MPI-OMP paradigm for a 1.728-million-cell compositional model. Parallel performance depends strongly on subdomain size. The parallel CPU solve performs better on smaller domain partitions, whereas the GPGPU solve requires large partitions for good parallel performance. This is related to the improved cache efficiency on the CPU for small subdomains and the work load required to sustain massive parallelism on the GPGPU. Therefore, for a given model, multi-node parallel performance decreases for the GPU relative to the CPU as the model is subdivided into smaller subdomains to be solved on more compute nodes. To illustrate this, a modified SPE5 model with various grid dimensions was run to generate comparative results. Parallel performance for three field compositional models of various sizes and dimensions is included to further elucidate and contrast CPU and GPU single-node and multi-node performance. A PC cluster with the Tesla M2070Q GPU and the 6-core Xeon X5675 Westmere CPU was used to produce the majority of the reported results. Another PC cluster with the Tesla M2090Q GPU was available for some cases, and its results for the modified SPE5 problems are reported for comparison.
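For context on the subdomain-size effect, the short sketch below is purely illustrative (the 120 x 120 x 120 grid is our assumption, chosen only because it yields 1.728 million cells): it tabulates how the cell count available to each node shrinks as the model is partitioned across more compute nodes, which is the quantity that limits the work available to keep a GPU busy.

// Illustrative only: per-subdomain cell count as a 1.728-million-cell model
// (assumed here to be 120 x 120 x 120) is split across more compute nodes.
// The GPU needs a large per-subdomain cell count to keep its threads busy,
// while the CPU benefits from the cache reuse of smaller subdomains.
#include <cstdio>

int main() {
    const long total_cells = 120L * 120L * 120L;   // 1,728,000 cells
    for (int nodes = 1; nodes <= 16; nodes *= 2)
        std::printf("nodes=%2d  cells per subdomain=%ld\n",
                    nodes, total_cells / nodes);
    return 0;
}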