Abstract
To realize the potential of the latest high-performance computing (HPC) architectures for reservoir simulation, scalable linear solvers are essential. We describe a parallel Algebraic Multiscale Solver (AMS) for the pressure equation of heterogeneous reservoir models. AMS is a two-level algorithm that employs domain decomposition with a localization assumption. In AMS, basis functions, which are local (subdomain) solutions computed during the setup phase, are used to construct the coarse-scale system and the grid-transfer operators between the fine and coarse levels. The solution phase consists of two stages: global and local. The global stage solves the coarse-scale system and interpolates the solution to the fine grid; the local stage applies a smoother to the fine-scale approximation.
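As a rough illustration of this two-stage cycle, the following is a minimal sketch in C++ on a 1-D Poisson model problem. It is a generic two-level correction scheme, not the paper's AMS: piecewise-constant grid transfer stands in for the multiscale basis functions, a direct tridiagonal solve stands in for the coarse-scale solver, and weighted Jacobi is used as the smoother. All names and parameter values (`apply_A`, `coarse_solve`, `Cr`) are illustrative assumptions.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Fine- and coarse-scale operator: the 1-D Laplacian tridiag(-1, 2, -1)
// with homogeneous Dirichlet boundaries. With piecewise-constant grid
// transfer, the Galerkin coarse operator R*A*P has the same stencil,
// which keeps this sketch short.
static std::vector<double> apply_A(const std::vector<double>& x) {
    const int n = (int)x.size();
    std::vector<double> y(n);
    for (int i = 0; i < n; ++i) {
        const double xl = (i > 0) ? x[i - 1] : 0.0;
        const double xr = (i + 1 < n) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - xl - xr;
    }
    return y;
}

// Direct (Thomas) solve of tridiag(-1, 2, -1) * x = d for the coarse system.
static std::vector<double> coarse_solve(std::vector<double> d) {
    const int n = (int)d.size();
    std::vector<double> diag(n, 2.0), x(n);
    for (int i = 1; i < n; ++i) {                 // forward elimination
        const double w = -1.0 / diag[i - 1];
        diag[i] += w;
        d[i] -= w * d[i - 1];
    }
    x[n - 1] = d[n - 1] / diag[n - 1];            // back substitution
    for (int i = n - 2; i >= 0; --i)
        x[i] = (d[i] + x[i + 1]) / diag[i];
    return x;
}

int main() {
    const int Cr = 8;                  // coarsening ratio: fine cells per coarse cell
    const int nc = 32, n = nc * Cr;    // coarse and fine grid sizes
    std::vector<double> x(n, 0.0), b(n, 1.0);

    for (int cycle = 0; cycle < 25; ++cycle) {
        // Global stage: restrict the residual (summation over each coarse
        // cell), solve the coarse system, and interpolate the correction
        // back to the fine grid (piecewise constant).
        std::vector<double> r = apply_A(x);
        for (int i = 0; i < n; ++i) r[i] = b[i] - r[i];
        std::vector<double> rc(nc, 0.0);
        for (int I = 0; I < nc; ++I)
            for (int j = 0; j < Cr; ++j) rc[I] += r[I * Cr + j];
        const std::vector<double> xc = coarse_solve(rc);
        for (int I = 0; I < nc; ++I)
            for (int j = 0; j < Cr; ++j) x[I * Cr + j] += xc[I];

        // Local stage: a few weighted-Jacobi sweeps (omega = 2/3, D = 2I).
        for (int s = 0; s < 3; ++s) {
            const std::vector<double> Ax = apply_A(x);
            for (int i = 0; i < n; ++i)
                x[i] += (2.0 / 3.0) * (b[i] - Ax[i]) / 2.0;
        }
    }

    // Report the final fine-scale residual norm.
    std::vector<double> r = apply_A(x);
    double nrm = 0.0;
    for (int i = 0; i < n; ++i) nrm += (b[i] - r[i]) * (b[i] - r[i]);
    std::printf("||b - A x|| = %.3e\n", std::sqrt(nrm));
    return 0;
}
```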
The design and implementation of a scalable AMS on multi- and many-core architectures, including the domain decomposition, memory allocation, data flow, and compute kernels, are described in detail. These adaptations are necessary to obtain good scalability on state-of-the-art HPC systems. The specific methods and parameters, such as the coarsening ratio (Cr), the basis-function solver, and the relaxation scheme, have a significant impact on both the asymptotic convergence rate and the parallel computational efficiency.
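The paper's kernels are not reproduced here, but the following sketch shows the general pattern by which a local-stage relaxation sweep exposes cell-level parallelism on a shared-memory node; the CSR layout, function name, and OpenMP scheduling are assumptions, not the authors' implementation.

```cpp
#include <cstdio>
#include <vector>

// One weighted-Jacobi sweep x_new = x_old + omega * D^{-1} (b - A x_old)
// for a sparse matrix in CSR format. Because only x_old is read, every row
// update is independent, so the loop parallelizes cleanly across cores.
void jacobi_sweep(int n,
                  const std::vector<int>& row_ptr,
                  const std::vector<int>& col,
                  const std::vector<double>& val,
                  const std::vector<double>& diag,
                  const std::vector<double>& b,
                  const std::vector<double>& x_old,
                  std::vector<double>& x_new,
                  double omega) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        double Ax = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            Ax += val[k] * x_old[col[k]];
        x_new[i] = x_old[i] + omega * (b[i] - Ax) / diag[i];
    }
}

int main() {
    // Tiny 3-cell example: A = tridiag(-1, 2, -1), b = 1, one sweep from x = 0.
    std::vector<int> row_ptr = {0, 2, 5, 7};
    std::vector<int> col = {0, 1, 0, 1, 2, 1, 2};
    std::vector<double> val = {2, -1, -1, 2, -1, -1, 2};
    std::vector<double> diag = {2, 2, 2}, b = {1, 1, 1};
    std::vector<double> x0(3, 0.0), x1(3, 0.0);
    jacobi_sweep(3, row_ptr, col, val, diag, b, x0, x1, 2.0 / 3.0);
    std::printf("%g %g %g\n", x1[0], x1[1], x1[2]);
    return 0;
}
```

Sweeps of this form stream the matrix from memory once per application and are therefore memory-bandwidth bound, consistent with the single-node scaling limits discussed below.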
The balance between convergence rate and parallel efficiency as a function of the coarsening ratio, Cr, and the local-stage parameters is analyzed in detail. The performance of AMS is demonstrated using heterogeneous 3D reservoir models, including geostatistically generated fields and models derived from SPE10. The problems range in size from several million to 128 million cells. AMS shows excellent behavior for a fixed-size problem as the number of cores increases (so-called strong scaling). Specifically, for a 128-million-cell problem, a nine-fold speedup is obtained on a single-node 12-core shared-memory architecture (dual-socket multi-core Intel® Xeon® E5-2620 v2), and a more than twelve-fold speedup on a single-node 20-core shared-memory architecture (dual-socket multi-core Intel® Xeon® E5-2690 v2). These results are encouraging given the limited memory bandwidth shared by the cores within a single node, which tends to be the major bottleneck for truly scalable solvers. We also compare the robustness and performance of our method with those of the parallel SAMG solver from Fraunhofer SCAI.