High-performance computing is at the heart of digital technology which allows to simulate complex physical phenomena. The current trend for hardware architectures is toward heterogeneous systems with multi-core CPUs accelerated by GPUs to get high computing power. The demand for fast solution of Geoscience simulations coupled with new computing architectures drives the need for challenging parallel algorithms. Such applications based on partial differential equations, requires to solve large and sparse linear system of equations. This work makes a step further in Matrix Powers Kernel (MPK) which is a crucial kernel in solving sparse linear systems using communication-avoiding methods. This class of methods deals with the degradation of performances observed beyond several nodes by decreasing the gap between the time necessary to perform the computations and the time needed to communicate the results. The proposed work consists of a new formulation for distributed MPK kernels for the cluster of GPUs where the pipeline communications could be overlapped by the computation. Also, appropriate data reorganization decreases the memory traffic between processors and accelerators and improves performance. The proposed structure is based on the separation of local and external components with different layers of interface nodes-due to the MPK algorithm-. The data is restructured in a way where all the data required by the neighbor process comes contiguously at the end, after the local one. Thanks to an assembly step, the contents of the messages for each neighbor are determined. Such data structure has a major impact on the efficiency of the solution, since it permits to design an appropriate communication scheme where the computation with local data can occur on the GPUs and the external ones on the CPUs. Moreover, it permits more efficient inter-process communication by an effective overlap of the communication by the computation in the asynchronous pipeline way. We validate our design through the test cases with different block matrices obtained from different reservoir simulations : fractured reservoir dual-medium, black-oil two phase-flow, and three phase-flow models. The experimental results demonstrate the performance of the proposed approach compared to state of the art. The proposed MPK running on several nodes of the GPU cluster provides a significant performance gain over equivalent Sparse Matrix Vector product (SpMV) which is already optimized and provides better scalability.

You can access this article if you purchase or spend a download.