SUMMARY

How does the performance of Cell, field-programmable gate array (FPGA), and multi-core computers compare for finite-difference modeling of the acoustic wave equation? In this paper I answer this question by assessing implementations on each of these architectures. Results show that the FPGA, quad-core, and Cell machines sustained, on average, 7.49, 5.01, and 3.74 GFLOPs, respectively, for 2D finite-difference simulations. Optimized multi-core implementations show that the 'no free lunch' principle applies aptly to 'accelerated' computing platforms: once the conventional CPU code is properly tuned, the accelerators' advantage shrinks considerably.

INTRODUCTION

The constant-density scalar wave equation was solved with a high-order finite-difference scheme to demonstrate the utility of these different technologies for seismic processing. A second-order central difference approximates the time derivative, and the Laplacian is computed with cascaded eighth-order first-derivative approximations (the resulting update is sketched at the end of this section). Cascading first derivatives, rather than applying a single second-derivative stencil, facilitates migration to a variable-density scheme in the future.

Two benchmarks, described below, were run on the FPGA, Cell, and multi-core Opteron machines. The starting point for benchmarking and testing is a straightforward implementation of the numerical scheme written in ANSI C and compiled with the PathScale compiler; single-precision floating-point arithmetic was used for all experiments. As a test of correctness, the impulse response of a homogeneous medium with an acoustic velocity of 1500 m/s was calculated on each platform for a grid of 1536x640 points with a uniform spacing of 5 m. A 20 Hz Ricker wavelet was injected as a source just below the free surface, at the horizontal center of the model, and the simulations were run for 1,000 timesteps covering 1.5 s of wave propagation. The impulse response on each platform had to match that of the baseline, up to the least-significant-bit variations that occur when performing floating-point arithmetic on different machines. The baseline impulse response is shown in Figure 1.

A subset of the SMAART JV salt model was used in the subsequent experiments. For each architecture, a 4.2 s simulation (6,576 timesteps) was run on a 1024x1024 point subset of the velocity model pictured in Figure 2(a), which has a uniform grid spacing of 7.62 m. The same source was used as in the previous experiment, again centered horizontally just below the free surface. The final pressure field was compared for each architecture against the baseline; the baseline snapshot is shown in Figure 2(b).
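Concretely, with p^n denoting the pressure field at timestep n, v the acoustic velocity, and \Delta t the timestep (notation chosen here for illustration), the explicit update implied by this discretization is

    p^{n+1} = 2\,p^{n} - p^{n-1} + v^{2} \Delta t^{2}\, \nabla^{2} p^{n},

where \nabla^{2} p^{n} is formed by applying an eighth-order centered first-derivative stencil twice along each axis,

    \left.\frac{\partial p}{\partial x}\right|_{i} \approx \frac{1}{h} \sum_{k=1}^{4} c_{k} \left( p_{i+k} - p_{i-k} \right), \qquad (c_{1},\dots,c_{4}) = \left( \tfrac{4}{5},\, -\tfrac{1}{5},\, \tfrac{4}{105},\, -\tfrac{1}{280} \right),

with h the grid spacing. The coefficients shown are the standard eighth-order centered values, given here for illustration.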

IMPLEMENTATIONS OF FD MODELING

Several, often overlapping, techniques were used to optimize the finite-difference solver on each of the target architectures. The machines used for benchmarking and the specific optimization techniques are described below for the multi-core Opteron, FPGA, and Cell architectures. The benchmarking results are discussed after all the platforms have been introduced.

Multi-core CPU optimizations

Given the effort required to port the simulation code to the Cell and FPGA architectures, it seemed only natural to explore the hidden potential of the workstation used for baseline timing. The system contains two 2.2 GHz dual-core AMD Opteron processors (four cores in total) with 4 GB of main memory.
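As a point of reference, the following is a minimal sketch of the kind of straightforward ANSI C kernel used as the baseline, with an OpenMP pragma added as one obvious way to spread the stencil work across the four Opteron cores. The function names, array layout, and pragma are illustrative assumptions rather than the paper's code.

    /* Sketch only: names, layout, and the OpenMP pragmas are
     * illustrative assumptions, not the paper's implementation.
     * Standard eighth-order centered first-derivative coefficients. */
    static const float c[4] = { 4.0f/5.0f, -1.0f/5.0f, 4.0f/105.0f, -1.0f/280.0f };

    /* Eighth-order first derivative along x at the interior points of an
     * nx-by-nz grid stored row-major as p[ix*nz + iz]; h is the grid
     * spacing. Applying this operator twice per axis gives the cascaded
     * Laplacian described above. */
    void deriv_x(const float *p, float *d, int nx, int nz, float h)
    {
    #pragma omp parallel for
        for (int ix = 4; ix < nx - 4; ix++)
            for (int iz = 0; iz < nz; iz++) {
                float s = 0.0f;
                for (int k = 1; k <= 4; k++)
                    s += c[k - 1] * (p[(ix + k)*nz + iz] - p[(ix - k)*nz + iz]);
                d[ix*nz + iz] = s / h;
            }
    }

    /* Second-order leapfrog update over n grid points: pm holds the
     * previous field on entry and the next field on exit; pn is the
     * current field, lap the cascaded Laplacian, vel the velocity model. */
    void step(float *pm, const float *pn, const float *lap,
              const float *vel, int n, float dt)
    {
    #pragma omp parallel for
        for (int i = 0; i < n; i++)
            pm[i] = 2.0f*pn[i] - pm[i] + vel[i]*vel[i]*dt*dt*lap[i];
    }

Cache blocking and SIMD vectorization of the inner stencil loop are natural next steps for such a kernel on the Opteron's memory hierarchy.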
