A Novel FDTD Application Featuring OpenMP-MPI Hybrid Parallelization

Reviewed by Orhun Birsoy
Here, the article "A Novel FDTD Application Featuring OpenMP-MPI Hybrid Parallelization" by Mehmet F. Su, Ihab El-Kady, David A. Bader, and Shawn-Yu Lin is discussed as an example of the use of parallel computing. The aim of the article is to compute the optical characteristics of a novel light source built from a new type of material (photonic crystals) using a Finite Difference Time Domain (FDTD) algorithm. Photonic crystals affect the motion of photons much as semiconductors affect the motion of electrons. Such crystals have numerous applications in lasers, LEDs, antennas, solar cells, etc.

FDTD is a well-known computational method for electrodynamics modeling. Its main advantages are its simplicity and ease of implementation in software. Its biggest disadvantages are the fine mesh and small time steps it requires, which result in high memory usage and long solution times. This makes it a natural candidate for parallelization through distributed computing. Although MPI is already used to solve such FDTD problems, combining OpenMP with MPI can further improve efficiency.
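To make the "simple but expensive" trade-off concrete, here is a minimal one-dimensional sketch of the FDTD leapfrog update in normalized units. It is not the code from the article, which solves a 3D problem; the function name `fdtd_1d`, the Gaussian source, and all parameter values are assumptions chosen for illustration. The `courant` parameter (c*dt/dx) shows how the time step is tied to the mesh spacing: refining the mesh forces smaller time steps as well.

```python
import math


def fdtd_1d(n_cells=200, n_steps=100, courant=0.5):
    """Leapfrog update of Ez and Hy on a 1D staggered (Yee) grid.

    Normalized units; courant = c*dt/dx must not exceed 1 in 1D for stability.
    All names and values are illustrative assumptions, not from the article.
    """
    ez = [0.0] * n_cells
    hy = [0.0] * n_cells  # hy[i] sits between ez[i] and ez[i + 1]
    src = n_cells // 2    # hypothetical source location
    for t in range(n_steps):
        # Update magnetic field from the spatial difference of E.
        for i in range(n_cells - 1):
            hy[i] += courant * (ez[i + 1] - ez[i])
        # Update electric field from the spatial difference of H.
        # ez[0] and ez[-1] are never updated, so they act as reflecting walls.
        for i in range(1, n_cells - 1):
            ez[i] += courant * (hy[i] - hy[i - 1])
        # Additive Gaussian pulse source (assumed shape).
        ez[src] += math.exp(-((t - 30.0) / 10.0) ** 2)
    return ez, hy


ez, hy = fdtd_1d()
```

Each time step touches every cell, so the cost grows with the number of cells times the number of steps. In 3D the cell count grows with the cube of the resolution, which is where the memory and run-time pressure described above comes from.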

Su et al. use an SGI Origin 2000, a cache-coherent, non-uniform memory access (ccNUMA) distributed shared memory platform featuring 350 MHz MIPS R12000 processors.

They implemented the algorithm as an OpenMP and MPI hybrid. The MPI-only algorithm is presented in Algorithm 1 below. In the hybrid version, some parts of the algorithm are threaded using OpenMP, so each MPI process can have several threads managed by the OpenMP library. The hybrid algorithm is presented in Algorithm 2. The main focus is on optimizing the OpenMP part of the algorithm, since the MPI-only FDTD solution is a well-known subject. To achieve better performance, the authors used several profiling and optimization tools provided with their system; unfortunately, the article does not name these tools. The tools were used to measure cache misses, memory bandwidth usage, memory access times, and L1 and L2 data cache hit rates. These measurements were then used to manually optimize the OpenMP implementation. Since the OpenMP specification targets only Fortran and C/C++, one of these languages must have been used.

Algorithm: MPI-FDTD
Do one time initialization work;
Initialize fields, apply initial conditions;
for t = 1 to tmax do
    for i, j, k = 1 to imax, jmax, kmax do
        Using MPI message passing, exchange magnetic fields with neighbors;
        Update electric fields using magnetic fields;
        Using MPI message passing, exchange updated electric fields with neighbors;
        Update magnetic fields using updated electric fields;
        Update fields at boundaries, apply boundary conditions;
    end
end
Algorithm 1: MPI parallelized FDTD algorithm.

Algorithm: Hybrid-FDTD
Do one time initialization work;
Using OpenMP multithreading, initialize fields, apply initial conditions;
for t = 1 to tmax do
    for i, j, k = 1 to imax, jmax, kmax do
        Using MPI message passing, exchange magnetic fields with neighbors;
        Using OpenMP multithreading, update electric fields using magnetic fields;
        Using MPI message passing, exchange updated electric fields with neighbors;
        Using OpenMP multithreading, update magnetic fields using updated electric fields;
        Using OpenMP multithreading, update fields at boundaries, apply boundary conditions;
    end
end
Algorithm 2: MPI-OpenMP hybrid parallelized FDTD algorithm.
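The exchange-then-update pattern in both algorithms can be sketched without MPI at all. The following pure-Python stand-in is an assumption-laden illustration, not the authors' implementation: it uses a 1D grid instead of 3D, plain lists in place of MPI ranks, and list reads in place of message passing. All function names are hypothetical. It splits the grid into slabs, passes one boundary value per neighbor before each update, and produces the same field as a single undivided grid.

```python
import math

C = 0.5  # Courant number c*dt/dx (assumed value)


def source(t):
    # Hypothetical Gaussian pulse used as excitation.
    return math.exp(-((t - 30.0) / 10.0) ** 2)


def run_single(n, steps, src):
    """Reference: one undivided grid, E update followed by H update."""
    ez, hy = [0.0] * n, [0.0] * n
    for t in range(steps):
        for i in range(1, n - 1):
            ez[i] += C * (hy[i] - hy[i - 1])
        ez[src] += source(t)
        for i in range(n - 1):
            hy[i] += C * (ez[i + 1] - ez[i])
    return ez


def run_decomposed(n, steps, src, n_ranks):
    """Same scheme on n_ranks slabs; assumes n is divisible by n_ranks."""
    size = n // n_ranks
    ez = [[0.0] * size for _ in range(n_ranks)]
    hy = [[0.0] * size for _ in range(n_ranks)]
    for t in range(steps):
        # "Exchange magnetic fields": each slab receives the last H value
        # of its left neighbor (a real code would use MPI send/receive).
        hy_ghost = [hy[r - 1][-1] if r > 0 else 0.0 for r in range(n_ranks)]
        for r in range(n_ranks):
            # This inner loop is the part a hybrid code would thread.
            for j in range(size):
                g = r * size + j  # global cell index
                if g == 0 or g == n - 1:
                    continue  # outer walls stay at zero
                left = hy[r][j - 1] if j > 0 else hy_ghost[r]
                ez[r][j] += C * (hy[r][j] - left)
            if r * size <= src < (r + 1) * size:
                ez[r][src - r * size] += source(t)
        # "Exchange updated electric fields": each slab receives the first
        # E value of its right neighbor.
        ez_ghost = [ez[r + 1][0] if r < n_ranks - 1 else 0.0
                    for r in range(n_ranks)]
        for r in range(n_ranks):
            for j in range(size):
                if r * size + j == n - 1:
                    continue
                right = ez[r][j + 1] if j < size - 1 else ez_ghost[r]
                hy[r][j] += C * (right - ez[r][j])
    return [v for slab in ez for v in slab]


reference = run_single(120, 200, 60)
decomposed = run_decomposed(120, 200, 60, 4)
max_diff = max(abs(a - b) for a, b in zip(reference, decomposed))
```

Only one value per slab boundary crosses between neighbors in each exchange, while the updates inside a slab are independent of each other. That independence is what makes the inner loops suitable for shared-memory threading in the hybrid scheme.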

Although the article refers to a calculation performed with 24 processors, it reports results only for runs with 8 processors. These results compare two OpenMP implementations. The first (v1) is a naive OpenMP implementation that depends almost solely on the optimizations performed by the compiler. The second (v2) applies several manual optimizations, guided by the available tools, to improve cache usage and reduce the memory footprint. Two figures compare the manual optimizations with the standard automatic compiler optimizations: Figure 1 shows the total running times, and Figure 2 shows the speedups for the two optimization methods.

Figure 1 - Execution times on the SGI Origin 2000.
Figure 2 - Speedups on the SGI Origin 2000.

These results show that OpenMP provides good speedup even with standard compiler optimization alone: the execution time drops from over 3 hours to less than an hour, a speedup of about 4.8 on 8 processors. Manual optimization improved the efficiency slightly further.
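As a quick check on these numbers, the standard definition of parallel efficiency (speedup divided by processor count) can be applied to the reported figures. The helper names below are hypothetical, and the 3-hour baseline is only the lower bound implied by "over 3 hours", not a measured value from the article.

```python
def parallel_efficiency(speedup, n_procs):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / n_procs


def parallel_time(serial_time, speedup):
    """Run time implied by a given serial time and speedup."""
    return serial_time / speedup


eff = parallel_efficiency(4.8, 8)       # about 0.6
hours = parallel_time(3.0, 4.8)         # about 0.625 hours
print(f"efficiency: {eff:.2f}, implied run time: {hours * 60:.1f} minutes")
```

A speedup of 4.8 on 8 processors corresponds to roughly 60% efficiency, and a 3-hour serial run would shrink to about 37.5 minutes, consistent with the "less than an hour" figure.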

The figures were generated for a problem with a mesh of 188x188x78 grid points and 3000 time steps.
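A rough sense of scale for this problem can be worked out from the grid dimensions. The sketch below assumes six field components (three electric, three magnetic) stored as 8-byte floating-point values; the article as summarized here does not state the precision or the storage layout, so the memory figure is only a lower-bound estimate that ignores material coefficients and boundary data.

```python
nx, ny, nz = 188, 188, 78    # grid points, from the article
n_steps = 3000               # time steps, from the article

FIELD_COMPONENTS = 6         # assumption: Ex, Ey, Ez, Hx, Hy, Hz
BYTES_PER_VALUE = 8          # assumption: double precision

cells = nx * ny * nz
cell_updates = cells * n_steps
field_bytes = cells * FIELD_COMPONENTS * BYTES_PER_VALUE

print(f"cells: {cells:,}")
print(f"cell updates over the run: {cell_updates:,}")
print(f"field storage: {field_bytes / 2**20:.1f} MiB")
```

Under these assumptions the mesh holds about 2.76 million cells, the run performs about 8.3 billion cell updates, and the fields alone occupy roughly 126 MiB.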

I found this article especially interesting because it uses OpenMP.

In conclusion, the article tries to combine the strengths of OpenMP and MPI to achieve shorter run times than are possible with an MPI-only algorithm. However, it does not provide any comparison between an MPI-only solution and the OpenMP-MPI hybrid solution.

References:
  1. A Novel FDTD Application Featuring OpenMP-MPI Hybrid Parallelization, Mehmet F. Su, Ihab El-Kady, David A. Bader, Shawn-Yu Lin, 2004, Technical Report (http://www.cc.gatech.edu/~bader/papers/FDTD.html)
  2. Photonic bandgap materials and photonic crystals
  3. Finite-difference time-domain method