Chris Bunch @ UCSB

Multigrid Methods with CUDA

Over the last quarter I have been investigating how to implement a multigrid algorithm with CUDA, a new language aimed at making General Purpose Computation on GPUs (GPGPU) easier. This web page documents that experience.

Update 03/11/08: Clarified some points in response to user comments.
Page Layout
  • Introduction
  • Algorithms and Technologies Used
  • Design Process
  • Results
  • Pros and Cons of CUDA
  • Conclusion

    Introduction

    It is not always possible to solve a partial differential equation (PDE) exactly; in fact, exact solutions are not known for most PDEs. With the advent of computers, however, we can obtain good approximations to these solutions. Multigrid methods are one class of methods used to solve PDEs; compared to other methods, they require fewer iterations and are therefore generally faster. For this project, we have elected to solve a common PDE, the heat equation in three dimensions, which (with thermal diffusivity k) is given by:

        ∂u/∂t = k ∇²u = k (∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z²)

    Implementing multigrid methods is not a new problem, so we chose instead to implement them on a relatively new platform: the Graphics Processing Unit (GPU). This has become possible only recently with the advent of GPGPU languages such as NVIDIA's CUDA (Compute Unified Device Architecture) and ATI's CTM (Close To Metal), among others. Since our lab already has a CUDA-capable NVIDIA GeForce 8800, we used CUDA to implement this project.

    Algorithms and Technologies Used


    CUDA

    CUDA is a C-like language that allows programmers to "easily" port their code to run on the GPU. The language is mostly a superset of C, letting the programmer flag which methods run on the GPU and where variables are stored on the GPU. Although it is presented as a high-level language, low-level parts of the architecture are exposed to the user. As in C, the programmer can allocate and de-allocate memory (via cudaMalloc and cudaFree), but can also choose specifically where on the GPU to place it. The following figure, from the CUDA documentation, shows the various memory locations available to the user:


    Here we can see that memory can be allocated in the graphics card's shared memory, constant cache, texture cache, and device memory. Although device memory is the slowest to access, it has the most space and spares us from worrying about cache coherence in the shared memory, so we use device memory. A critique of the CUDA language is given in the "Pros and Cons of CUDA" section.
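
    As a concrete illustration, the following is a minimal sketch of the allocate / copy / free pattern used with device memory. The grid size and variable names here are made up for the example; this is not the project's actual allocation code.

        // Minimal sketch of the cudaMalloc / cudaMemcpy / cudaFree pattern for
        // device memory.  Names and sizes are illustrative only.
        #include <cuda_runtime.h>
        #include <stdlib.h>
        #include <string.h>

        int main(void)
        {
            const int N = 64;                               // example grid dimension
            const size_t bytes = N * N * N * sizeof(float);

            float* h_grid = (float*)malloc(bytes);          // host copy of the grid
            memset(h_grid, 0, bytes);                       // start at 0 degrees everywhere

            float* d_grid = NULL;                           // device copy of the grid
            cudaMalloc((void**)&d_grid, bytes);             // allocate in device memory
            cudaMemcpy(d_grid, h_grid, bytes,
                       cudaMemcpyHostToDevice);             // upload the initial data

            // ... launch kernels that read and write d_grid here ...

            cudaMemcpy(h_grid, d_grid, bytes,
                       cudaMemcpyDeviceToHost);             // download the result

            cudaFree(d_grid);                               // release device memory
            free(h_grid);
            return 0;
        }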

    Multigrid Methods

    Typical methods for solving PDEs represent the simulated region as a three-dimensional grid (or matrix), which is easy to represent on a computer. Furthermore, discretizing the heat equation on such a grid lets us write it as a linear system of equations. The general and specific representations are as follows:

        A x = b        (the general linear system)
        ∇²u = u_t      (the specific form for the heat equation)

    For our implementation, k is 1, A is the Laplace operator, x is u, and b is Ut. Three simple methods exist for solving this system of equations: The Jacobi Method, the Gauss-Seidel Method, and the Red-Black Gauss-Seidel Method. Our explanation for choosing the Red-Black Gauss-Seidel Method is given later, since this section focuses on Multigrid Methods in general.

    As was previously stated, regular methods use one grid and solve it using any of the aforementioned schemes. Multigrid methods differ in that they also solve many smaller grids and use those solutions to improve the approximation on the full-size grid. This helps because iterating on the large grid alone removes high-frequency error quickly but reduces the smooth, low-frequency error very slowly; the smaller grids eliminate that low-frequency error cheaply. Furthermore, solving the many small grids once and the large grid a small number of times is computationally faster than solving the large grid many times. The methods needed to implement this are described in the design section.

    Red-Black Gauss-Seidel Method

    The simple methods used to solve a linear system of equations are the Jacobi Method, the Gauss-Seidel Method, and the Red-Black Gauss-Seidel Method. As shown in A Multigrid Tutorial, the Red-Black Gauss-Seidel Method converges fastest of the three (it requires the fewest iterations). An added benefit, specific to CUDA, is that this method is easily parallelizable. In the standard Gauss-Seidel method, grid points must be updated one at a time; that works fine for serial computation but is a poor choice for the parallel case. In Red-Black Gauss-Seidel, the odd points are all updated in parallel, and then the even points are all updated in parallel, with each point set to the average of its neighbors.
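
    As an illustration of how this parallelizes, below is a minimal sketch of one red/black sweep over the interior of an N x N x N grid. This is not our MG_CUDA_SGKernel; the kernel name, launch configuration, and boundary handling are simplified assumptions.

        // Illustrative red/black Gauss-Seidel sweep.  'color' is 0 or 1; only points
        // whose coordinate parity matches 'color' are updated, each one being
        // replaced by the average of its six neighbours.
        __global__ void RedBlackSweep(float* u, int N, int color)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            int j = blockIdx.y * blockDim.y + threadIdx.y;
            int k = blockIdx.z * blockDim.z + threadIdx.z;

            // Skip boundary points (boundary handling is simplified away here).
            if (i < 1 || j < 1 || k < 1 || i >= N - 1 || j >= N - 1 || k >= N - 1)
                return;

            // Only touch points of the requested colour.
            if (((i + j + k) & 1) != color)
                return;

            int idx = (k * N + j) * N + i;                  // flatten (i, j, k)

            u[idx] = (u[idx - 1]     + u[idx + 1] +         // x neighbours
                      u[idx - N]     + u[idx + N] +         // y neighbours
                      u[idx - N * N] + u[idx + N * N])      // z neighbours
                     / 6.0f;
        }

    One smoothing iteration would then launch this kernel twice, once with color 0 and once with color 1, so that every point of one colour is updated in parallel before any point of the other colour reads its neighbours.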



    Design Process

    This being my first academic project since reading Code Complete, I decided to develop a prototype first. My C++ prototype simulates a 2D grid; it was mainly used to get a feel for the methods needed in the CUDA version and, as a result, is incomplete. Even so, it mapped over to CUDA well, as follows:

    C++ method → CUDA method(s): purpose

  • Constructor → Initialize / MG_CUDA_AllocateMem: Creates a grid. The C++ version needs a grid size; the CUDA version creates all the subgrids from the given size down to a preset limit, each subgrid half the size of the previous one.
  • Destructor → Close / MG_CUDA_Free: Frees the memory a grid has allocated. The CUDA version uses cudaFree to release all the memory it obtained via cudaMalloc.
  • setupInitialConditions → HeatToggle: The constructor initializes every point of the grid to 0 degrees. The C++ version of this method sets all points in the first row to 100 degrees; the CUDA version sets the innermost region of the cube to 200 degrees, so that we can watch the heat flow through the geometry.
  • restrictGrid → MG_CUDA_RestrictGrid / MG_CUDA_RKernel: Creates a new grid half the size of the original. The C++ version copies points down from the larger grid to the smaller one, while the CUDA version sets each point to a weighted average of the points around it.
  • interpolateGrid → MG_CUDA_InterpolateGrid / MG_CUDA_IKernel: Creates a new grid twice the size of the original. Both versions set each point to a weighted average of the points around it.
  • solveGrid → MG_CUDA_SolveGrid / MG_CUDA_SGKernel: Both versions solve the given grid using the Red-Black Gauss-Seidel Method: the odd-numbered points are updated to a weighted average of their neighbors, and the process is repeated for the even-numbered points. The C++ version runs 50 iterations per call, while the CUDA version runs one.
  • computeResidual → MG_CUDA_ComputeResidual / MG_CUDA_CRKernel: Takes the pointwise difference between two grids and returns a new grid containing that difference.
  • applyCorrection → MG_CUDA_ApplyCorrection / MG_CUDA_ACKernel: Adds each point of one grid to the corresponding point of the given grid and returns a new grid.
  • printGrid → {none}: Used for debugging purposes only, while the visualization scheme was being developed.
  • main → Simulate: Performs one iteration of our multigrid solver, which consists of restricting the grids with data from the previous iteration, solving the smaller grids, and then solving the largest grid (a host-side sketch of this flow is given below).
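
    To show how these pieces fit together, the following is a rough host-side sketch of one Simulate step on a two-level hierarchy. The function names and signatures are hypothetical stand-ins for the CUDA methods listed above (their real argument lists are not reproduced here), so treat this only as the order of operations.

        // Hypothetical outline of one Simulate step; the declarations below are
        // stand-ins for the project's real MG_CUDA_* methods.
        void SolveGrid(float* d_grid, int n);                             // red/black sweeps
        void RestrictGrid(float* d_coarse, const float* d_fine, int nFine);
        void InterpolateGrid(float* d_fine, const float* d_coarse, int nFine);
        void ApplyCorrection(float* d_grid, const float* d_corr, int n);

        void SimulateStep(float* d_fine, float* d_coarse, float* d_corr, int N)
        {
            // Carry data from the previous iteration down to the coarser grid.
            RestrictGrid(d_coarse, d_fine, N);

            // The small grid is cheap to work on, so it can be smoothed heavily.
            SolveGrid(d_coarse, N / 2);

            // Bring the coarse result back up and fold it into the fine grid.
            InterpolateGrid(d_corr, d_coarse, N);
            ApplyCorrection(d_fine, d_corr, N);

            // Finish with a sweep on the full-size grid.
            SolveGrid(d_fine, N);
        }

    In the full implementation the restriction and coarse solves repeat down the whole chain of subgrids created at initialization; the two-level version above is only meant to show the flow of one step.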

    The code for the multigrid prototype can be downloaded here.

    Results

    For timing purposes, we consider each execution of the 'Simulate' method to be one step, and we measure how many steps can be performed per second. We test this for both the CUDA and C multigrid implementations on grids of size 16 x 16 x 16 through 256 x 256 x 256. We attempted to go larger (512 x 512 x 512), but the CPU's memory allocation function (malloc) failed and could not give us the 512 MB we needed for each of the four arrays we manipulate on the CPU (512^3 points at 4 bytes per float is 512 MB; that this matches the grid dimension 512 is coincidental). That being said, the speedup we achieve is as follows:


    For the purposes of this experiment, the % difference is also the speedup between the CUDA and C versions of the multigrid method.

    What is interesting to note is that the CPU version is actually faster on the grid of dimension 64, and that the GPU is not as dramatically faster as we have seen in other CUDA applications. The reason for this is two-fold. First, we were unable to resolve issues getting the Red-Black Gauss-Seidel method to work on our largest grid, so that part had to be done on the CPU; second, the integer modulo operator we use to separate the even and odd points for Red-Black Gauss-Seidel runs very fast on the CPU and very slowly on the GPU. (Update 03/11/08: Added emphasis that this project is not entirely GPU-based. Furthermore, although we have been told by Brent that integer modulo is slow, other CUDA groups report that it is fast, so we are uncertain.) In addition, the grid is stored as a one-dimensional array in memory, and to recover the three-dimensional structure each point has to perform a swizzling computation that also involves integer modulo operations (a sketch of this computation is given after the video discussion below). Add in the fact that we do this roughly 16 million times per step (once per point, for the 256 grid) and you can see why our speedup was not as good as it could have been. A low-resolution video of our CUDA multigrid implementation has been recorded along with a high-resolution screen capture. Both are provided here, in that order:

    Unfortunately, the video recording software we used does not capture the high frame rates of our program very well, so the recording lags much more than the actual program does. Our version is also more vibrant in color, as the screenshot on the right shows. We use a simple 3D rotation technique, based on which axes the user is manipulating, to rotate the cube, and we provide mouse combinations that let the user zoom in and out and translate around the cube. These features are all demonstrated in the above video. Since we refactored Brent Oster's C code to produce our CUDA version, we kept the interaction mechanisms he implemented intact. We added the ability to switch rendering on and off so that we could accurately measure how many steps are computed per second, since rendering reduces that number. (Update 03/11/08: We note that we made little attempt to reduce the impact of rendering, as we only timed with rendering off.)
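
    To make the indexing cost described above concrete, the swizzling is essentially the divide-and-modulo computation sketched below; the kernel and variable names are illustrative, not our actual code.

        // Illustrative recovery of (i, j, k) from a flat 1D thread index.  For a
        // 256 x 256 x 256 grid this runs roughly 16 million times per step.
        __global__ void SwizzleExample(float* grid, int N)
        {
            int idx = blockIdx.x * blockDim.x + threadIdx.x;
            if (idx >= N * N * N)
                return;

            int i = idx % N;            // x coordinate (integer modulo)
            int j = (idx / N) % N;      // y coordinate (divide, then modulo)
            int k = idx / (N * N);      // z coordinate (integer divide)

            // The red/black parity test costs yet another modulo-style operation.
            if (((i + j + k) % 2) == 1)
                grid[idx] += 0.0f;      // placeholder for the real update
        }

    Since our grid dimensions are powers of two, the modulo and divide operations could in principle be replaced with bitwise AND and shift operations, which would be one obvious way to sidestep the integer-modulo cost if it really is the bottleneck.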


    On December 14, 2007, this material was presented to our 'GPGPU and 3D User Interfaces' class. A PDF of the presentation can be found here. Additionally, the code for this project can be downloaded here. The code used for this project consists of four files: NC_Multigrid.cpp, NC_Multigrid.h, NanoCAD2.cpp, and the CUDA source code, Multigridcu.cu. They can be found at "Projects/NanoCAD2/Modules/", with the exception of the driver file, which is at "Projects/NanoCAD2/NanoCAD2.cpp".

    Pros and Cons of CUDA

    CUDA provides a novel way for the programmer to use the graphics hardware to its fullest. It is both a high-level and a low-level language, but because it is relatively new, it is not without its downsides.

    Pros:
  • CUDA abstracts as much of the hardware away as you want it to. Although to really get the best performance you need to dive under the hood and program for the exact hardware you're on, you can still get a decent speedup without worrying about it too much. The future of parallel programming languages will rely on their ease of use and how much they can abstract away from the programmer.

  • CUDA is an important step forward in parallel computing. The design choices made by the CUDA team will shape how future languages deal with heterogeneous processing systems (here a CPU and a GPU).

  • It's really cool to launch 16 million threads and have them do something actually useful. Using fewer than 10 threads at a time in undergraduate classes gives you the impression that you can do everything you'd ever want to with just one or two threads. But it's something else to launch a thread for each point in your grid; with a 256 x 256 x 256 grid, launching 16 million threads gives computing a whole different spin.

  • There is a lot of potential behind CUDA. Making supercomputing available to the general public will make it much easier to teach new programmers about how to program parallel machines. There's also lots of work to be done resolving the downsides of CUDA, which are:

    Cons: (Clarified on 03/11/08 in response to user feedback)
  • Developing in CUDA on Windows with Visual Studio 2005, the supported environment for CUDA, is unreasonably difficult. Getting started with CUDA was definitely the hardest part of this entire project: it took me no less than two weeks to create a new project and get it working. That said, I'm no newcomer to Windows (I've used it almost my whole life) or Visual Studio 2005 (I've used it to develop C# applications). However, CUDA is something entirely different. There is a "custom build rule" for CUDA that takes a lot of the work out of setting up Visual Studio for use with CUDA, but even after following the directions on using it, it still wouldn't work. Even taking the sample projects that came with CUDA and trying to refactor them only worked for a short while, although this may have had to do with...

  • Remote Desktop Connection's interaction with Windows makes it unusable. I know, this isn't a CUDA problem, as RDC isn't supported by CUDA, but there's no reason it should be any different from using Visual Studio with other languages. RDC didn't seem to consistently convert my newlines on Mac (Line Feed) to Windows (Carriage Return + Line Feed), so Visual Studio thought my code was only one line long. It didn't stop the code from working, but it's incredibly annoying to have to deal with those error messages about having a problem somewhere on your one line of code and having the compiler point at the beginning of your code as the problem. However, since the same code works fine with the right newlines, I'd be inclined to say the CUDA compiler (nvcc) doesn't seem to like the mix of Line Feeds and Carriage Returns that RDC is putting in there. So I had to ditch RDC and sit in front of the computer I was developing on (as I was having issues getting VNC to work). I was having problems getting emulation mode to work, even after copying all the emulation mode flags the compiler needs, so I had to use the regular mode. That's not a problem, however, until you see what happens when you use too many blocks in regular mode:

    This wouldn't be so bad except that it happens almost every time something goes wrong, not just with the blocks. In my last four hours of programming in CUDA, this happened to me FIVE times. Future releases of CUDA NEED to fix this for CUDA to become even close to viable as a good programming language. My computer shouldn't die just because I used device memory that I had allocated and set to zero but forgot to copy the host's version over to. That leads into the next issue...

  • CUDA inconsistently throws errors / crashes. Usually, after I see the blue screen of death, I tweak the code and try to compile it, only to see one of these two screens:


    Trying to recompile the code gives no errors and the code runs fine. So what's the deal? I can't even get the blue screen of death consistently! Sometimes Visual Studio gives me a cudaError_enum, sometimes the computer locks up, and sometimes it gives me the blue screen. Why can't it just be consistent, and consistently the first result?

  • GPU methods cannot make true function calls to other GPU methods (device functions are inlined at compile time), and consequently recursion is not allowed. This is an unfortunate side-effect, and it has forced GPGPU developers to reimplement the well-known recursive algorithms (e.g., quicksort) iteratively or to find iterative algorithms that suit their purposes better (e.g., bitonic merge sort for the GPU). At this time, it is unclear whether this limitation is CUDA-specific or inherent to the GPU architecture.

  • The documentation, while long, fails to mention important macros you can call to improve the quality of your code. If you read the documentation alone, you'd have no idea there are macros you can wrap around your cudaMalloc and cudaFree calls to make sure they succeeded or else kill your program. They are only two extra macros to document, and perhaps two of the most useful macros in CUDA. I can't tell you for sure, because I've only read the documentation... (a sketch of this kind of wrapper is given after this list).

  • CUDA will encourage poor software engineering practices. This could be the greatest offender, depending on your programming skills. Structs aren't allowed as of CUDA 1.0, forcing the programmer to write methods that pass a huge amount of data around; my typical method needed to throw around 6 or 7 variables when one or two well-built structs would have been fine. Furthermore, without structs or objects (since it's C and not C++), you'll find yourself violating the DRY (Don't Repeat Yourself) principle constantly. If you're not familiar with the DRY principle, you may think this isn't a big deal, since you're just repeating yourself; but it makes your code harder to maintain, even for you as the person who knows the code, and with CUDA being a new language, there's no reason not to have a construct that packages up data. Perhaps, like the recursion limitation, it has to do with the architecture... (Update 03/11/08: Other CUDA groups have claimed that they are able to use structs in their CUDA code. We were not, and do not have our CUDA machine available to verify this on, so we remain unresolved about the status of this.)
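
    As an example of the kind of wrapper the documentation bullet above refers to, the sketch below checks the return code of a CUDA runtime call and aborts with a readable message on failure. This is our own illustrative version, not necessarily the exact macro shipped with the CUDA SDK.

        // Illustrative error-checking wrapper for CUDA runtime calls.
        #include <cuda_runtime.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define CUDA_CHECK(call)                                                 \
            do {                                                                 \
                cudaError_t err = (call);                                        \
                if (err != cudaSuccess) {                                        \
                    fprintf(stderr, "CUDA error at %s:%d: %s\n",                 \
                            __FILE__, __LINE__, cudaGetErrorString(err));        \
                    exit(EXIT_FAILURE);                                          \
                }                                                                \
            } while (0)

        // Usage: wrap any call whose failure should stop the program, e.g.
        //   CUDA_CHECK(cudaMalloc((void**)&d_grid, bytes));
        //   CUDA_CHECK(cudaFree(d_grid));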

    Conclusion

    Whew! What a long review! It's been a great quarter full of new opportunities, and thanks to all who let me use what I needed to get this done. Special thanks go out to Tobias Hollerer and Brent Oster for the use of the hardware and software, respectively, that was used or refactored in some way for this project. CUDA certainly has the potential to go far, but there are definitely some huge kinks it needs to work out before it's ready for the general public.

    Chris Bunch

    Last Modification on 03/11/08
    First Revision Complete 12/12/07
