MapScale: A Cloud Environment for Scientific Computing
Chris Bunch (CS), Brian Drawert (CSE), Matthew Norman (CS)
As we have shown in our first homework assignment, the MapReduce framework presents an novel opportunity to parallelize scientific computing applications. We thus aim at implementing each of the eight benchmarks that make up the NAS Parallel Benchmarks with the MapReduce framework. After this has been accomplished, we will evaluate their performance for varying numbers of nodes as provided to us by our underlying cloud infrastructure (Eucalyptus) and platform (AppScale). If time permits, we will also evaluate the performance of our system on Amazon's Elastic Compute Cloud and note the performance differences between the two infrastructures. We are also interested in implementing some of the less difficult benchmarks across several languages and runtimes in order to explore performance differences across this spectrum.
Progress Report (as of 5/18): We have distributed the eight NAS Parallel Benchmarks amongst the three of us in the following fashion:
Chris Bunch
- Embarrassingly Parallel: Generate a large number of random numbers via the Marsaglia polar method. Number generation is done in the mapper, and implementations in various languages will be explored for possible performance differences. Currently code has been constructed in C, Java, and Ruby. Further languages we wish to explore are Clojure and R. Note that no reducer is needed since all the work is done in the mapper. This also means we can use an "identity mapper" that simply outputs what it receives. A highly optimized identity mapper is provided by the underlying MapReduce framework, and we seek to explore what the differences between the two methods entail.
Integer Sort: Sort a large number of integers via bucket sort. Research on bucket sort has revealed that the algorithm relies on the input set being a uniform random distribution of numbers between zero and one, so we are currently exploring how to quickly normalize our input data (integers) to be between zero and one and denormalize them. As was the case in the embarassingly parallel benchmark, we wish to explore implementations of this in several languages and compare the performance of each. We currently have delayed this benchmark until we are done with the embarassingly parallel benchmark, since the random numbers that benchmark generates can be used as input to this benchmark (although the numbers generated are not uniformly distributed, research has indicated that as long as they are "close enough" then this does not impact the asymptotic running time). Finally, since the MapReduce framework performs an internal sorting phase on its keys in order to ensure every reducer receives all the values with the same key, we also wish to compare this highly optimized sorting routine with our own.
Brian Drawert
Block Tridiagonal, Scalar Pentadiagonal, and Lower-Upper symmetric Gauss-Seidel: All three of these solve a system of nonlinear PDEs using different methods, so it is believed that a common skeleton can be used to construct all three with the exception of the solver itself. Currently investigating how this can be optimally constructed and researching the differences between the three different solvers.
Matthew Norman
- Conjugate Gradient: Currently are in the process of converting MPI code for conjugate gradient from homework 3 into code that uses the MapReduce paradigm. Suspected methods of parallelism are clearly MatVec, DAXPY, and DDOT. Papers previously covered in researching this product have provided mappers and reducers for MatMul and DDOT, so it is clear that this algorithm is highly amenable to MapReduce's semantics.
- Multigrid: A serial reference implementation is available that was parallelized on the GPU for a previous class. Once Conjugate Gradient is completed, work will begin on converting this serial implementation to one using MapReduce. Initial prospects show that we can trivially perform the Red-Black Gauss-Seidel elimination on smaller grids in MapReduce, but that this may not be enough computation to amortize the communication incurred. Various size grids will be used to see how much performance we can gain (sample grid sizes are provided in the NAS Parallel Benchmark specifications for this problem).
Fast Fourier Transform: Delayed until previous benchmarks are done, as more information is required as to where a reference implementation of this can be found and what methods are core to performing FFTs.
The writeup for the paper has also been started, since the introduction and other informative sections on the technologies involved can be completed without having finished our performance analysis and optimizations. We are certainly open to suggestions and recommendations on the various algorithms, as we suspect some of these algorithms are sufficiently complex to require multiple rounds of MapReduce to yield their final answer.
Presentation Slides
Paper: MapScale: A Cloud Environment for Scientific Computing