Thursday, June 3rd, 2010
8:30am – CS Conference Room
Committee: Fred Chong (chair), Tim Sherwood, Diana Franklin, Bronis R. de Supinski
Title: Exploiting Data Similarity in Multicore Systems
As we move from tens to hundreds of cores on a chip, it is easy to get lost trying to think of all the new ways this raw performance potential can be unleashed on those traditional applications we are all so comfortable with. Programming these chips to be efficient on these traditional workloads is important, but it is also quite tricky. As a single memory stream scales to tens, hundreds, or even thousands of reference streams, the memory system will struggle to service all the requests in a timely manner. Furthermore, even in the embedded markets where these massively parallel cores are already making inroads (e.g. Tilera TILE64, Ambric Am2045, and Nvidia GeForce GT200), the effort to modify these applications to be both correct and efficient at those levels of parallelism is non-trivial.
One natural, but easily overlooked, way to make use of the raw computational power of hundreds of cores is to execute multiple copies of the same program with different input data or parameters. When solving real problems (rather than running benchmarks), this model of parallelism is already common practice. For example, in circuit simulation the same simulator is run on the same circuit with various values of the simulation parameters (a parameter sweep). While these simulation processes are independent (in that there are no dependencies between processes), they share something very important that is not typically exploited by the architecture: the contents of much of their data. This computing model is even more prevalent in the high performance computing (HPC) domain, where many MPI (Message Passing Interface) tasks solve fragments of a large problem and communicate with each other only infrequently. In the HPC domain the limiting factor is the amount of physical memory, as the compute nodes often lack disk storage. Prior research in this field has explored various techniques to increase effective memory size, such as distributed shared memory (DSM), compression in memory or cache, and cooperative caching, but these techniques have proven ineffective because of their high overhead.
In this talk, I shall present our research on software and hardware techniques that exploit this data similarity to increase effective cache and memory capacity. The hardware approach introduces the “Mergeable cache”, which maintains a single copy of identical data blocks in the cache and thereby increases its effective capacity, so that applications execute faster. In our software approach, where the target is a transparent user-level solution, we developed a memory allocation library, SBLLmalloc, that uses shared memory to merge duplicate pages across MPI tasks transparently. This reduces the memory footprint of the MPI tasks on each node, making it possible to run large problems that could not otherwise be solved with the same resources.
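The idea of keeping a single copy of identical pages can be illustrated with a small sketch. This is not SBLLmalloc's implementation or the Mergeable cache design; it is a toy model that assumes 4096-byte pages, represents each task's memory as a Python bytes object, and uses hypothetical helper names (`split_pages`, `merge_identical_pages`, `footprint`) chosen only for this example. It shows how much the combined footprint shrinks when pages with identical contents are stored once.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes (illustrative only)


def split_pages(buf, page_size=PAGE_SIZE):
    """Split a buffer into fixed-size pages, zero-padding the last one."""
    pages = []
    for off in range(0, len(buf), page_size):
        page = buf[off:off + page_size]
        pages.append(page.ljust(page_size, b"\0"))
    return pages


def merge_identical_pages(task_buffers, page_size=PAGE_SIZE):
    """Keep one stored copy per distinct page content.

    Returns (store, page_tables): `store` maps a content hash to the
    single shared copy of that page, and `page_tables` holds, for each
    task, the list of hashes that make up its memory.
    """
    store = {}
    page_tables = []
    for buf in task_buffers:
        table = []
        for page in split_pages(buf, page_size):
            key = hashlib.sha256(page).digest()
            # On a hash hit, compare the bytes so that a collision can
            # never merge two pages with different contents.
            if key in store and store[key] != page:
                raise RuntimeError("hash collision between different pages")
            store.setdefault(key, page)
            table.append(key)
        page_tables.append(table)
    return store, page_tables


def footprint(task_buffers, page_size=PAGE_SIZE):
    """Return (unmerged_bytes, merged_bytes) for a set of task buffers."""
    store, tables = merge_identical_pages(task_buffers, page_size)
    unmerged = sum(len(t) for t in tables) * page_size
    merged = len(store) * page_size
    return unmerged, merged


if __name__ == "__main__":
    # Two tasks share three pages of identical data and each has one
    # private page, as in a parameter sweep over the same input.
    shared = b"A" * PAGE_SIZE + b"B" * PAGE_SIZE + b"C" * PAGE_SIZE
    tasks = [shared + b"x" * PAGE_SIZE, shared + b"y" * PAGE_SIZE]
    unmerged, merged = footprint(tasks)
    print(f"unmerged: {unmerged} bytes, merged: {merged} bytes")
```

In this toy setup the eight pages held by the two tasks collapse to five distinct pages. A real system would also have to handle writes to a merged page (for example by giving the writer a private copy again), which this sketch leaves out.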