Highly Parallel Memory Systems

With the dramatic scaling of individual processor performance and of the number of processors within a system, the memory system has become an even more critical component of total system performance. We are investigating several aspects of high-performance, low-power memory systems.

On-chip memory hierarchies to cope with the cross-chip latency of future chip multiprocessor systems. Chip multiprocessors (CMPs) solve several bottlenecks facing chip designers today. They deliver higher performance than wide superscalars by exploiting thread-level parallelism; mitigate global wire delay by minimizing cross-chip communications; and reduce power consumption by using less aggressive processor cores and lower clock frequencies.
In our studies, we consider tiled CMPs, a class of CMPs in which each tile contains a slice of the total on-chip L2 cache storage and tiles are connected by an on-chip network. The L2 slices can be managed using two basic policies. First, each slice can be used as a private L2 cache for its tile. Second, all slices can be aggregated to form a single large L2 cache shared by all tiles. Private L2 caches provide the lowest hit latency but reduce the total effective cache capacity, because each tile creates local copies of any block it touches. A shared L2 cache increases the effective cache capacity for shared data, but incurs longer hit latencies when L2 data resides on a remote tile.
We present two new policies, victim replication and victim migration, both of which combine the advantages of the private and shared designs. Both are variants of the shared scheme that attempt to keep copies of local L1 cache victims within the local L2 cache slice. Hits to these replicated copies reduce the effective latency of the shared L2 cache, while retaining the benefit of a higher effective capacity for shared data. We evaluate the various schemes using full-system simulation of single-threaded, multi-threaded, and multi-programmed workloads running on an eight-processor tiled CMP, and show that both techniques achieve good performance improvements across our workloads [2,3].
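
As a rough illustration of the two ideas above, the following Python sketch shows a shared-L2 home-slice mapping for an eight-tile CMP and the replacement preference a victim-replication-style policy might use when deciding whether to keep an L1 victim in the local slice. This is our own simplification for exposition: the tile count, block size, mapping function, and priority order are assumptions, not the exact design evaluated in the papers.

# Illustrative sketch only: an 8-tile CMP with 64-byte blocks is assumed,
# and the replacement preferences are a simplification of the idea described
# above, not the exact policy from the papers.

from dataclasses import dataclass, field

NUM_TILES = 8
BLOCK_BITS = 6  # 64-byte cache blocks

def home_slice(block_addr: int) -> int:
    """Shared L2: every block has exactly one home slice, chosen here by
    interleaving block addresses across tiles."""
    return (block_addr >> BLOCK_BITS) % NUM_TILES

@dataclass
class L2Line:
    tag: int = 0
    valid: bool = False
    is_replica: bool = False  # local copy of a block homed on a remote tile
    sharers: int = 0          # L1 sharers of a home (primary) copy

@dataclass
class L2Slice:
    ways: list = field(default_factory=lambda: [L2Line() for _ in range(8)])

    def try_replicate(self, tag: int) -> bool:
        """On an L1 eviction of a remotely-homed block, keep a replica in the
        local slice only if it does not displace a home copy that still has
        sharers.  Preference order (assumed): invalid line, home copy with no
        sharers, existing replica."""
        for prefer in (lambda w: not w.valid,
                       lambda w: w.valid and not w.is_replica and w.sharers == 0,
                       lambda w: w.valid and w.is_replica):
            for way in self.ways:
                if prefer(way):
                    way.tag, way.valid, way.is_replica, way.sharers = tag, True, True, 0
                    return True
        return False  # every way holds a home copy with active sharers

# Example: tile 3 evicts block 0x40 from its L1; the block is homed on a
# remote slice, but a later hit on the local replica avoids the cross-chip trip.
slice3 = L2Slice()
print(home_slice(0x40 << BLOCK_BITS), slice3.try_replicate(0x40))
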

Scalable coherence protocols to support thousands of processor nodes.

Low-power DRAM subsystems to reduce the power dissipation of DRAM banks.

Large-scale memory simulation techniques to enable simulation of kiloprocessor systems. One such technique is the Memory Timestamp Record (MTR), a fast and accurate technique for initializing the directory and cache state of a multiprocessor system. The MTR is a versatile, compressed snapshot of memory reference patterns that can be rapidly updated during fast-forwarded simulation or stored as part of a checkpoint. We evaluate the MTR using full-system simulation of a directory-based cache-coherent multiprocessor running a range of multithreaded workloads. Both the MTR and a multiprocessor version of functional fast-forwarding (FFW) produce similar performance estimates, usually within 15% of our detailed model. In addition to other benefits, the MTR achieves up to a 1.45x speedup over FFW and a 7.7x speedup over our detailed baseline [1].
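
To make the idea concrete, the Python sketch below shows a stripped-down timestamp record in the spirit of the MTR described above: cheap per-reference timestamp updates during fast-forwarding, followed by a reconstruction pass that infers plausible coherence state when detailed simulation resumes. The field names, the single global logical clock, and the reconstruction rule are our own simplifying assumptions, not the exact format used in [1].

# Illustrative sketch only: not the actual MTR data structure from [1].

from dataclasses import dataclass, field

NUM_CPUS = 8

@dataclass
class BlockRecord:
    last_writer: int = -1                                  # -1: never written
    write_time: int = -1
    read_time: list = field(default_factory=lambda: [-1] * NUM_CPUS)

class MemoryTimestampRecord:
    def __init__(self):
        self.blocks = {}   # block address -> BlockRecord
        self.clock = 0     # logical timestamp, bumped on every reference

    def record(self, cpu: int, block: int, is_write: bool) -> None:
        """Cheap per-reference update performed during fast-forwarding."""
        rec = self.blocks.setdefault(block, BlockRecord())
        self.clock += 1
        if is_write:
            rec.last_writer, rec.write_time = cpu, self.clock
        else:
            rec.read_time[cpu] = self.clock

    def reconstruct(self, block: int) -> list:
        """Infer a per-CPU MSI-style state when detailed simulation resumes:
        the last writer holds the block Modified unless later reads demote it
        to Shared among the post-write readers."""
        rec = self.blocks.get(block)
        states = ["I"] * NUM_CPUS
        if rec is None:
            return states
        readers = [c for c in range(NUM_CPUS) if rec.read_time[c] > rec.write_time]
        if readers:
            for c in readers:
                states[c] = "S"
            if rec.last_writer >= 0:
                states[rec.last_writer] = "S"
        elif rec.last_writer >= 0:
            states[rec.last_writer] = "M"
        return states

# Example: CPU 2 writes block 0x100, then CPU 5 reads it.
mtr = MemoryTimestampRecord()
mtr.record(2, 0x100, True)
mtr.record(5, 0x100, False)
print(mtr.reconstruct(0x100))   # CPUs 2 and 5 end up Shared, the rest Invalid
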

Publications

[1] "Accelerating Multiprocessor Simulation with a Memory Timestamp Record"
Kenneth C. Barr, Heidi Pan, Michael Zhang, and Krste Asanovic, IEEE Int'l Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, March 2005. (PDF paper)
[2] "Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled CMPs"
Michael Zhang and Krste Asanovic, IEEE Int'l Symposium on Computer Architecture (ISCA-32), Madison, WI, June 2005.
(PDF paper)
[2] "Victim Migration: Dynamically Adapting Between Private and Shared CMP Caches"
Michael Zhang and Krste Asanovic, MIT CSAIL Technical Report, MIT-CSAIL-TR-2005-064, October 2005.
(PDF paper)

Funding

We gratefully acknowledge the past and present sponsors of this work, including NSF, DARPA, CMI, SGI, IBM, and Intel.