The Scale Vector-Thread Architecture

Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Jared Casper, Krste Asanović

Parallelism and locality are the key application characteristics exploited by computer architects to make productive use of increasing transistor counts while coping with wire delay and power dissipation. Conventional sequential ISAs provide minimal support for encoding parallelism or locality, so high-performance implementations are forced to devote considerable area and power to on-chip structures that extract parallelism or that support arbitrary global communication. The large area and power overheads are justified by the demand for even small improvements in performance on legacy codes for popular ISAs. Many important applications have abundant parallelism, however, with dependencies and communication patterns that can be statically determined. ISAs that expose more parallelism reduce the need for area and power intensive structures to extract dependencies dynamically. Similarly, ISAs that allow locality to be expressed reduce the need for long-range communication and complex interconnect. The challenge is to develop an efficient encoding of an application's parallel dependency graph and to reduce the area and power consumption of the microarchitecture that will execute this dependency graph.

The vector-thread (VT) architectural paradigm unifies the vector and multithreaded execution models. VT allows large amounts of structured parallelism to be compactly encoded in a form that allows a simple microarchitecture to attain high performance at low power by avoiding complex control and datapath structures and by reducing activity on long wires. The VT programmer's model extends a conventional scalar control processor with an array of slave virtual processors (VPs). VPs execute strings of RISC-like instructions packaged into atomic instruction blocks (AIBs). To execute data-parallel code, the control processor broadcasts AIBs to all the slave VPs. To execute thread-parallel code, each VP directs its own control flow by fetching its own AIBs. Implementations of the VT architecture can also exploit instruction-level parallelism within AIBs.

In this way, the VT architecture supports a modeless intermingling of all forms of application parallelism. This flexibility provides new ways to parallelize codes that are difficult to vectorize or that incur excessive synchronization costs when threaded. Instruction locality is improved by allowing common code to be factored out and executed only once on the control processor, and by executing the same AIB multiple times on each VP in turn. Data locality is improved as most operand communication is isolated to within an individual VP.

We have developed a prototype processor, Scale, which is an instantiation of the vector-thread architectural paradigm designed for low-power and high-performance embedded systems. As transistors have become cheaper and faster, embedded applications have evolved from simple control functions to cellphones that run multitasking networked operating systems with realtime video, three-dimensional graphics, and dynamic compilation of garbage-collected languages. Many other embedded applications require sophisticated high-performance information processing, including streaming media devices, network routers, and wireless base stations. The Scale architecture is intended to replace ad-hoc collections of microprocessors, DSPs, FPGAs, and ASICs, with a single hardware substrate that supports a unified high-level programming environment and provides performance and energy-efficiency competitive with custom silicon.

We taped out the Scale chip on 10/15/2006 in TSMC 0.18 micron process, and the design was fully functional. Full details of the final implementation are available in [13].

Annotated die plot of the Scale chip.
Click on image for higher resolution version.

Publications

[1]		"The SCALE DRAM Subsystem", Brian Pharris, M.Eng. Thesis, Massachusetts Institute of Technology, May 2004 (PDF thesis)
[2]		"The Vector-Thread Architecture", Ronny Krashinsky, Christopher Batten, Mark Hampton, Steven Gerding, Brian Pharris, Jared Casper, and Krste Asanović, 31st International Symposium on Computer Architecture (ISCA-31), Munich, Germany, June 2004. (PDF paper, PDF slides, PPT slides)
[3]		"The Vector-Thread Architecture", Ronny Krashinsky, Christopher Batten, Mark Hampton, Steven Gerding, Brian Pharris, Jared Casper, and Krste Asanović, IEEE Micro Special Issue: Micro's Top Picks from Microarchitecture Conferences, November/December 2004. (PDF paper)
[4]		"Cache Refill/Access Decoupling for Vector Machines", Christopher Batten, Ronny Krashinsky, Steve Gerding, and Krste Asanović, 37th International Symposium on Microarchitecture (MICRO-37), Portland, OR, December 2004. (PDF paper, PDF slides, PDF slides+notes)
[5]		"A Parameterizable FPGA Prototype of a Vector-Thread Processor", Jared Casper, Ronny Krashinsky, Christopher Batten, and Krste Asanović, 1st Workshop on Architecture Research using FPGA Platforms (WARFP-2005), at 11th International Symposium on High Performance Computer Architecture (HPCA-11), San Francisco, CA, February 2005. (PDF abstract, PPT slides)
[6]		"SCALE DRAM Subsystem Power Analysis", Vimal Bhalodia, M.Eng. Thesis, Massachusetts Institute of Technology, September 2005, (PDF thesis)
[7]		"Implementing Virtual Memory in a Vector Processor with Software Restart Markers", Mark Hampton and Krste Asanović, 20th ACM International Conference on Supercomputing (ICS06), Cairns, Australia, June 2006. (PDF paper, PDF slides, PPT slides)
[8]		"Scale Control Processor Test-Chip", Christopher Batten, Ronny Krashinsky, and Krste Asanović, Technical Report MIT-CSAIL-TR-2007-003, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, January 2007. (PDF techreport)
[9]		"Vector-Thread Architecture and Implementation", Ronny Krashinsky, Ph.D. Thesis, Massachusetts Institute of Technology, May 2007 (PDF thesis) *Winner, George M. Sprowls Award for best Ph.D. thesis in Computer Science, MIT, 2007*
[10]		"The Scale Vector-Thread Processor", Ronny Krashinsky, Christopher Batten, and Krste Asanović, (PPT poster) *Winner, DAC/ISSCC Student Design Contest, February/June 2007*
[11]		"Reducing Exception Management Overhead with Software Restart Markers", Mark J. Hampton, Ph.D. Thesis, Massachusetts Institute of Technology, February 2008. (PDF thesis)
[12]		"Compiling for Vector-Thread Architectures", Mark Hampton and Krste Asanović, International Symposium on Code Generation and Optimization (CGO-2008), Boston, MA, April 2008. (PDF paper, PPT slides)
[13]		"Implementing the Scale Vector-Thread Processor", Ronny Krashinsky, Christopher Batten, and Krste Asanović, ACM Transactions on Design Automation of Electronic Systems (TODAES), 13(3), 41:1-41:24, July 2008. (PDF paper)

Funding

We gratefully thank the past and present sponsors of this work, including DARPA, NSF, CMI, Infineon, and Intel.