Computer Architecture: a qualitative overview of Hennessy and Patterson

Philip Machanick

September 1998; corrected December 1998; March 2000; April 2001

Contents

Chapter 1 Introduction
  1.1 Introduction
  1.2 Major Concepts
    1.2.1 Latency vs. Bandwidth
    1.2.2 What Computer Architecture Is
    1.2.3 The Quantitative Approach
    1.2.4 How Performance is Measured
  1.3 Components of the Course
  1.4 The Prescribed Book
  1.5 Structure of the Notes
  1.6 Further Reading

Chapter 2 Performance Measurement and Quantification
  2.1 Introduction
  2.2 Why Performance is Important
  2.3 Issues which Impact on Performance
  2.4 Change over Time: Learning Curves and Paradigm Shifts
    2.4.1 Learning Curves
    2.4.2 Paradigm Shifts
    2.4.3 Relationship Between Learning Curves and Paradigm Shifts
      2.4.3.1 Exponential
      2.4.3.2 Merced (EPIC, IA-64, Itanium)
      2.4.3.3 Why Paradigm Shifts Fail
  2.5 Measuring and Reporting Performance
    2.5.1 Important Principles
  2.6 Quantitative Principles of Design
  2.7 Examples: Memory Hierarchy and CPU Speed Trends
  2.8 Further Reading
  2.9 Exercises

Chapter 3 Instruction Set Architecture and Implementation
  3.1 Introduction
  3.2 Instruction Set Principles: RISC vs. CISC
    3.2.1 Broad Classification
  3.3 Challenges for Pipeline Designers
    3.3.1 Causes of Pipeline Stalls
    3.3.2 Non-Uniform Instructions
  3.4 Techniques for Instruction-Level Parallelism
  3.5 Limits on Instruction-Level Parallelism
  3.6 Trends: Learning Curves and Paradigm Shifts
  3.7 Further Reading
  3.8 Exercises

Chapter 4 Memory-Hierarchy Design
  4.1 Introduction
  4.2 Hit and Miss
  4.3 Caches
  4.4 Main Memory
  4.5 Trends: Learning Curves and Paradigm Shifts
  4.6 Alternative Schemes
    4.6.1 Introduction
    4.6.2 Direct Rambus
    4.6.3 RAMpage
  4.7 Further Reading
  4.8 Exercises

Chapter 5 Storage Systems and Networks
  5.1 Introduction
  5.2 The Internal Interconnect: Buses
  5.3 RAID
  5.4 I/O Performance Measures
  5.5 Operating System Issues for I/O and Networks
  5.6 A Simple Network
  5.7 The Interconnect: Media and Switches
  5.8 Wireless Networking and Practical Issues for Networks
  5.9 Bandwidth vs. Latency
  5.10 Trends: Learning Curves and Paradigm Shifts
  5.11 Alternative Schemes
    5.11.1 Introduction
    5.11.2 Scalable Architecture for Video on Demand
    5.11.3 Disk Delay Lines for Scalable Transaction-Based Systems
  5.12 Further Reading
  5.13 Exercises

Chapter 6 Interconnects and Multiprocessor Systems
  6.1 Introduction
  6.2 Types of Multiprocessor
  6.3 Workload Types
  6.4 Shared-Memory Multiprocessors (Usually SMP)
  6.5 Distributed Shared-Memory (DSM)
  6.6 Synchronization and Memory Consistency
  6.7 Crosscutting Issues
  6.8 Trends: Learning Curves and Paradigm Shifts
  6.9 Alternative Schemes
  6.10 Further Reading
  6.11 Exercises

Chapter 7 References

Chapter 1 Introduction

Computer Architecture is a wide-ranging subject, so it is useful to find a focus to make it interesting and to make sense of the detail. The modern approach to computer architecture research is to quantify as much as possible, so much of the material covered is about measurement, including how to report results and compare them. These notes, however, aim to provide a more qualitative overview of the subject, to place everything in context. The quantitative aspect is nonetheless important, and is covered in the exercises.

1.1 Introduction

The prescribed book contains a lot of detail. The aim of these notes is to provide a specific spin on the content as well as a more abstract view of the subject, to help make the detail easier to understand. The main focus of this course is understanding the following two related issues:

• the conflict between achieving latency and bandwidth goals
• long-term problems caused by differences in learning curves

This introductory chapter explains these concepts, and provides a starting point for understanding how the prescribed book will be used to support the course. First, the concepts are defined and explained. Next, the chapter breaks the course down into components, and the components are then related to the contents of the book. Finally, the structure of the remainder of the notes is presented.


1.2 Major Concepts

This section looks at the main focus of the course, latency vs. bandwidth, and also introduces some of the other major concepts in the course: what architecture is, the quantitative approach, and how performance is estimated or measured.

1.2.1 Latency vs. Bandwidth

Loosely, latency is efficiency as seen by the user, while bandwidth is the overall efficiency of the system. More precisely, latency is defined as the time to complete a specific operation. Bandwidth (also called throughput) is the number of units of work that can be completed in a given time unit.

These two measures are very different. Completing a specific piece of work quickly, to ensure minimum latency, can come at the cost of overall efficiency as measured by bandwidth. For example, a disk typically takes around 10 ms to perform an access, most of which is time to move the head to the right place (seek time) and to wait for rotation of the disk (rotational delay). If a small amount of data is needed, minimum latency is achieved if only that piece of data is read off the disk. However, if many small pieces of data are needed which happen to be close together on the disk, it is more efficient to fetch them all at once than to do a separate disk transaction for each, since less seek and rotational delay time is required in total. An individual transaction is slower (since the time to transfer a larger amount of data is added onto it), but the overall effect is more efficient use of the available bandwidth, or higher overall throughput (a short numeric sketch of this trade-off follows the list below).

Balancing latency and bandwidth requirements in general is a hard problem. Issues which make it harder include:

• latency has to be designed in from the start since latency reduction is limited by the worst bottleneck in the system
• bandwidth can in principle be added more easily (e.g. add more components working in parallel; simple example: make a wider sewer pipe, or add more pipes)
• improving one is often at the expense of the other (as in the disk access example)
• different technologies have different learning curves (the rate at which improvements are made), which tends to invalidate design compromises after several generations of improvements
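To make the disk example above concrete, here is a minimal Python sketch of the trade-off. The 10 ms access overhead comes from the example in the text; the transfer rate and request sizes are hypothetical values chosen purely for illustration.

    # Rough model of a disk: each transaction pays a fixed access overhead
    # (seek + rotational delay), then transfers its data at a fixed rate.
    ACCESS_OVERHEAD_S = 0.010        # ~10 ms seek + rotational delay (from the text)
    TRANSFER_RATE_B_PER_S = 20e6     # hypothetical sustained transfer rate

    def transaction_time(bytes_moved):
        """Time for one disk transaction of the given size, in seconds."""
        return ACCESS_OVERHEAD_S + bytes_moved / TRANSFER_RATE_B_PER_S

    n_items, item_size = 10, 4096    # ten small, nearby 4 KB records (hypothetical)

    separate = n_items * transaction_time(item_size)   # one transaction per item
    batched = transaction_time(n_items * item_size)    # one larger transaction

    print(f"separate requests: {separate * 1000:.1f} ms in total; "
          f"first item arrives after {transaction_time(item_size) * 1000:.2f} ms")
    print(f"one batched request: {batched * 1000:.1f} ms in total; "
          f"first item arrives after {batched * 1000:.2f} ms")
    print(f"throughput gain from batching: {separate / batched:.1f}x")

The batched transaction has worse latency for any single item, but it moves all ten items in far less total time, which is the bandwidth side of the trade-off.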

1.2.2 What Computer Architecture Is

Computer Architecture, broadly speaking, is about defining one or more layers of abstraction that together define a virtual machine. At a coarse level, architecture is divided into software and hardware architecture. Software architecture defines a number of layers of abstraction, including the operating system, the application programming interface (API), the user interface and possibly higher-level software architectures, like object-oriented application frameworks or component software models. This course is concerned primarily with hardware architecture, though software is occasionally mentioned.

Hardware architecture can in turn be divided into many layers, including:

• system architecture—interaction between major components, including disks, networks, memory, processor and the interconnects between the components
• I/O architecture—how devices such as disks interact with the rest of the system (often including software issues)
• instruction set architecture (ISA)—how a programmer (or, more correctly in today’s world, a compiler) sees the machine: what the instructions do, rather than how they are implemented
• computer organization—how low-level components interact, particularly (but not only) within the processor
• memory hierarchy—interaction between components and other parts of the system (including software aspects, particularly relating to the operating system)

This course covers most of these areas, but the strongest focus is on the interaction between the ISA, processor organization and the memory hierarchy. Note that many people really mean “ISA” when they say “architecture”, so take care to be clear on what is being discussed.


1.2.3 The Quantitative Approach

Since the mid-1980s, computer designers have increasingly accepted the quantitative approach to computer architecture, in which measurement is made to establish the impact of design decisions on performance. This approach was popularized largely as a result of the RISC movement. RISC was essentially about designing an ISA which made it easier to achieve high performance. To persuade computer companies to buy the RISC argument, researchers had to produce convincing measurements of systems that had not yet been constructed, to show that their approach was in fact better. This is not to say that there was no attempt at measuring performance prior to the RISC movement, but the need to sell a new, as yet untested, idea helped to push the quantitative approach into the mainstream. Up to that time, performance had often been measured on real systems using fairly arbitrary measures that did not allow meaningful performance comparisons across rival designs.

1.2.4 How Performance is Measured

If performance of a real system is being measured, the most important thing to measure is run time as seen by the user. Many other measures have been used, like MIPS (millions of instructions per second, otherwise known as “meaningless indicator of performance of systems”). Although the most important thing to the user is the program that slows down their work, everyone wants standard performance measures. As a result, benchmarks—programs whose run times are used to characterize the performance of a system—of various forms are in wide use. Many of these are not very useful, as they are too small to exercise major components of the system (for example, they don’t use the disk, or they fit completely into the fastest level of the memory hierarchy).

In the 1980s, the Standard Performance Evaluation Corporation (SPEC) was formed in an attempt at providing a standard set of benchmarks, consisting of real programs covering a range of different styles of computation, including text processing and numeric computation. At first, a single combined score was available, called the SPECmark, but in recent versions of the SPEC benchmarks the floating-point and integer scores have been separated, to allow for the fact that some systems are much stronger in one of these areas than their competitors. Even so, the SPEC benchmarks have had problems as indicators of performance. Some vendors have gone to great lengths to develop specialized compilation techniques that work on specific benchmarks but are not very general. As a result, SPEC numbers now include a SPECbase number, which is the combined run times in which all the SPEC programs were compiled using the same compiler options.

Another problem with SPEC is that it doesn’t scale up well with increases in speed. A faster CPU will not just be used to solve the same problem faster, but also to solve bigger problems, yet the SPEC benchmarks use fixed-size datasets. As a result, the effect of bigger data is not captured. A very fast processor with too little cache, for example, may score well on SPEC, but disappoint when a real-world program with a large dataset is run on it. These problems with SPEC are addressed by regular updates to the benchmarks. However, the problem then becomes one of having no basis for historical comparisons. In the area of transaction processing, another set of benchmarks, TPC (from the Transaction Processing Performance Council), tries to address the scaling problem. The TPC benchmarks measure performance in transactions per second, but to claim a given level of TPS, the size of the data has to be scaled up. Section 5.4 revisits the idea of scalable benchmarks, which applies not only to I/O but to the more general case as well.

If a real system does not exist, there are several approaches that can be taken to estimate performance:

• calculation—though it’s hard to take all factors into account, it’s possible to do simple calculations to quantify the effects of a design change; such calculations at least help to justify more detailed investigation
• partial simulations—a variety of techniques make it possible to measure the effects of a design change, without having to simulate a complete system
• complete simulations—possibly even including operating system code; such simulations are slow, but may be necessary to evaluate major design changes (e.g., adoption of a radical new instruction set)

Since performance measurement is complex, we will return to this issue in more detail in Chapter 2.

1.3 Components of the Course

The course is broken down into five components, plus a practical workshop:

1. performance measurement and quantification, which expands on the description in the previous section
2. instruction set architecture and implementation, including pipelining and instruction-level parallelism
3. memory-hierarchy design
4. storage systems and networks
5. interconnects and multiprocessor systems, which ties together several earlier sections

The practical workshop will focus on memory hierarchy issues.

1.4 The Prescribed Book

The prescribed book for the course is John L Hennessy and David A Patterson, Computer Architecture: A Quantitative Approach (3rd edition), Morgan Kaufmann, San Francisco, to be published. Content is generally derived fairly directly from the book, though topics are sometimes grouped slightly differently. It is assumed that you know something of computer organization from previous courses (e.g. the basics of a simple pipeline, what registers are, what an assembly language program is, etc.), so sections of the book relating to these topics are handled superficially. Testing in this course is open book, so it is important to have a copy of the book (and not an earlier edition, which is significantly different).

1.5 Structure of the Notes

The remainder of the notes contains one chapter for each of the major components of the course—performance measurement and quantification, instruction set architecture and implementation, memory-hierarchy design, storage systems and networks, and, finally, interconnects and multiprocessor systems. Each chapter ends with a Further Reading section, and each chapter after this one contains exercises, some of which are pointers to exercises in the prescribed book.


1.6 Further Reading

For further information on standard benchmarks, the following web sites are useful:

• Transaction Processing Performance Council
• Standard Performance Evaluation Corporation

To find out more about a popular architecture simulation tool set, see the SimpleScalar web site.

Please also review Chapters 1 and 2 of the prescribed book.


Chapter 2 Performance Measurement and Quantification

Performance measurement is a complex area, often reduced to bragging rights and misleading marketing. It’s important to understand not only what to measure, but how to present and interpret results. Otherwise, it’s easy to mislead or be misled.

2.1 Introduction

This chapter is largely based on Chapter 1 of Hennessy and Patterson, “Fundamentals of Computer Design”, but with some material drawn from elsewhere to add to the discussion. The material presented here aims not only to illustrate how to measure and report results, but also how to evaluate the significance of results in terms of changes in technology. For this reason, issues which affect technology change are also considered.

The remainder of the chapter is broken down as follows. First, why performance is important is considered. Then, issues which impact on performance are examined, followed by a discussion of changes over time. Then, the sections from the book on measuring and reporting performance (1.5) and quantitative principles of design (1.6) are summarized. Finally, memory hierarchy and changes in CPU design are combined to illustrate the issues and problems in predicting future performance.

2.2 Why Performance is Important

I have often heard it said that “no one will ever need a PC faster than model X, but you shouldn’t buy model Y, it’s obsolete”. A year later, model X has moved to the obsolete slot in the same piece of sage advice. More speed for the same money is always easy to sell. But do people really need it?


Some might say no, it’s just like buying a faster car and then keeping to the speed limit, but there are legitimate reasons for wanting more speed:

• you can solve bigger problems in the same time
• things that used to hold up your work cease to be bottlenecks
• you can do things that were previously not possible

Many people argue that we are reaching limits on what a consumer PC really needs to do, but such arguments are not new. As speeds improve, expectations grow: high-speed 3D graphics, for example, is becoming a commodity (even if the most exotic requirements are still the province of expensive specialized equipment). At the higher end, there is always demand for more performance because, by definition, those wanting the best, fastest equipment are pushing the envelope, and would like more performance to achieve their goals, and to do bigger, better things.

Always, though, the important thing to remember is that the issue of most importance to the user is whatever slows down achieving the desired result. Never lose sight of this when looking at performance measurement: any measures that do not ultimately tell you whether the response time to the user is improved are at best indirect measures and at worst useless.

2.3 Issues which Impact on Performance

Obviously, if every component of a computer can be sped up equally by a given factor, a given task can run that many times faster. However, algorithm analysis tells us that unless the program runs in linear time or better, we will not see an equivalent increase in the size of problem that can be solved in the same amount of elapsed time. For example, if the algorithm is O(n²), you need a computer four times as fast to solve a problem twice as big in the same elapsed time. Thus, as computers become faster, it paradoxically becomes more important than before to find good algorithms, so the performance increase is not lost. Sadly, many software designers do not realize this: as computers get faster, time wasted on inefficiency grows; the modern software designer’s motto is “waste transistors”.
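As a quick illustration of the point about algorithms (a hypothetical sketch, not from the book), the following Python fragment asks how much larger a problem can be solved in the same elapsed time when the machine becomes k times faster, for a few common complexity classes.

    import math

    # For a cost function cost(n), find the largest problem size that fits in the
    # same elapsed time on a machine k times faster than one that handled size n0.
    def scaled_size(cost, n0, k):
        budget = k * cost(n0)        # k times the original work now fits
        n = n0
        while cost(n + 1) <= budget:
            n += 1
        return n

    n0, k = 1000, 4                  # baseline problem size 1000, machine 4x faster
    for name, cost in [("O(n)      ", lambda n: n),
                       ("O(n log n)", lambda n: n * math.log(n)),
                       ("O(n^2)    ", lambda n: n * n)]:
        print(f"{name}: problem size grows from {n0} to {scaled_size(cost, n0, k)}")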


What are the factors that make it possible to improve the speed of a computer, aside from improving the software? Essentially, all components of the computer are susceptible to improvement, and if any are neglected, they will eventually dominate performance (see Section 2.5). Here are some areas of an overall computer system that can impact on performance, and issues a designer has to consider for each:

• disk subsystem—access time (time to find the right part of the disk, often thought of as latency, since this is the dominant factor in the time for one disk transaction) and transfer rate are both important issues; which of the two is more important depends on how the disk is used, but access time is the harder one to improve and is generally the reason that disks are a potential performance bottleneck; there are a number of software issues that need to be considered as well, such as the disk driver, the operating system and the efficiency of algorithms that make extensive use of the disk; the disk also relates to the virtual memory system (see memory below)
• network—latency again is a problem and is composed of many components, much of which is software, including assembling packets, queueing delays and other operating system overhead, and how the specific network protocol controls access to the network; bandwidth of a given network may be limited by capacity concerns (e.g., collisions may reduce the total available network traffic), the physical speed of the medium, and the speed of the computers, switches and other network components
• memory—designers have to explore trade-offs in cost and speed; the traditional main memory technology, DRAM (dynamic random access memory), is becoming increasingly slow compared with processors, so designers have to use a hierarchy of different-speed memories: the faster, smaller ones (caches) nearer the CPU help to hide the lower speed of slower, larger ones, with disk at the bottom of the hierarchy as a paging device in a virtual memory system
• processor (or CPU: central processing unit)—designers here focus on two key issues: improving clock speed and increasing the amount of instruction-level parallelism (ILP), the amount of work done on each clock tick; this is not as easy as it sounds, as programming languages are inherently sequential, and other components, particularly the memory system, interact with the CPU in ways that make it hard to scale up performance
• power consumption—although not strictly a component but rather an overall design consideration, power consumption is an important factor in speed-cost trade-offs; some design teams, such as Compaq’s (formerly Digital’s) Alpha team and Intel’s Pentium and successor teams, have gone for speed over low power consumption, with the result that their designs add cost to an overall system and are not highly adaptable to mobile computing or embedded designs (e.g. a computer as part of a car, toy or appliance)


2.4 Change over Time: Learning Curves and Paradigm Shifts

What makes the issues a designer must deal with even more challenging is the rapid advance of knowledge, and, more challenging still, the fact that different areas are advancing at different rates. This section examines two styles of change in technology: learning curves (incremental improvement) and paradigm shifts (changes to new models). To put the two into context, the section concludes with a discussion of the way the two approaches to achieving change interact.

2.4.1 Learning Curves

The general theory of learning curves goes back to the early days of the aircraft industry in the 1930s. The idea is that a regular percentage improvement over time leads to exponential advance. The general form of the formula is

    performance = P × B^t

where B is the rate of “learning” of improvements over a given time interval (usually a year, but in any case the units in which elapsed time, t, is expressed), and P is a curve-fit parameter. For example, if B is 1.1, there is a 10% improvement per year.

Possibly the best-known learning curve in the computer industry is Moore’s Law, named for one of the founders of Intel, Gordon Moore, who predicted the rate of increase in the number of transistors on a chip. In practice, the combined effect of the increased number of transistors (which makes it possible for more to be done at once—more ILP) and smaller components (or reduced feature size—which makes it possible to run the chip faster) has in recent times led to a doubling of the speed of CPUs approximately every 18 months. By contrast, in DRAM, the focus of designers is on improving density rather than speed. You can buy four times as much DRAM for the same money approximately every 3 years (though the curve isn’t smooth: there are short-term fluctuations for events like the launch of a new operating system that uses more memory). The basic cycle time of DRAM, on the other hand, only improves at a rate of about 7% per year, leading to an increasing speed gap between CPUs and main memory.
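The learning-curve formula can be applied directly to the rates just quoted. In this minimal Python sketch, the starting values are arbitrary (only the ratios matter), and the growth rates are the ones given above: CPU speed doubling every 18 months and DRAM cycle time improving at about 7% per year.

    # performance = P * B**t, with B the improvement factor per year and P = 1 here.
    CPU_B = 2 ** (1 / 1.5)     # CPU speed doubles every 18 months: about 1.59x/year
    DRAM_B = 1.07              # DRAM cycle time improves by about 7% per year

    for years in range(0, 13, 3):
        cpu = CPU_B ** years           # relative CPU speed after `years` years
        dram = DRAM_B ** years         # relative DRAM speed after `years` years
        print(f"after {years:2d} years: CPU {cpu:6.1f}x, DRAM {dram:4.2f}x, "
              f"CPU/DRAM gap {cpu / dram:6.1f}x")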

2.4.2 Paradigm Shifts

What happens when a learning curve hits a hard limit (for example, when a transistor starts to become too small in relation to the size of an electron to make sense physically)? Or what happens if someone comes up with a fundamentally better new idea? A learning curve is essentially an incremental development model. Although at a detailed level a lot of imagination may be applied to the problem of speed improvement, reducing feature size, etc., the overall design framework remains the same. A different development model is radical change: the theory or model being applied changes completely. If the change is at a very deep level, it is called a paradigm shift.

Essentially, a paradigm shift resets the learning curve. It may start from a new base, and may have a different rate of growth. Paradigm shifts are hard to sell for several reasons:

• they require a new mindset, and those schooled in the old way may often perversely expend considerable energy persuading themselves that the idea is not new, or even if it is, it can’t work
• they require a complete new set of work experience to be built up
• they may in extreme cases require that the entire infrastructure they plug into be changed (for example, if the execution model of the CPU is radically different, it may require new programming languages, compilers and operating systems, a new memory hierarchy, etc.)

Even so, paradigm shifts do occur, if usually in less radical forms. For example, the RISC movement was quite successful in reforming the previous trend towards increasing complexity of instruction set design. Here are some examples of paradigm shifts and their status:

• high-level language computing (HLL)—making the instruction set closer to high-level languages, to make compiler writing easier and potential execution faster: died of complexity; simpler instruction sets turned out to be easier to execute fast, and were less tied to the specifics of a given language
• microcode—the earliest computers had hard-wired logic, but in the 1970s microcode, a very simple hardware-oriented machine language, became popular as a way of implementing instruction sets; the idea was essentially killed when it was no longer possible to make a ROM much faster than other kinds of memory and, as with HLL instruction sets, simplicity won the day; today microcode is all but dead, though we still refer to details of the architecture at a level below machine code as the microarchitecture
• dataflow computing—the idea was that the order of computation should depend on which data was ready, and not on the order in which instructions were written: mostly dead, except that some internal details of the Intel Pentium Pro and its successors (the P6 family) appear to be based on the dataflow model
• asynchronous execution—a typical CPU wastes a lot of time waiting for the slowest part of the CPU (let alone other components of the system like DRAM), and the asynchronous model aims to eliminate this wastage by having the CPU operate at the fastest possible speed, rather than working on the fixed timing of a clock signal: the jury is still out on this one, and research continues

2.4.3 Relationship Between Learning Curves and Paradigm Shifts

Why then, if a paradigm shift can jump ahead of everything else, is most research incremental? Why does industry prefer small steps to a big breakthrough? First, at a conceptual level, a paradigm shift, as noted before, can be a hard sell: it requires a change in one’s internal mental model to understand it fully. Secondly, if it really involves major change, it involves changing not only that particular technology but everything else that goes to make up the overall system (for example, the dataflow model, to work properly, needed not only a different style of CPU but also a different style of memory).

At a practical level, a big problem, especially if B for the competing learning curve is large, is that the paradigm shift is aiming at a moving target. The time and effort to get a completely new approach not only accepted but into production has to be measured against the gains the conventional approach will make. If there is any slip in the schedule, the paradigm shift may look unattractive, at least in its initial form. Perhaps, if its B is much larger than that of the conventional model, it could catch up and still beat the conventional model, but the opportunity to pick up all-important momentum and early design wins (selling the technology to system designers) is lost.

Here are two examples, one that failed and one that might—both relatively modest as paradigm shifts go—followed by a wrap-up discussion of how learning curves often defeat paradigm shifts.


2.4.3.1 Exponential

Exponential, to much fanfare, announced that they had a promising new approach to high-speed processors. In 1995, they announced that they would soon produce a 500MHz version of the PowerPC, which would give it one of the fastest clock speeds of any shipping processor. Their big breakthrough was a new strategy for combining Bipolar and CMOS logic (two alternative strategies for creating components on a chip). Historically, Bipolar has been used for high-speed circuits and CMOS where density or power consumption is a concern. Traditional microprocessors use CMOS, though a few core components may use Bipolar to increase speed. Exponential claimed they could use Bipolar as the main technology, with CMOS in particular areas where less speed (or lower power) was required, in a new form of BiCMOS—combined Bipolar and CMOS. This, they claimed, allowed them to achieve much higher clock speeds without the power consumption and cost penalty usually associated with Bipolar logic.

Exponential’s big problem was time-to-market. Their 500MHz goal sounded impressive even 3 to 4 years later, but their processor was very simple in other areas, with, for example, a low number of instructions per clock cycle (IPC). Moreover, their timing was defeated by Moore’s Law. While their claims looked good in 1995, by mid-1997, when they were finally ready to ship, the IBM-Motorola consortium almost had the PowerPC 750 ready. While the PPC 750 started at much lower clock speeds (initial models shipped at 233MHz and 266MHz), other details of the microarchitecture gave it unexpectedly good performance. Not only that, the PPC 750 was a very low-power-consumption design, which gave it strong cost advantages over Exponential, not to mention making it usable in both notebooks and desktop systems.

Exponential did deliver in the end, but slightly too late: Moore’s Law, the learning-curve law, defeated their relatively modest paradigm shift. If they had been in volume production 6 months earlier, it might have been different: they would then have been on a learning curve as well, and likely one with the same learning rate, B, as the rest of the industry.


2.4.3.2 Merced (EPIC, IA-64, Itanium)

Intel’s new architecture, a 64-bit processor running a new instruction set called IA-64, is another attempt at breaking out of the mold. The first version of the IA-64 family, code-named Merced (and now being marketed as Itanium), appears to be in trouble, since its delivery schedule has slipped at least 5 years.

Intel claims that the new design addresses the problem of obtaining more instruction-level parallelism (an issue we’ll cover in more detail when we do pipelines in the next chapter of these notes). The approaches they are using include:

• predication—an instruction can be flagged as only to be executed if a particular condition is true, reducing the need to have branch instructions around short pieces of code
• very long instruction word (VLIW)—several instructions are grouped together in one big unit, to be executed at once: this is an approach which has been used before but which Intel claims to have done better
• explicitly parallel instruction computing (EPIC)—extra information generated by the compiler tells the processor which instructions may be executed in parallel
• very large number of registers—most current RISC designs have 32 general-purpose (i.e., integer) registers and 32 floating-point registers, while Merced has many more
• speculative loads—a load instruction can be set up so that if it fails (e.g. because of a page fault or invalid address), it is ignored, so a load can be set up long before it’s actually needed; if the data is actually needed, it will have been fetched long enough before it’s needed that cache misses can be hidden

Intel (and HP, who has also worked on the design and may have thought of most of the new ideas) claims that Itanium will be able to execute many more instructions in parallel than existing designs. However, there are doubters. Most of the things introduced are at best simplifications of things that can be done anyway. VLIW was not a big success before, and the new model, while better in some ways, may not address enough of the problems (the biggest one: consistently finding enough work to parcel together into big multi-operation instructions to be worth the effort). Predication, while a good idea, turns out essentially to require pretty much the same hardware complexity as speculative execution (see Chapter 3, where we look at complex pipelines). A higher number of registers is good, but existing architectures get the same effect with register renaming, which, while it adds complexity, means registers can be encoded using fewer bits in an instruction. Speculative loads are a good idea, but it’s not clear how big a difference they will make. The value of the EPIC idea depends on still-to-be-tested compilers, so it’s unclear how big a win it will be.

The deep underlying problem with all this is that there is little evidence to support the view that there is a lot of untapped parallelism left to exploit at the local level (i.e. a few instructions apart). A combination of existing methods with new compiler-based approaches to identify parallelism further apart (e.g. in different procedures) might be more profitable. Or it might be that most current programs are simply not written in a style that exposes parallelism. Or maybe the problems we currently solve on computers have been chosen for their essentially sequential nature because that’s what we think computers can do best. A final “or maybe”: maybe our existing programming languages aren’t really designed to expose parallelism.

In summary, there are a lot of “or maybes” in deciding whether the IA-64 approach will be a win. The fact that Intel has shifted the shipping date several times suggests that even if it is a good idea, they will have a problem making the great leap ahead of the competition that they’d hoped for. Since this is a widely-reported example, it is followed up in more detail in Section 3.5.

2.4.3.3 Why Paradigm Shifts Fail

In general, why are paradigm shifts so hard? As suggested before, there are perception problems, but it goes deeper than this. Even if the paradigm shift increases the base from which the technology is judged, increases the rate of improvement B faster than older technologies, or both, there is usually an extra cost associated with getting a new technology working. Existing design, design verification and manufacturing techniques may not work, resulting in delays in the initial launch. Other required new technologies may delay the launch too (or even fail, making the promise of the new technology hollow). In the Merced case, for example, heavy reliance is placed on new compiler techniques, which some doubt because the original VLIW movement also ran into problems in producing good enough compilers. In the case of Exponential, although the basic design appeared to deliver, the rival PowerPC 750, although ostensibly a simpler design, delivered better performance, because its details were fine-tuned to the workloads typically run on a Power Macintosh system.

Another example of the required infrastructure defeating a cool new idea is Charles Babbage’s 19th-century mechanical computers, which pushed the envelope of machine-tool technology too hard. Working examples of his machines have been built in recent times, showing that the designs were fundamentally sound.

The kinds of paradigm shifts described here are relatively shallow: changes at the level seen by the system designer in the Exponential case, and by the compiler writer as well in the Merced case. Deeper paradigm shifts, of the kind that change the way we work or use computers, in some ways seem harder to justify, yet they can also be easier to sell. All that’s required is that a large enough segment of the market adopt the idea to give it critical mass. Winning on the price-performance treadmill against Moore’s Law is less important in such cases, as a qualitative change may be the selling point, or an improvement in usability. Yet even here, most people prefer the paradigm shift to be disguised as incremental change (hence the greater success of Microsoft’s small-step-at-a-time approach to converting to window-based programs, versus their major competitors’ all-at-once approaches—even if today the standard by which Windows remains judged is how well it has imitated its predecessors).

Another issue which can save a paradigm shift is if different parts of a system have very different learning curves. This is one of the reasons that hardwired logic made a comeback in the 1980s. Prior to that time, it was possible to make a much faster (if small) ROM than any practical RAM, which favoured microcode as an instruction set implementation strategy. However, memory technology changed, and faster RAMs made it hard to achieve competitive performance with microcode: whatever speed of microstore (as the memory used for microcode was called) could be used, an equally fast if not faster memory could be used for caches. Mismatches in learning curves in different parts of a system therefore present potential opportunities for paradigm shifts at the implementation level, particularly performance-oriented changes in design models.

2.5 Measuring and Reporting Performance

Read and understand Section 1.5 of the prescribed book. This section contains a few highlights to guide you.

First, the most important measure to the user is elapsed time, or response time. Look at the example given of output from a UNIX time command and note that CPU time is but one component of the elapsed (“wall clock”) time, and can be a relatively small fraction at that. The discussion on benchmarks is interesting: make sure you understand in particular why real programs make the best benchmarks, and how the SPEC benchmarks are reported on (including the use of geometric means). Note the definitions of latency and throughput (also called bandwidth, particularly in input-output, I/O). Note also the discussion on comparing times, and what it means to say “X is n times the speed of Y”. Page ahead to other definitions of speed comparison, including speedup, and what we mean when we say something is n% faster than something else. Write definitions here (in all cases, for X and Y, the run times are respectively t_X and t_Y):

X is n times the speed of Y

speedup of X with respect to Y

X is n% faster than Y

Important: “how much faster” is a difference measure, whereas “speedup” is a multiplicative measure (or a ratio). Make sure that you have this distinction clear. To be sure there is no confusion, I recommend sticking with exactly two measures:

• speedup—original time / improved time—a ratio
• percent faster—(original time – improved time) / (original time) × 100—a difference measure

A very common error is forgetting to subtract 100% from a difference measure. For example, if one machine’s run time is half that of another, it is 50% faster—not 100% faster. If it was 100% faster, it would take no time at all to run.
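Here is a tiny Python sketch of the two recommended measures, using hypothetical run times.

    def speedup(original_time, improved_time):
        """Ratio measure: original time divided by improved time."""
        return original_time / improved_time

    def percent_faster(original_time, improved_time):
        """Difference measure: fraction of the original time saved, as a percentage."""
        return (original_time - improved_time) / original_time * 100

    t_original, t_improved = 10.0, 5.0       # hypothetical run times in seconds
    print(f"speedup: {speedup(t_original, t_improved):.1f}x")                 # 2.0x
    print(f"percent faster: {percent_faster(t_original, t_improved):.0f}%")   # 50%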


2.5.1 Important Principles

There are two important guiding principles in computer system design:

• make the common case fast—it’s hard (and expensive) to make everything fast, so when making design choices, favour the things that happen most often; not only is this where the biggest gains are to be had, but the frequent cases are often simpler
• measure everything when calculating the effect of an improvement—note that the speedup formula, also called Amdahl’s Law, requires that you take into account the entire system when measuring the effect of speeding up part of it: work through the given examples, and see the sketch after this list (see also Exercise 2.3 for an example, not involving computers, of how misleading calculating speedup without taking all factors into account can be)
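As a reminder of why “measure everything” matters, here is a minimal Python sketch of Amdahl’s Law in its usual form; the 20% fraction and the 10× enhancement are hypothetical values chosen for illustration.

    def amdahl_speedup(fraction_enhanced, enhancement_speedup):
        """Overall speedup when only part of the execution time is sped up."""
        return 1.0 / ((1.0 - fraction_enhanced)
                      + fraction_enhanced / enhancement_speedup)

    # Speeding up 20% of execution time by 10x gives nowhere near 10x overall.
    print(f"overall speedup: {amdahl_speedup(0.2, 10):.2f}x")   # about 1.22x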

These two principles illustrate how hard the system designer’s job can be. Focussing only on the obvious bottlenecks and common operations can only take you so far before the less common things start to dominate execution time. This was a lesson learnt the hard way in the Illiac-IV supercomputer design at the University of Illinois, where I/O was neglected in the quest to maximize CPU speed, with the result that the system spent a large fraction of its time waiting for peripherals instead of computing. The Illiac fiasco was one of Amdahl’s inspirations in devising his famous law.

2.6 Quantitative Principles of Design

With some fundamentals out of the way, what are the issues that contribute to run time? Let’s start with a simple model of CPU performance; more detail will follow in later chapters. The CPU performance equation in its simplest form is given on p 38 of Hennessy and Patterson:

    CPU time = CPU clock cycles for a program × clock cycle time

or, equivalently,

    CPU time = CPU clock cycles for a program / clock rate

Note also how, for a given program, the CPU time can further be decomposed into clock cycle time, the number of instructions executed (instruction count: IC) and clock cycles per instruction (CPI), giving

    CPU time = IC × CPI × clock cycle time

and note what factors impact on these three measures. The other forms of the formula are given for cases where it’s useful to distinguish the time used by different types of instruction: a given program will use an instruction mix, the frequencies of execution of the available instructions. Note that the frequency can be measured in three different ways:

of execution of the available instruction. Note that the frequency can be measured in three different ways: •

static frequency—the frequency of each instruction in a program, as measured from the code without running it



dynamic frequency—the fraction of all instructions executed that a given instruction executes



fraction of run-time—fraction of execution time accounted for by a given instruction, which is essentially the dynamic frequency weighted for the relative time a given instruction takes to execute

It is important to distinguish these three measures and avoid confusing them. Work through the given examples, including the examples of instruction design alternatives. Also note the discussion on measuring CPU performance. Although the book refers to CPI, most currently shipping processors can (at least at peak) complete more than one instruction per clock cycle, which leads to a new measure, instructions per clock (IPC), which is just CPI-1. Section 1.7 of the book is a good instroduction of pipelining issues; if you want to be prepared for coverage of these concepts for later in the course, work through this section now. 2.7 Examples: Memory Hierarchy and CPU Speed Trends To put everything in context, here is a discussion of memory hierarchy, and why memory hierarchy is becoming an increasingly important issue, given the current speed (learning curve) trends. Generally, it is easier to make a small memory fast. More components can be used per bit, switching speeds can be made faster, and transmission lines kept short. All of this adds up to more cost and more power consumption as well. Fortunately a principle called locality of reference allows a designer to build memory hierarchically, with a small, fast memory at the top of the hierarchy designed to have the highest rate of usage, with larger, slower memories lower down designed to be used less often. Ideally, the effect should be as close as possible to the cost of the cheapest memory, and the speed of the fastest. A cache today is

21

typically built in at least two levels (32Kbytes or more at the first level or L1; 256Kbytes or more at the second level, L2). A cache is a small, fast memory located close to the CPU (even in many cases, on the same chip for L1). Caches are usually made with static RAM, which uses more components per bit than DRAM, but is faster. A cache is organized into blocks (often also called lines), and a block is the smallest unit fetched into a cache. When a memory reference is found in the cache, it is called a hit, otherwise a miss occurs. In the event of a miss, the pipeline may have to stall for a miss penalty, an extra amount of time relative to a hit (note that some current designs have non-blocking misses, i.e., the processor carries on executing other instructions as long as possible while waiting for a miss but for now, assume a miss has a fixed penalty). When we do memory hierarchy, we will do more detail of caches. For now it is sufficient to think through how to apply a simple variation of Amdahl’s Law to caches, and to predict the impact of cache misses on overall CPU performance. To return to the theme of the course, let’s consider the impact of increasing the CPU clock speed (equivalently, reducing the cycle time) while keeping memory speed constant. Although this is not in practice what is happening (CPU speed doubles roughly every 18 months, while memory speed improves at 7% per year), the effect of the differences in growth rates is that the cost in lost instructions for a DRAM reference doubles roughly every 6 years. Exercise 2.4 requires that you explore the effect of doubling a miss penalty. The kind of numbers you should see in the exercise should be an indication of why caches are becoming increasingly important—as well as why you do not see a linear speed gain with MHz, when the memory hierarchy is not also sped up. 2.8 Further Reading For further information on standard benchmarks, the following web sites as noted in the previous chapter are useful: •
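The sketch below (Python, with hypothetical numbers not taken from the book) folds average memory stall cycles into the CPU time equation, and shows how the same miss rate hurts more as the miss penalty, measured in CPU cycles, grows.

    def effective_cpi(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty_cycles):
        """Base CPI plus the average memory stall cycles per instruction."""
        return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty_cycles

    base_cpi, mem_refs, miss_rate = 1.5, 1.3, 0.03   # hypothetical machine and workload

    for penalty in (20, 40, 80):                     # penalty grows with the CPU-DRAM gap
        cpi = effective_cpi(base_cpi, mem_refs, miss_rate, penalty)
        print(f"miss penalty {penalty:3d} cycles: effective CPI {cpi:.2f}, "
              f"{cpi / base_cpi:.2f}x slower than the no-miss case")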

2.8 Further Reading

For further information on standard benchmarks, the web sites noted in the previous chapter are useful:

• Transaction Processing Performance Council
• Standard Performance Evaluation Corporation


Another good site to look for information is Microprocessor Report’s web site (Microprocessor Report is a trade magazine, widely read by system designers and architecture researchers; it is mostly considered unbiased, but MPR appears to have rather oversold the IA-64, considering its schedule slips).

Dulong [1998] discusses some features of the Intel IA-64 architecture, and some performance results for limited aspects of the architecture have been published [August et al. 1998]. VLIW is briefly discussed in Section 4.4 of the book, starting on page 278. Some researchers [Wulf and McKee 1995, Johnson 1995] have claimed that we are approaching a memory wall, in which all further CPU speed improvements will be limited by DRAM speed. RAMpage is a new strategy for memory hierarchy, in which the lowest level of cache becomes the main memory and DRAM becomes a paging device, to improve the effectiveness of the lowest-level SRAM [Machanick 1996, Machanick and Salverda 1998, Machanick et al. 1998, RAMpage 1997]. Ted Lewis [1996a] discusses learning curves in more detail.

2.9 Exercises

2.1. In a car magazine, separate times are given for acceleration from a standing start to various speeds, time to reach a quarter mile or kilometre from a standing start, braking time from various speeds and overtaking acceleration (e.g. time for 80 to 100 km/h). Since all these numbers are smaller for better performance, it might be tempting for someone from a computer benchmarking background to add them all for a composite score.
a. Think of some reasons for doing this, and for not doing this.
b. What do these arguments tell you about the trend in computer benchmarks to come up with a single composite number (MIPS, SPECmark, SPECint, SPECfp, Bytemark) as opposed to reporting all the individual scores?

2.2. Explain why a geometric mean is used for combining SPEC scores. Give an example where an arithmetic mean would give unexpected results.

2.3. Here is an example of Amdahl’s Law in real life. The trip by plane from Johannesburg to Cape Town takes 2 hours. British Airways, hoping to muscle in on the local market, brings a Concorde in, and cuts the flying time to 1 hour. Calculate the speedup of the overall journey given the following additional data, which remains unchanged across the variation in flying time:
• time to get to airport: 30 min
• time to check in: 30 min
• time to board plane: 20 min
• time to taxi to takeoff: 10 min
• time to taxi after landing: 10 min
• time to leave plane: 30 min
• time to collect luggage: 20 min
• time from airport to final destination: 30 min

If a revolutionary new transport technology were invented which could pick you up from your home, but where the actual travel time from Johannesburg to Cape Town was 4 hours, what would the speedup be:
a. with respect to the original travel time, and
b. with respect to the Concorde travel time?
c. Now restate both answers as a percentage faster.

2.4. Redo the following example with double the miss penalty (50 clock cycles). Base CPI is 2.0; the only data accesses are loads and stores, which are 40% of all instructions; the miss penalty is 25 clock cycles; the miss rate is 2%.
• the CPU time relative to instruction count (IC) and clock cycle time is (CPU cycles + memory stall cycles) × clock cycle time, where memory stall cycles are IC × memory references per instruction × miss rate × miss penalty = IC × (1 + 0.4) × 0.02 × 25 = IC × 0.7
• so total execution time is (IC × 2.0 + IC × 0.7) × clock cycle time = 2.7 × IC × clock cycle time
a. How much slower is the machine now compared with before (choose an appropriate difference or ratio measure), and
b. how much slower compared with the case where there are no misses (leave out the miss cycles from the given calculation)?
c. To what value must the miss rate be reduced to achieve the same performance as when the miss penalty was 25 cycles?


Chapter 3 Instruction Set Architecture and Implementation

There’s much confusion about the difference between RISC and CISC architectures, and whether there’s such a thing as a CISC architecture with RISC-like features. It’s important to understand what these various concepts mean from the perspective of the hardware designer, which in turn provides greater clarity about the technical issues from the perspective of the user.

3.1 Introduction

This chapter covers principles of instruction set architecture, i.e., design choices in creating instruction sets and the architecture seen in programs (registers, memory addressing, etc.), and covers implementation issues, especially as regards how the ISA relates to performance. Issues covered include pipelining and instruction-level parallelism. It is assumed here that you have a basic understanding of the major components of the microarchitecture: registers, pipelines, fetch, decode and execution units. If you don’t, please review this material; Section 1.7 of the book is a nice summary of the basics of pipelining.

The major focus here is on issues which impact on improving performance: those factors which are influential in pushing the learning curve, inhibiting progress, and limiting eventual speed improvements. Material for this chapter of the notes is drawn from Chapters 2, 3 and 4 of the book. This chapter aims to lead you to an understanding of new developments in the field, including the benefits and risks of potential new ideas. You should aim, by the end of the chapter, to be in a position to evaluate designers’ claims for future-generation technologies critically, and to be able to assess which claims are most likely to be delivered.


The remainder of the chapter is structured as follows. Section 3.2 covers instruction set design principles, and clarifies the differences between RISC and CISC. Section 3.3 discusses issues arising out of the previous section as they relate to pipeline design, with some forward references to issues in Section 3.4, which covers techniques for instruction-level parallelism. Section 3.5 discusses factors which limit instruction-level parallelism, with discussion both of current designs and potential future designs. To tie everything together, Section 3.6 relates the chapter to the focus of the course, longer-term trends. The chapter concludes with the usual section on further reading.

3.2 Instruction Set Principles: RISC vs. CISC

This section summarizes Chapter 2 of the book. You should make sure you are familiar with the content, as this is essential background. However, the general trend towards RISC-style designs reduces the need to be aware of a wide range of different architectural styles.

What is RISC? The term reduced instruction set computer originated from an influential paper in 1980 [Patterson and Ditzel 1980], which made a case for less complex instruction sets. However, the term has become so abused that some (including the authors of the book) argue instead for the term “load-store instruction set”. The reason for this is that the RISC movement was fundamentally about simpler instruction set architectures (ISA) that make it easier to implement performance-oriented features like pipelines. Since the idea originated, some have perverted it to imply that implementation features like pipelines are “RISC-like”. By this definition, almost every processor designed with performance in mind since the 1970s would be “RISC-like”. For example, some argue that the Pentium and its successors are “RISC-like” because they make aggressive use of pipelines. Yet the complexity of implementing pipelines on these processors exactly illustrates the RISC argument: if Intel had an instruction set designed for ease of implementing pipelines (which would have been more “RISC-like”), they would have been able to achieve the performance of the Pentium, Pentium Pro, Pentium II, etc. much more easily. The fact that these processors are implemented using techniques common to RISC processors doesn’t make them “RISC-like”; in fact, the complexity of the implementation is exactly why they are not “RISC-like”.


What then are features that the RISC movement has identified as critical to ease of achieving performance goals? Here are some:

• load-store instruction set—memory is only accessed by load instructions (move data from memory to a register) and store instructions (move data from a register to memory)

• minimal addressing modes—have as few as possible, ideally only one, way of constructing a memory address in a load or store instruction

• large set of registers—have a large number of registers (32 or more)

• general-purpose registers—don’t dedicate registers to a specific purpose (e.g. tie a specific register or subset of registers to a specific instruction or instruction type)

• fixed instruction length—instructions are all the same length even if this sometimes wastes a few bits

• similar instruction execution time—as far as possible, instructions take the same amount of time in the execute stage of the pipeline

There are some variations between RISC processors, but all are pretty similar in these respects. Others have gone further in simplicity. Most pre-RISC processors had condition codes: special flags that could be set by many different instructions, and tested in conditional branches. Some RISC designs instead store the result of a test instruction in a register. This way, a boolean operation is no different from other arithmetic, and the order in which instructions appear is less important.

This brings us to the design goals behind the list of features of RISC machines:

• no special cases—if all instructions can be handled the same way as far as possible, except for things which have to be different (e.g. some reference memory, whereas others use the floating-point unit, etc.), a pipeline’s logic can be kept simple and easy to implement

• minimize constraints on order of execution—again an aid to the pipeline designer: if ordering constraints are minimized, instructions can be reordered (either statically, by the compiler, or dynamically, by the hardware) for most efficient processing by the pipeline

• minimize interactions between different phases of instruction execution—again to aid pipelining: for example, if an instruction can both address memory and do arithmetic, some aspect of the ALU (arithmetic and logic unit) may be needed in the address computation, making processing of two successive instructions complicated

• keep instruction timing as predictable as possible—complex addressing modes, or instructions that can have a variable number of iterations (e.g. string operations), make it hard for the pipeline to predict in advance how long an instruction will execute

With these issues in mind, let us now review common ISA features, contrasting the RISC approach with the Intel IA-32 approach (since the new 64-bit Itanium or Merced design was announced, Intel has differentiated the old 32-bit processor line as IA-32, as opposed to the Merced IA-64).

3.2.1 broad classification

Since the way registers are used is central to the organization of work in the CPU, the broadest classification of an ISA is often in terms of the way registers are used:

• accumulator—an accumulator is a register used both as a source and a destination; the display of a pocket calculator is an example of this strategy. An accumulator architecture can have low instruction-fetch traffic, since the instructions can be compact (often the accumulator is not specified, as it is the only register for that instruction type), but a larger fraction of the memory traffic is data moves to and from memory, since the accumulator cannot store results non-destructively

• stack—not strictly a register machine at all (though some stack machines used a few registers to represent the topmost elements of the stack for enhanced speed): a stack machine purely does arithmetic using the topmost item or items on the stack, and data is moved to or from ordinary memory by pop and push instructions, respectively; a stack machine can have very compact instructions, since operands are not explicitly specified, but can have a lot of memory traffic, since the right things have to be on top of the stack for each operation

• register-memory—instructions can combine a memory access and use of a register to perform an operation

• general-purpose register (GPR)—registers can be used for any operation without arbitrary limit; in its purest form, a load-store architecture, in which data can only be moved to or from memory in stores and loads (respectively), and no registers are dedicated to a specific purpose (except possibly the stack pointer, used for the procedure-call stack, and the program counter, used to keep track of the next instruction to execute)

Figure 2.2 on p 102 of the book contains some examples of some variations on these definitions, and Figure 2.4 on p 104 summarizes the advantages and disadvantages of three variations.

A RISC architecture falls into the last category, the GPR machine, with a load-store instruction set. Which category does the Intel IA-32 fall into? Curiously enough, several. Although it looks a bit like a GPR instruction set in that some registers can be used for most operations, some registers are dedicated to specific operations, which makes it look more like an accumulator machine. The floating-point unit, on the other hand, uses a stack architecture, and there are also some register-memory instructions. Since the IA-32 fails some of the other critical tests for a RISC architecture, such as having multiple memory addressing modes and variable-length instructions, it is clearly not a RISC design.

Why is it so complex? When its earliest versions, the 8088/8086, came out, processor design was not as well understood as today, and complex instruction sets were common. Also, memory was expensive, and an instruction set that reduced memory requirements by doing multiple things in one instruction and by making it possible to have shorter instructions for common operations was seen as desirable. In addition, the IA-32 evolved over time, and earlier design decisions could not easily be undone, since older programs relied on them.

The rest of Chapter 2 of the book covers addressing modes, type and size of operands, encoding an instruction set, and compiler issues (all of which you should understand). The example of the MIPS architecture is instructive because it is one of the simplest RISC architectures, and makes it relatively easy to see the principles without extraneous detail.

3.3 Challenges for Pipeline Designers

Given that RISC is meant to make life easier for the pipeline designer, what are the issues that make life harder in designing a pipeline? There are two major concerns:

• issues that stall the pipeline

• issues that make it complicated to treat instructions uniformly


Let’s take each of these in turn. If you have forgotten the basics of pipelining, please review Sections 1.7 and 3.1 of the book. Note that Hennessy and Patterson tend to use a specific five-stage pipeline in their books, but other organizations are possible (e.g., some PowerPC models use a 4-stage pipeline, and both the MIPS R4000 line and the latest designs from AMD have a deeper pipeline).

3.3.1 Causes of Pipeline Stalls

In the ideal case, a pipeline with n stages (assuming for now a simple scalar model, in which at most one instruction is issued per clock) can execute one instruction per clock cycle, with a cycle time which (excluding the overheads of pipelining) is 1/n of the cycle time at which a non-pipelined machine can run. In other words, an ideal pipeline can give approximately n times speedup over a non-pipelined implementation, all else being equal. In practice, there is some overhead in the pipeline, and keeping the clock consistent across multiple stages becomes a problem (clock skew across the pipeline results in timing problems between events at either end of the pipeline). These issues limit the practical scalability of the pipeline (i.e., the maximum useful size of n) even in the ideal case where the pipeline can be kept busy constantly.

Unfortunately a number of problems can arise in attempting to keep the pipeline going at full speed. The general term for a cycle on which the pipeline doesn’t do something useful is a stall. A stall specifically refers to the case where the pipeline is idled. There are also cases in recent designs, like speculative execution, where the pipeline appears to be busy but the work it does is discarded, not necessarily with any stalls involved; even in a simple pipeline, a piece of work may be started and then abandoned, which results in wasted cycles which are not called stalls. The best way of thinking of a stall is to think of it as a bubble in the pipeline.

A stall is most commonly caused by a hazard, an event that prevents execution from continuing unhindered. See Section 1.7 of the book for a description of the three categories of hazards:

• structural—limits on hardware resources prevent all currently competing requests from being granted

• data—a data dependency prevents completion of an instruction before another in the pipeline completes

• control—a change in order of execution (anything that changes the PC)
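As a rough numerical illustration of how stalls erode the ideal speedup, the sketch below applies the standard relation (speedup = pipeline depth / (1 + stall cycles per instruction)) to a few stall rates; the rates are made up for illustration and are not taken from the book.

    #include <stdio.h>

    /* Ideal n-stage pipeline: roughly n times faster than unpipelined execution.
     * With stalls, the average instruction takes (1 + stalls_per_instr) cycles,
     * so the realised speedup drops to n / (1 + stalls_per_instr).
     * This ignores pipelining overheads such as latch delay and clock skew. */
    static double pipeline_speedup(int depth, double stalls_per_instr)
    {
        return depth / (1.0 + stalls_per_instr);
    }

    int main(void)
    {
        const int depth = 5;                            /* classic five-stage pipeline */
        const double rates[] = { 0.0, 0.2, 0.5, 1.0 };  /* assumed stall cycles/instr  */
        for (int i = 0; i < 4; i++)
            printf("stalls per instruction = %.1f -> speedup = %.2f\n",
                   rates[i], pipeline_speedup(depth, rates[i]));
        return 0;
    }

Even half a stall cycle per instruction costs a third of the ideal speedup, which is why so much design effort goes into removing hazards.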

Make sure you understand the modified speedup formula and examples on pp 53-57. Read through and understand the material up to p 69, but note that modern designs are moving increasingly to dynamic scheduling, i.e., the hardware can reorder instructions to reduce hazards, as covered in Section 3.4. Compiler-based approaches are limited in that it’s hard to generalize them (i.e. to go beyond specific cases), and some cases are hard to detect statically (e.g., they depend on whether a given branch is taken or not). Make sure you understand the diagrams used to illustrate pipeline behaviour, and how forwarding works. The description of the implementation of DLX is useful for later examples, but concentrate on the examples: read through and understand the rest of the chapter.

3.3.2 Non-Uniform Instructions

Variations in instructions—whether in the length of time an instruction takes to execute, how it addresses memory, or the length or format of the instruction—make life difficult for the pipeline designer. Let’s look at each of these issues in turn.

From the pipeline designer’s point of view, the ideal case is that each pipeline stage takes exactly the same amount of time. Everything can flow smoothly through the pipeline. As one instruction is fetched, the next can be fetched, and the same with decoding, executing, etc. Not only is the design easy, but time isn’t wasted. If, on the other hand, one instruction takes four clock cycles in its execute stage, others have to wait for it to complete (unless more sophisticated techniques are used, as in Section 3.4). For example, some floating-point instructions are hard to execute in one clock cycle, and there’s little that can be done about this, except possibly converting each such instruction to a sequence of simpler ones. However, other kinds of instruction can be artificially made variable even within one instruction (for example, some processors have instructions that can move a variable number of bytes, where the byte count is an operand). The pipeline designer in this case has to use complex logic to spot such cases and vary the behaviour to suit them, possibly at the expense of increasing the cycle time (because of the extra logic that has to spot these special cases early enough to handle them).

Allowing many variations in memory addressing modes causes similar problems. The pipeline designer has to find out early enough that an addressing mode that may cause the pipeline to stall is involved, and stop other instructions. Again, the extra complexity may make it hard to keep the cycle time down.

Variations in instruction length or format are also a serious problem for the pipeline designer, especially for superscalar architectures (again, see Section 3.4), where multiple instructions must be loaded at once. Just finding instruction boundaries becomes a problem, and inhibits the clean separation between the pipeline stages for loading an instruction and decoding it (the instruction fetch unit has to have some idea what kind of instruction was just fetched to know if it has fetched the right number of bytes for the current instruction, and to know if it has started fetching the next instruction at the right place in memory).

3.4 Techniques for Instruction-Level Parallelism

There are two major approaches for increasing ILP in a pipelined machine:

• deeper pipeline—for example, the MIPS R4000 has an 8-stage pipeline, and is called superpipelined: unlike a standard pipeline, some conceptually atomic operations are split into two stages, particularly memory references (including instruction fetches)

• superscalar execution—also called multiple issue—“issue” is the transition from decode to execute in the pipeline

The two alternative strategies introduce different kinds of problem. A superpipelined architecture doesn’t overlap exactly similar phases of execution of different instructions, but a long pipeline introduces more overhead (the overhead between stages) and makes it difficult to maintain a consistent clock across stages (clock skew occurs when timing across the stages is significantly different as a result of propagation delays in the clock signal). A multiple-issue architecture requires multiple functional units, the units that carry out a specific aspect of the functionality of executing an instruction. A functional unit may be an integer unit or a floating-point unit (sometimes also a more specialized component). Since integer and floating-point operations are in any case done by separate functional units, it is a relatively cheap step to allow the two to be dispatched at once. Another common split is to allow a floating-point multiply and add to be executed simultaneously. This kind of split, a functional ILP, is relatively cheap because it doesn’t require a duplicated functional unit, though more bandwidth from the cache and more sophisticated internal CPU buses are required. A more sophisticated form of superscalar architecture is one which allows similar instructions to be issued in parallel: in this case, functional units have to be duplicated.

In any case, when more than one instruction can be issued, the opportunities for resource conflicts, and hence hazards, increase. A superscalar architecture therefore presents more challenges for pipeline scheduling, choosing the best possible ordering for instructions. Early superscalar architectures required that the compiler do the scheduling (i.e., static scheduling). However, static scheduling has two big drawbacks: some decisions require runtime knowledge (e.g. whether a branch is taken, or whether two different memory-referencing instructions actually reference the same location), and the ideal schedule for a given superscalar pipeline may not be ideal for a different implementation. For example, a lot was said about optimizing code specifically for the Pentium when it first appeared; any work put into this area was not particularly relevant for the Pentium Pro and its successors.

Dynamic scheduling uses runtime information to determine the order of issue, according to available resources and hazards. A dynamic scheduling unit is approximately of the same complexity as a functional unit, which makes it possible to weigh the trade-offs in deciding when to go to dynamic scheduling. When an extra functional unit would raise the peak parallelism above the level that static scheduling can actually exploit, it becomes worth spending the silicon area that the additional functional unit would have needed on dynamic scheduling instead. As with many ideas in RISC microprocessors, dynamic scheduling owes much to work done by Seymour Cray in the 1960s, on the Control Data 6600. His dynamic scheduling unit was called a scoreboard. Make sure you understand the presentation of the scoreboard in Section 4.2 of the book, as well as the alternative, Tomasulo’s algorithm. Interestingly, some of the ideas have started to reappear as novel solutions in the microprocessor world, including register renaming. Also read and understand Section 3.3 (branch prediction) and Section 3.5, which deals with speculative execution.

The Intel IA-64 design (code-named Merced; first implementation Itanium; see Section 4.7) represents an attempt at improved exploitation of ILP. Some of the techniques used include:




• VLIW—several instructions are packaged together as one extra-long instruction (three 41-bit instructions in a 128-bit bundle; the remaining 5 bits form a template the compiler uses to identify parallelism)

• explicitly parallel instruction computing (EPIC)—instructions are tagged with information as to which can be executed at once, both within a VLIW package and in surrounding instructions

• predicated instructions—if the predicate with which an instruction is tagged is false, the instruction is treated as a NOP (a loose software analogue is sketched after this list)

• high number of registers—there are 64 predicate, 128 general-purpose (integer) and 128 floating-point registers, to reduce the need for renaming

• speculative loads—a load instruction can appear before it’s clear that it’s needed (e.g. before a branch), followed later by an instruction that commits the load. If the load would have resulted in a page fault or error, the interrupt is deferred until the commit instruction, which may not in fact be executed (depending on the branch outcome): this is a form of non-blocking prefetch
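Predication has no direct equivalent in C, but the flavour of if-conversion can be seen in branch-free code, where both candidate values are computed and one is selected. The sketch below is only a loose software analogue (compilers often turn the second form into a conditional-move instruction); it is not IA-64 code.

    #include <stdio.h>

    /* Branchy version: the compiler typically emits a conditional branch,
     * which must be predicted; a misprediction flushes the pipeline. */
    static int abs_branch(int x)
    {
        if (x < 0)
            return -x;
        return x;
    }

    /* Branch-free version: both alternatives are computed and one is selected.
     * Predication generalizes this idea in hardware: instructions from both
     * arms are issued, and those whose predicate is false become NOPs. */
    static int abs_select(int x)
    {
        return (x < 0) ? -x : x;
    }

    int main(void)
    {
        printf("%d %d\n", abs_branch(-7), abs_select(-7));
        return 0;
    }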

While the IA-64 contains some interesting ideas, it is not clear, until it actually ships in volume, whether they will be a win. Each time the predicted shipping date slips, the probability of its being faster than competing designs is reduced (it became available in limited quantities in 2000). Some potential problems with the design include:

• VLIW—VLIW is not a new idea and most architecture researchers consider it to have flopped; whether Intel can produce a compiler that can do what is implied by the EPIC model is an open question

• IA-32 support—a design goal is to support the existing Intel architecture as well; whether this can be done on the same chip without compromising performance remains untested

• wins from predication—work published to date is not convincing. Given the short typical basic block in integer code, most predicated instructions will have to be executed speculatively, which is not much different from a conventional branch-based implementation of conditional code

• value of increased register count—while eliminating the need for register renaming is a win, increasing the bits needed to name a register from 5 (for 32 registers) to 7 (for 128 registers) is a significant cost, especially as ALU operations specify three registers (i.e., the bits needed to encode registers increase by 6 in an ALU instruction)


This brings us to the question of what limitations designers are in fact up against.

3.5 Limits on Instruction-Level Parallelism

Early studies on ILP showed that the typical available ILP is around 7 [Wall 1991], reflecting the fact that integer programs typically have small basic blocks (a basic block is a sequence of code with only one entry and one exit point), often only about 5 instructions. Floating-point programs typically have much larger basic blocks and can in principle have higher ILP but, even so, there can be low limits unless extra work is done to extract parallelism.

One of the hardest problems is dealing with branches. Straight-line sequences of code (as we will see in Chapter 4) can be fetched from the memory system quickly, but any change in ordering makes it hard for the memory system to keep up. The combination of branch prediction and speculative execution helps to reduce the times the CPU stalls (although it may sometimes waste instruction execution when it misspeculates—as is also of course the case with IA-64’s predicated instructions). Gains from speculation and predication can occur in conditional code, and for speculation, also in loops. If branch prediction is done right, the combined effect is that of unrolling a loop, an optimization a compiler can also sometimes perform.

One of the issues Merced is aimed at addressing is increasing the window size—the number of instructions examined as a unit—over which ILP is sought. The Intel view is that a compiler can work with a larger window than the hardware can. Figure 3.45 in the book illustrates the effect of window size. Notice how the first three benchmarks, integer programs, have relatively little ILP for any realistic window size. The floating-point programs, while better, range from 16 to 49 for a window size of 512, which is probably unrealistically large. Make sure you understand the issues in Section 3.6 of the book.

3.6 Trends: Learning Curves and Paradigm Shifts

Some architecture researchers believe that Intel is on the right track in looking for compiler-based approaches to increase parallelism, but also that Intel is on the wrong track in looking relatively locally for parallelism (in a bundle of instructions or its neighbours).


For example, in a limit study (a study which drops constraints, without necessarily being practical) carried out at the University of Michigan in 1998, it was found that many data hazards are an artifact of the way the stack is used for stack frames in procedure calls. After the return, the stack frame is popped, and the next call uses the same memory. What appears to be a data dependency across the calls is in fact a false dependency: one which could be removed by changing the way memory is used. In this study, a very much higher degree of ILP was found than in previous studies. Although the study does not reveal how best to implement this high degree of ILP, it does suggest that moving from a sequential procedural execution model to a multithreaded model may be a win.

This reopens an old debate in programming languages: at what level should parallelism be expressed? Here are some examples:

• procedural—implement threads or another similar mechanism as a special case of a procedure or function (e.g. Java)

• instruction-level—find ILP in conventional languages or, better still, in a language designed to expose fine-grained parallelism, but leave it to the compiler rather than the programmer to find the parallelism

• package—add parallel constructs as part of a library (typically the way threads etc. are implemented in object-oriented languages like C++ or Smalltalk, if they are not a native feature as in Java)

• statement-level—a variant on the ILP model, in which the programmer is more conscious of the potential for parallelism, but any statement (or expression, in a functional language) may potentially be executed in parallel

In practice, the biggest market is for code which is already compiled. Recompilation or, worse still, rewriting in a new language, puts the architect into paradigm-shift mode, which remains a hard sell for now. If limits on ILP become a real problem, it’s possible that models that require new languages may start building a following, but this is a prediction that has been made often in the past and it hasn’t happened (unless you call switching from FORTRAN to C++ a paradigm shift, which it is at a certain level).

An intermediate model, a new architecture that requires a recompile, is easier to sell if there is a transition model. The Alpha proved that it could be done: it came with a very efficient translator of VAX code, which allowed VAX users to transition to the Alpha without having to recompile (sometimes hard: the source could be lost, or could be in an obsolete version of the language). Rather than incorporate an IA-32 unit in the IA-64, it may be a better strategy for Intel to write a software translator from IA-32 to IA-64. But the IA-64 project would be a lot more convincing if something shipped. Lateness is never good for paradigm shifts, especially when others have a learning curve they are sticking to and are not late. In this respect, Intel should try to learn from Exponential (see Subsection 2.4.3.1).

3.7 Further Reading

The first paper to use the RISC name appeared in an unrefereed journal [Patterson and Ditzel 1980], indicating that influential ideas sometimes appear in relatively lowly places. An early study of ILP [Wall 1991] put the limit relatively low, reflecting the small basic blocks in integer programs. Look for more recent papers in Computer Architecture News and the two big conferences, ISCA and ASPLOS, for work which raises the limits by relaxing assumptions or through improved software or architectural features. On the Intel IA-64 relatively little has been published, but you can find some preliminary results on aspects of the EPIC idea [August et al. 1998], and an overview of limited aspects of the architecture [Dulong 1998].

3.8 Exercises

3.1. Redo the example of pp 260-261 of the book, but this time with
a. a window size of 32 for the speculative design
b. a window size of 64, this time with double the miss penalty
c. What do the results tell you about the importance of memory in scaling up the CPU?
d. What do the results tell you about the effect of window size?

3.2. Do the following questions from the book:
a. 1.19, 3.7, 3.8, 3.11, 3.17, 3.13.

3.3. Find at least one paper on superscalar design (particularly one featuring speculation and branch prediction), and compare it against the reported results for the EPIC idea [August et al. 1998]. Do the results as reported so far look promising compared with other approaches?



Chapter 4 Memory-Hierarchy Design

Although speeding up the CPU is the high-profile area of computer architecture (some even think of the ISA as what is covered by the term “architecture”), memory is increasingly becoming the area that limits speed improvement. Consider for example the case where a processor running at 1GHz is capable of issuing 8 instructions per clock (an IPC of 8). That means one instruction for every 0.125ns. An SDRAM main memory takes about 50ns to get started, then delivers subsequent references about every 10ns. A rough ballpark figure for moving data into the cache is about 100ns (the exact figure depends on the block size and any buffering and tag setup delays). If 1% of instructions result in a cache miss, and the pipeline otherwise never stalls, the processor actually executes at the rate of 1.125ns per instruction—9 times slower than its theoretical peak rate. Or, to put it another way, instead of an IPC of 8, it has a CPI of 1.125.

4.1 Introduction

This chapter covers the major levels of the memory hierarchy, with emphasis on problem areas that have to be addressed to deal with the growing CPU-DRAM speed gap. The book takes a more general view; please read Chapter 5 of the book to fill in gaps left in these notes.

This chapter starts with a general view of concepts involved in finding a given reference at any level of the memory system: what happens when it’s there, and when it’s not. Next, these principles are specialized to two areas of the hierarchy, caches and main memory. Issues in virtual addressing are dealt with in the same section as main memory, although there are references back to the discussion on caches (as well as in the cache section). The Trends section outlines the issues raised by the growing CPU-DRAM speed gap, which leads to a section on alternative schemes: different ways of looking at the hierarchy, as well as improvements at specific levels.
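The arithmetic in the chapter opening can be reproduced directly; the sketch below uses the same assumed figures (1 GHz clock, 8 instructions issued per clock, a 100 ns miss cost and a 1% miss rate).

    #include <stdio.h>

    int main(void)
    {
        const double clock_ns   = 1.0;    /* 1 GHz clock: 1 ns per cycle        */
        const double issue_rate = 8.0;    /* peak instructions per clock        */
        const double miss_cost  = 100.0;  /* ns to bring a block into the cache */
        const double miss_rate  = 0.01;   /* fraction of instructions that miss */

        double ideal_ns  = clock_ns / issue_rate;            /* 0.125 ns/instruction */
        double actual_ns = ideal_ns + miss_rate * miss_cost; /* 1.125 ns/instruction */

        printf("ideal:  %.3f ns per instruction (IPC %.1f)\n",
               ideal_ns, clock_ns / ideal_ns);
        printf("actual: %.3f ns per instruction (CPI %.3f), %.0f times slower\n",
               actual_ns, actual_ns / clock_ns, actual_ns / ideal_ns);
        return 0;
    }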


4.2 Hit and Miss

Any level in the memory system except the lowest level (furthest from the CPU: biggest, slowest) has the property that a reference to that level, which can be either a read or a write, may either hit or miss. A hit is when the location in memory which is referenced is present at that level; a miss is when it is not present.

Memory systems are generally organized into blocks, in the sense that a unit bigger than that addressed by the CPU (a byte or word) is moved between levels. A block is the minimum unit that the memory hierarchy manages (in some cases, more than one block may be grouped together). At different levels of the hierarchy, a block may have different names. In a cache, a block is sometimes also called a line (the names are used interchangeably, but watch out for slightly different definitions). In a virtual memory system, a block is called a page (though some schemes have a different unit called a segment—often with variations on what this term exactly means). There are also differences in terms for activities: in a virtual memory (VM) situation, a miss is usually called a page fault.

There are however some common questions that can be asked about any level of the hierarchy (this list adds to the book’s list on p 379):

1. block placement—where can a block be placed when moved up a level?
2. block identification—how is it decided whether there is a hit?
3. block replacement—which block should be replaced on a miss?
4. write strategy—how should consistency be maintained between levels when something changes?
5. hardware-software trade-off—are the contents of the level (particularly the replacement strategy) managed in hardware or in software?

Here are some other common issues that make the concept of a memory hierarchy work. The principle of locality makes it possible to arrive at the happy compromise of a large, cheap memory with a small, fast memory closer to the CPU, with the overall effect of a cost closer to the cheaper memory and a speed closer to the expensive memory. Locality is divided into spatial locality: something close to the current reference is likely to be needed soon, and temporal locality: a given reference is likely to be repeated in the near future.


A combined effect of the two kinds of locality is the idea of a working set: a subset of the entire address space which it is sufficient to hold in fast memory. The size of the working set depends on how long a time window you measure it over. Where the speed gap between the fast level and the next level down is small, the working set is measured over a smaller time than if the gap is large. For example, in a first-level (L1) cache with a reasonably fast L2 cache below it, the working set could be measured over a few million instructions, whereas with DRAM, where the cost of a page fault runs to millions of instructions, a working set may be measured over billions of instructions.

To conclude this section, here is a little more basic terminology. Since performance is a function of hits and misses between levels, designers are interested in the miss rate, which is the fraction of references that miss at a given level. The miss penalty is the extra time a miss takes relative to a hit. Read 5.1 in the book and make sure you understand everything.

4.3 Caches

Make sure you understand the four questions as presented in the book. Here are a few additional points to note.

Most caches until fairly recently were physically addressed, i.e., the virtual address translation had to be completed before the tag lookup could be completed. Some caches did use virtual addresses, so the physical address was only needed on a miss to DRAM. The advantage of virtually addressing the cache is that the address translation is no longer on the critical path for a hit, which cache designers aim to make as fast as possible, to justify the expensive static RAM they use for the cache (and of course to make the CPU as fast as possible). However, virtually addressing a cache causes a problem with aliases, different virtual addresses that relate to the same physical page. One use of aliases is copy-on-write pages, pages that are known to be initially the same, but which can be modified. A common use of copy on write is initializing all memory to zeros: only one physical page actually needs to be set to zeros initially, and the copy-on-write mechanism is then set up so that as soon as one of the pages is modified, it gets its own copy, which is all zeros except for the bytes that the write modifies. The problem for a virtually addressed cache is that the alias has to be recognized in case there is a write. The MIPS R4000 has an interesting compromise in which the cache is virtually indexed but physically tagged.


In this approach, the virtual address translation can be done in parallel with the tag lookup. In general, virtual address translation, when it is coupled with cache hits, has to be fast. The standard mechanism is a specialized cache of page translations, called the translation lookaside buffer, or TLB (from the weird name, we can conclude that IBM invented the concept)—of which more in the next section.

Read through Section 5.2 in the book, and make sure you understand the issues. In particular, work through the performance examples. Also go on to 5.3, and understand the issues in choosing between the various strategies for reducing misses. Make sure you understand the causes of, and differences between, compulsory (or cold-start) misses, capacity misses and conflict misses.

Another issue not mentioned in this section of the book is invalidations. If a cached block can be modified at another level of the system (for example in a multiprocessor system, where another processor shares the same address space, or if a peripheral can write directly to memory), the copy in a given level of cache may no longer be correct. In this situation, the other system component that modified the block should issue an invalidation signal, so that the cache marks the block as no longer valid. Invalidations arise out of the problem of maintaining cache coherency, a topic which is briefly discussed in Section 5.9 of the book, and in more detail in Section 8.3.

Another issue not covered in this part of the book (but see p 698 for further justification of the idea) is maintaining inclusion: if a block is removed from a lower level, it should also be removed from higher levels. Inclusion is not essential (some VM systems don’t work this way), but ensuring that smaller levels are always a subset of larger levels makes for significant simplifications.

Two key aspects of caches are associativity and block size (also called line size). A cache is organized into blocks, which are the smallest unit which can be transferred between the cache and the next level down. Deciding where to place a given block when it is brought into the cache can be simple: a direct-mapped cache has only one location for a given cache block. In this case, replacement policy is trivial: if the previous contents of that location are valid, they must be replaced. However, a more associative cache, one in which a given block can go in more than one place, has a lower miss rate, resulting from more choice in which block to replace. An n-way associative cache has n different locations in which a given block can be placed.


While associativity reduces misses, it increases the complexity of handling hits, as each potential location for a given block has to be checked in parallel to see if the block is present. It is therefore harder to achieve fast hit times as associativity increases (not impossible: the PowerPC 750 has an 8-way associative first-level cache).

Block size influences miss rate as well. As the size goes up, misses tend to reduce because of spatial locality. However, as the size goes up, temporal locality misses also increase, because the amount displaced from the cache on a replacement goes up. Also, as block size goes up, the cost of a miss increases, as more must be transferred on a miss. There is therefore an optimal block size for a given cache size and organization (possibly to some extent dependent on the workload, since the balance between temporal and spatial locality may vary).

Note also the variations on write policy. A write through cache never has dirty blocks, blocks which are inconsistent with the level below, because all writes result in the modification being made through to the next level. This makes a write through cache easier to implement, especially for multiprocessor systems. However, writing through increases traffic to memory, so the trend in recent designs is to implement write back caches, which only write dirty blocks back on a replacement (and hence need a dirty bit in the cache tags). Note also variations on policy for allocating a block on a write miss.

Most modern systems have at least two levels of cache. In such cases, cache miss rates can be expressed as a global miss rate (misses as a fraction of all references) or a local miss rate (misses as a fraction of the references that reach that level). The local miss rate for an L2 cache may be quite high, since it only sees misses from L1, which will usually be a low fraction of all memory traffic. It can also be useful (especially since most recent designs split the instruction and data caches in L1, the I- and D-caches) to give separate miss rates for data and instructions. It is also sometimes useful to split the read and write miss rates. Read misses may or may not include instructions, depending on the issue under investigation. Note that L2 or lower caches are usually not split between data and instructions. The reason for the split at L1 level is so that an instruction fetch and a load or store’s memory reference can happen on the same clock. With superscalar designs, doubling the bandwidth available to the caches by splitting between I- and D-caches becomes even more important at L1. Since L2 sees relatively little traffic, there is less need to use ploys like split I- and D-caches.
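The global/local distinction is easiest to see in an average memory access time calculation; the figures in the sketch below are made up for illustration and are not from the book.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed, illustrative parameters (times in CPU cycles). */
        const double l1_hit   = 1.0;    /* L1 hit time                               */
        const double l2_hit   = 10.0;   /* L2 hit time                               */
        const double mem_pen  = 100.0;  /* penalty for going all the way to DRAM     */
        const double l1_miss  = 0.05;   /* L1 miss rate (fraction of all references) */
        const double l2_local = 0.40;   /* L2 local miss rate (fraction of L1 misses)*/

        double l2_global = l1_miss * l2_local;  /* as a fraction of ALL references */
        double amat = l1_hit + l1_miss * (l2_hit + l2_local * mem_pen);

        printf("L2 local miss rate  = %.2f\n", l2_local);
        printf("L2 global miss rate = %.3f\n", l2_global);
        printf("average memory access time = %.2f cycles\n", amat);
        return 0;
    }

Note how a local miss rate of 40% corresponds to a global miss rate of only 2% of all references, which is why the two figures must not be confused.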


Emphasizing the importance of caches, Section 5.4 of the book contains even more detail, on reducing the miss penalty, and Section 5.5 on reducing the hit time. Read through all this and make sure you can do all the quantitative examples. What is the answer to my question 5? Why? Can you answer all the other questions? Make sure you understand the way cache tags are used, including the bits for the state of a block and the bits which identify the address of the contents of a block.

4.4 Main Memory

Main memory has traditionally (if since the 1970s can be called “tradition”) been made of dynamic RAM (DRAM), a kind of memory that is slower than the SRAM used for caches. Unlike SRAM, it has a refresh cycle which slows it down, and its access method is slower. Read the description of memory technology, noting the way a typical DRAM cycle is organized (row and column access). Understand the discussion of improvements to DRAM organization—of which more in Section 4.5—including the calculation of miss penalty from the cache, wider memory and interleaved memory. Be sure you can do the quantitative examples.

Note that the problem with any scheme that widens memory (including interleaving and multiple banks) is that it increases the cost of an upgrade. If the full width has to be upgraded at once, more chips are required in the minimal upgrade. Multiple independent banks are commonly used in traditional supercomputer designs, where cost is less important than performance. Some have hundreds, even thousands, of banks. When we do Direct Rambus in Subsection 4.6.2, you will see that, as with the RISC movement’s borrowing of supercomputer CPU ideas, the mass market is attempting to find RAM strategies that mimic supercomputer ideas in cheaper packaging.

Since the 1980s, there have been various attempts at improving DRAM, taking advantage of a cache’s approach to spatial locality, relatively large blocks (which are in effect a prefetch mechanism: bringing data or code into the cache before it’s specifically referenced). Since the slow part of accessing DRAM is setting up the row access, it makes sense to use as much of the row as possible.


Fast page mode (FPM) allows the whole row to be streamed out. Extended Data Out (EDO) is a more recent refinement, by which the data remains available while the next address is being set up. Synchronous DRAM (SDRAM) followed EDO: the DRAM is clocked to the bus, and is designed to stream data at bus speed after an initial setup time. None of these variations improves the underlying cycle time, but they make DRAM appear faster, particularly if more than one sequential access is required. If the bus is 64 bits wide and a cache block is 32 bytes, 4 sequential accesses are required, so making those after the initial one faster is a gain. Some caches have blocks as big as 128 bytes, in which case a streaming mode in the main memory is an even bigger win.

Again note the standard 4 questions, and my 5th question. In this instance, the memory is managed using a virtual memory (VM) model. The discussion of pages versus segments is largely of historical interest, as almost all currently shipping systems use pages (including the Intel IA-32: the segmented mode is for backwards compatibility with the 286). Note though the idea of multiple page sizes, available in the MIPS and Alpha ranges. The TLB section is more important. Make sure you understand it. The TLB is a much-neglected component of the memory hierarchy when it comes to performance tuning. A program with memory references scattered at random, while having few cache misses because the references fit in the L2 cache, may still thrash the TLB (in one case, I found that improving TLB behaviour reduced run time by 25%). Work through the Alpha example, and make sure you understand the general principles.

4.5 Trends: Learning Curves and Paradigm Shifts

One of the most important issues in the memory hierarchy is the growing CPU-DRAM speed gap. Figure 5.2 in the book illustrates the growing gap. Note that this is a log scale; on a linear scale, the growth in the gap is even more dramatic. What’s missing is a curve for static RAM, which is improving at about 40% per year, about the same rate as CPU clock speed (CPU speed overall is increasing faster because of architectural improvements, like more ILP). Given this growing speed gap, caches are becoming increasingly important. As a result, cache designers are being forced to be more sophisticated, and larger off-chip caches are becoming common.
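To see why the gap grows so quickly, compound the annual improvement rates. The sketch below assumes roughly 40% per year for CPU clock and SRAM speed (as quoted above) and, purely as an illustrative assumption, about 7% per year for DRAM speed.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double cpu_rate  = 0.40;  /* ~40%/year: CPU clock and SRAM (as above)     */
        const double dram_rate = 0.07;  /* ~7%/year: DRAM speed (assumed, illustrative) */

        for (int years = 0; years <= 10; years += 5) {
            double gap = pow(1.0 + cpu_rate, years) / pow(1.0 + dram_rate, years);
            printf("after %2d years the CPU-DRAM speed gap has grown by %.1f times\n",
                   years, gap);
        }
        return 0;
    }

With these assumed rates the gap grows by roughly a factor of 4 every five years, which is why it dominates memory-hierarchy design.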


Given that it’s hard (and expensive) to make a large, fast cache, it seems likely that there will be an increasing trend towards more levels of cache. L1 will keep up with the CPU, L2 will attempt to match the CPU cycle time (at worst, a factor of 2 to 4 times slower), and L3 can be several times slower, as long as it’s significantly faster than DRAM.

If DRAM isn’t keeping up on speed, what keeps it alive? The dominant trend in the DRAM world is improving density, the number of bits you can buy for your money. Every 3 years, you can buy 4 times the DRAM for your money—a consistent long-term trend, even if there are occasional short-term spikes in the graph (e.g. when a new RAM-hungry OS is released, like the initial release of OS/2 or Windows 95). For this reason, DRAM remains the memory of choice for expandability, and for bridging the gap between cache and the next, much slower, level, disk. In Section 4.6, some alternative RAM and memory hierarchy organizations are considered. Here, a few improvements on caches are considered.

The SRAM speed trend more or less tracks CPU clock speed (about 40% faster per year), but what of architectural improvements, particularly increased ILP? How can a cache keep up? The L2 (or even lower levels) in general can’t: the best that can be done is to keep up with CPU clock speed. The fastest L2 caches in use today are clocked at the CPU clock speed (for example, some PowerPC versions, particularly upgrade kits, have caches clocked at CPU speed—even if CPU speed is somewhat slower on the PPC range than on its Intel competitors), and L1 cache is generally clocked at the same speed as the CPU clock.

Several techniques can be used to handle ILP. Where contiguous instructions are fetched (the common case, even if branches are frequent), a wider bus from the cache is a sufficient mechanism. The fact that loads and stores do not occur on every instruction puts less pressure on the data cache (recall that there are usually separate I- and D-caches at L1). However, the fact that memory-referencing instructions may at times refer to very different parts of memory (e.g. when accessing simple variables or following pointers, as opposed to multiple contiguous array references, or accessing contiguously-stored parts of an object) makes a wider path from the D-cache less of a win than is the case for the I-cache. For the I-cache, there is also the problem of instruction fetches on branches. To handle both problems, there is a variety of possible options:




• multiported caches—a multiported memory can handle more than one unrelated reference to different parts of the memory at once

• multibanked caches—like a main memory system with multiple banks, a multibanked cache can support unrelated memory accesses to different banks, staggered to avoid doing more than one transaction on the bus simultaneously

• multilateral caches—an idea under investigation at the University of Michigan: the idea is to partition data into multiple caches, allowing the possibility that a cache hit can happen simultaneously in more than one cache, without the complexities of multiporting or multiple banks

All of these approaches have different advantages and disadvantages. A multiported cache is the best approach if the cost is justified, because it is the most general approach, but the other approaches are cheaper. Multiple banks are only a win if unrelated references happen to fall within different banks (the probability of this depends on the number of banks—more is better—and the distribution of the references). A multilateral cache also relies on being able to partition references, making reasonable predictions of which references are likely to occur in instructions which are close together in the instruction stream.

4.6 Alternative Schemes

4.6.1 Introduction

There are many variations on DRAM being investigated. In general, the variations do not address the underlying cycle time of DRAM, but attempt to improve the behaviour of DRAM by exploiting its ability to stream data fast once a reference has been started. A key idea derives from the fact that the traditional DRAM access cycle starts with a row address strobe (RAS) signal, which makes a whole row of a 2-dimensional array of bits available. Once the row is available, it’s a relatively quick operation, as in traditional RAM access, to select a specific bit with a column address strobe (CAS). Meanwhile, the remainder of the row is wasted. The improved DRAM models attempt to use the remainder of the row creatively (in effect caching it for future references). This section only reviews the one most likely to become commonly accepted, Direct Rambus, or Rambus-D. To show how else the problem of slow DRAM may be addressed, an experimental strategy called RAMpage is also briefly reviewed.
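The value of streaming is easiest to see in the time to fill a cache block: one slow row activation, then a fast transfer per bus-width beat. The sketch below uses the SDRAM-style figures mentioned earlier (about 50 ns startup, then about 10 ns per transfer on a 64-bit bus); the numbers are illustrative only.

    #include <stdio.h>

    /* Time to fill one cache block: a row activation (startup), then the row is
     * streamed out one bus-width "beat" at a time. */
    static double block_fill_ns(double startup_ns, double beat_ns,
                                int block_bytes, int bus_bytes)
    {
        int beats = block_bytes / bus_bytes;
        return startup_ns + beats * beat_ns;
    }

    int main(void)
    {
        const double startup = 50.0;  /* ns before the first data arrives (assumed) */
        const double beat    = 10.0;  /* ns per subsequent transfer, 100 MHz bus     */
        const int    bus     = 8;     /* 64-bit bus: 8 bytes per beat                */
        const int    sizes[] = { 32, 64, 128 };

        for (int i = 0; i < 3; i++) {
            double t = block_fill_ns(startup, beat, sizes[i], bus);
            printf("%4d-byte block: %5.0f ns (%.2f bytes/ns effective)\n",
                   sizes[i], t, sizes[i] / t);
        }
        return 0;
    }

Larger blocks amortize the startup cost and so get closer to the streaming bandwidth, which is the effect the DRAM variants below try to exploit.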


4.6.2 Direct Rambus

Direct Rambus, or Rambus-D, is a development of the original Rambus idea. Rambus, as originally specified, was a complete memory solution including the board-level design of the bus, a chip set and a particular DRAM organization. The original idea was relatively simple: a 1-byte-wide (8-bit) bus was clocked at relatively high speed, to achieve the kind of bandwidth that a conventional memory system achieved. A typical memory system today has a 64-bit bus, 8 times wider than the original Rambus design. The rationale behind Rambus was that it was easier to scale up the speed of a narrow design, even if it had to be several times faster for the same effective bandwidth, and it allowed upgrades in smaller increments. A byte-wide bus only needs byte-wide DRAM upgrades, whereas a 64-bit bus requires 8 byte-wide upgrades.

The original Rambus design scored one significant design win, the Nintendo 64 games machine, based on Silicon Graphics technology (a cheapened version of the MIPS R4x00 processor is used). However, it was not adopted in any general-purpose computer system. Rambus, like other forms of DRAM, has the same underlying RAS-CAS cycle mechanism, but aimed to stream data out rapidly once the RAS cycle was completed (in effect treating the accessed row bits as a cache).

Rambus-D has a number of design enhancements to make it more attractive. It is 2 bytes wide, doubling the available bandwidth for a given bus speed, and has a pipelined mode in which multiple independent references can be overlapped. Compared with a 100MHz, 128-bit bus SDRAM memory, the timing is very similar. To achieve the same bandwidth on a bus 8 times narrower, though, instead of a 10ns cycle time, the Rambus-D design of 1999 had a cycle time of 2.5ns, and could deliver data on both the rising and falling clock edge, delivering data at a rate of once every 1.25ns (or 800MHz). This gave Rambus-D a peak bandwidth of 1.5Gbyte/s, competitive with high-end DRAM designs of its time. However, as with SDRAM, there was an initial (comparable) startup delay, of 50ns. If this is compared against the time of 1.25ns for each data transfer, this initial delay is substantial.


This is the problem the pipelined mode was meant to address: provided sufficient independent references could be found, up to 95% of available bandwidth could be utilized on units as small as 32 bytes. However, it has yet to be demonstrated that such a memory reference pattern is a common case.

It’s possible to make a memory system with multiple Rambus channels. However, the initial time to start an access is not improved by this kind of parallel design (any more than is the case with a wide conventional bus). What is improved is the overall available bandwidth, which may be a useful gain in some cases.

4.6.3 RAMpage

The RAMpage architecture is an attempt at exploiting the growing CPU-DRAM speed gap to do more sophisticated management of memory. The idea is that the characteristics of DRAM access are starting to look increasingly like the characteristics of a page fault, at least on very early VM systems, when disks and drums were a lot faster in relation to the rest of the system than they are today. For example, the first commercially available VM machine, the British Ferranti Atlas, had a page fault penalty of (depending on how it was measured) a few hundred to over 1000 instructions. By contrast, a modern system’s page fault penalty is in the millions of instructions.

What are the potential gains from a more sophisticated strategy? A hardware-managed cache presents the designer with a number of design issues which require compromises:

• ease of achieving fast hits—less associativity makes it easier to make hits fast

• reducing misses—increasing associativity reduces misses by reducing conflict misses

• cost—less associativity is cheaper, since the controller is less complex, particularly where speed is an issue (more associativity means more logic to detect a hit, which makes for more expense in achieving a given speed goal, as well as requiring more silicon to implement the cache)

Some recent designs have illustrated the difficulty of dealing with these conflicting goals. The Intel Pentium II series packages the L2 cache and its tags and logic in one package with the CPU. While this makes it possible to have a fast interconnect between the components without relying on the board-level design, it limits the size of the L2 cache. The PowerPC 750 has its L2 tags and logic on the CPU chip, which again constrains the L2 design (in this case, it is limited to 1Mbyte, though unlike the Pentium II, the PPC 750 design did allow for more than one level of off-chip cache).

Another issue of concern is finding a suitable block size for the L2 cache. The design of DRAM favours large blocks, to amortize the cost of setting up a reference (see Questions 4.3, 4.4 and 4.5). On the other hand, the block size that minimizes misses is dependent on the trade-off between spatial and temporal locality. A large block size favours spatial locality, as it has a large prefetch effect, but can impact on temporal locality, as it displaces a larger fraction of the cache when it is brought in (assuming it caused a replacement, and wasn’t allocated a previously unused block). Increasing the size of a cache tends to favour larger blocks, since the effect of replacing wanted blocks is reduced as the size goes up. See p 394 of the book for data on the effect of block size. Also, minimizing misses is not the sole concern: a very large block size may reduce misses over a smaller size, but not sufficiently to recoup the extra miss penalty incurred from moving a large block.

The premise of the RAMpage model is that as cache sizes reach multiple Mbytes and miss penalties to DRAM start to increase to hundreds or even over 1000 instructions, the interface between SRAM and DRAM starts to look increasingly like a page fault. The RAMpage model adjusts the status of the layers of the memory system so that the lowest-level SRAM becomes the main memory, and DRAM becomes a paging device. Since a large SRAM main memory favours a large block (now called a page) size, the interface becomes very similar to paging. The major differences over managing the SRAM as a cache are:

• easy hits—a hit can easily be made fast, since it simply requires a physical address; the TLB in the RAMpage model contains translations to SRAM main memory addresses and not to DRAM addresses

• full associativity—a paged memory allows a virtual page to be stored anywhere in physical memory, which is in effect fully associative. Unlike a cache, the associativity does not cause complexity on hits

• slower misses—misses, on the other hand, are inherently slower, as they are handled in software; the intent is that RAMpage should have fewer misses to compensate for this extra penalty

• variable page size—the page size, unlike the block size with a conventional cache design, can relatively easily be changed, since it is managed in software (some support in the TLB is needed for variable page sizes, but this is a solved problem: several current architectures, including MIPS and Alpha, already have this feature); this allows the potential to fine-tune the page size depending on the workload, or even a specific program

• context switches or multithreading on misses—it becomes possible to do other work to keep the CPU busy on a miss to DRAM, which is especially viable with large SRAM page sizes; since something else is being done while waiting for a miss, using a large page size to minimize misses works, whereas with a cache, the miss penalty reduces the effectiveness of large blocks

The RAMpage model is an experimental design which has been simulated with useful results, suggesting that with 1998-1999 parameters, it is competitive with a 2-way associative L2 cache of similar size, despite the higher miss (page fault) penalty resulting from software management. Later work has shown that errors in the simulation favoured the conventional hierarchy, and that RAMpage, as simulated, is a significant win at today’s design parameters. The growing CPU-DRAM speed gap allows scope for interesting new ideas like RAMpage, and it is likely that results from a number of new projects addressing the same issues will be published in the near future.

4.7 Further Reading

The impact of the TLB on performance is an area in which some work has been published, but more could be done [Cheriton et al. 1993, Nagle et al. 1993]. Direct Rambus [Crisp 1997] is starting to move into the mainstream now that Intel has endorsed it, and is likely to appear in mass-market designs. The RAMpage project has its own web site [RAMpage 1997], and several papers on the subject have been published [Machanick 1996, Machanick and Salverda 1998a,b, Machanick et al. 1998, Machanick 2000]. There has been a fair amount of work as well on improving L1 caches to keep up with processor design [Rivers et al. 1997]. A very useful overview of VM issues has appeared recently [Jacob and Mudge 1998], showing that even though this is an old area, good work is still being done. Look for further material in the usual journals and conferences.

4.8 Exercises

4.1. Add in to Figure 5.2 of the book a curve for SRAM speed improvement, at 40% per year.
a. Which is growing faster: the CPU-DRAM speed gap, the CPU-SRAM speed gap, or the SRAM-DRAM speed gap?
b. What does your previous answer say about where memory-hierarchy designers should be focussing their efforts?

4.2. Caches are (mostly) managed in hardware, whereas the replacement and placement policies of main memory are managed in software.
a. Explain the trade-offs involved.
b. What could change a designer’s strategy in either case?

4.3. Assume it takes 50ns for the first reference, and each subsequent reference for an SDRAM takes 10ns (i.e., a 100MHz bus is used). Redo the calculation on p 430 of the book (assume the cycles in the example are bus cycles) under these assumptions for the following cases:
a. a 32-byte block on a 64-bit bus
b. a 128-byte block on a 64-bit bus
c. a 128-byte block on a 128-bit bus

4.4. Assume Rambus-D takes 50ns before any data is shipped, and 2 bytes are shipped every 1.25ns after that. What fraction of peak bandwidth is actually used for each of the following (assuming no pipelining, i.e., each reference is the only one active at any given time):
a. a 32-byte block
b. a 64-byte block
c. a 128-byte block
d. a 1024-byte block
e. a 4Kbyte block

4.5. Redo Question 4.4, assuming now that you have a 4-channel Rambus-D, i.e., the initial latency is unaffected, but you can transfer 8 bytes every 1.25ns. What does this tell you about memory system design?


Chapter 5 Storage Systems and Networks
If the memory hierarchy presents a challenge in terms of keeping up with the CPU, I/O and interconnects (internal and external) are even more of a challenge. A memory system at least operates with cycle times in tens of nanoseconds; peripherals typically operate in tens of milliseconds. As with DRAM, where the initial startup time is a major fraction of overall time and it's hard to achieve the rated bandwidth with small transfer units, I/O and interconnects have a hard time achieving their rated bandwidths with small units, but the problem is even greater. Amdahl's famous Law was derived because of experience with the Illiac supercomputer project, where the I/O subsystem wasn't fast enough, resulting in poor overall performance. A balanced design requires that all aspects be fast, as excessive latency at any level can't be made up for at other levels.
5.1 Introduction
This chapter contains a brief overview of Chapters 6 and 7 of the book. Since networks are covered separately in a full course, Chapter 7 is only reviewed as it relates to general I/O systems, and as background for multiprocessors (see Chapter 6 of these notes). The major issues covered here are the interconnect within a single system, particularly buses, I/O performance measures, RAID as it relates to the theme of the course, operating system issues, a simple network, the interconnect including media, practical issues and a summary of bandwidth and latency issues. By way of example, in the Trends section (5.10), the Information Mass Transit idea is introduced as an example of balancing bandwidth and latency. For background, read through and understand Sections 6.1 and 6.2 of the book; we will focus on the areas where performance issues are covered.
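To put a number on the balanced-design point made at the start of this chapter, here is a minimal Amdahl's Law sketch; the 90%/10% split between CPU time and I/O time is a hypothetical figure, not one from the book:

    #include <stdio.h>

    /* Amdahl's Law: overall speedup when a fraction f of the time is
     * accelerated by a factor s and the rest (e.g. I/O) is unchanged. */
    static double amdahl(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        double f = 0.9;                      /* hypothetical: 90% CPU, 10% I/O */
        double speedups[] = { 2, 10, 100, 1e6 };
        for (int i = 0; i < 4; i++)
            printf("CPU %8.0fx faster -> overall %.2fx\n",
                   speedups[i], amdahl(f, speedups[i]));
        return 0;   /* overall speedup approaches 1/(1-f) = 10 */
    }

However fast the CPU becomes, the unimproved 10% caps the overall gain at a factor of 10.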


5.2 The Internal Interconnect: Buses
A bus in general is a set of wires used to carry signals and data; the common characteristic of any bus is a shared medium. Other kinds of interconnect include point-to-point—each node connects directly to every node it needs to communicate with—and switched, where each node connects to another as needed. Most buses inside a computer are parallel (multiple wires), but there are also common examples of serial buses (e.g. ethernet). Like any shared medium, a bus can be a bottleneck if multiple devices contend for it. Generally, off the processor, a system has at least two buses: one for I/O, and one connecting the CPU and DRAM. The latter is often split further, with a separate bus for the L2 cache. Review the example in Figure 6.7 in the book (a signal where we don't care whether it's a 1 or a 0, but want to record a transition, or a bus where each line could have a different value, is represented as a pair of lines, crossing where a transition could occur), and make sure you understand the design issues (pp 509-512). Interfacing to the CPU is also an interesting issue: the general trend is towards off-loading control of I/O from the CPU. Note that memory-mapped I/O and direct memory access (DMA) introduce some challenges for cache designers. Should the portion of the address space which belongs to the device be cached? If so, how should the cache ensure that the CPU sees a consistent version of the data if the device writes to DRAM? This is an example of the cache coherency problem, which is addressed in more detail in Section 6.4 of these notes.
5.3 RAID
Make sure you understand the definitions of reliability and availability, and how RAID (redundant array of inexpensive disks) addresses both. Although reliability, availability and dependability are important topics, we will leave them out of this course for lack of time, and only look at RAID as an example of bandwidth versus latency trade-offs. Of relevance to the focus of the course is the relationship between bandwidth and latency in RAID. It is important to understand that putting multiple disks in parallel does not shorten the basic access time, so latency is not improved by RAID (in fact the transaction time could be increased by the complexity of the controller), except that, as discussed under performance measures in Section 5.4, higher throughput can improve latency by reducing time spent in queues.


However, if the device is not highly loaded, this effect is not significant. Note the various mixes of parallelism permitted by the different RAID levels.
5.4 I/O Performance Measures
The section on I/O performance measures (6.5 of the book) is where we get to the heart of the matter. Make sure you understand throughput and response time issues (or think of them as bandwidth and latency). Note the knee in the curve in Figure 6.23 in the book, where there is a sharp change in slope. This is a typical latency vs. bandwidth curve, and results from loss of efficiency as requests increase. In a network like ethernet which has collisions, the effect is caused by an increase in collisions. In some I/O situations, like an operating system sending requests to a disk, the losses come from longer queueing delays (the same effect, but the delay occurs in a different place). Note the discussion on scaling up bandwidth by adding more devices in parallel, and how it does not improve latency (except to the extent that queueing delays may be reduced). You should know how to apply Little's Law (see the queueing theory discussion) and understand the derived equations. Work through the given examples. Of particular interest, by comparison with CPU performance measures, is the fact that common I/O performance measures are designed to scale up: the dataset or other measure of the problem size has to be scaled up as the rated performance is increased. Look for example at how TPS benchmarks are scaled up in Figure 6.30 of the book. The idea of a self-scaling benchmark (pp 545-547) is also interesting. Note that this idea should not only apply to I/O. Make sure you understand the issues and how details of the system configuration (caches, file buffers—also called caches—etc.) can impact scalability of performance.
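To make the queueing point concrete, here is a minimal sketch using Little's Law and the standard M/M/1 result for mean response time; the disk service time and arrival rates are hypothetical, not figures from the book:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical disk: mean service time 10ms => service rate 100/s */
        double service_rate = 100.0;
        double arrival_rates[] = { 20, 50, 80, 95 };   /* requests per second */

        for (int i = 0; i < 4; i++) {
            double lam  = arrival_rates[i];
            double util = lam / service_rate;            /* server utilization     */
            double t    = 1.0 / (service_rate - lam);    /* M/M/1 mean resp. time  */
            double n    = lam * t;                       /* Little's Law: N = lambda * T */
            printf("load %3.0f/s: utilization %.2f, response %5.1f ms, %.1f in system\n",
                   lam, util, t * 1000.0, n);
        }
        return 0;
    }

Response time grows sharply as utilization approaches 1, which is the knee in the throughput/response-time curve discussed above.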


5.5 Operating System Issues for I/O and Networks
The remainder of Chapter 6 of the book contains some interesting material; make sure you understand the general issues. Work through the examples in Section 6.7, and understand the issues raised in 6.8 and 6.9. There are also some operating system issues in the networking chapter, in Section 7.10. Find these issues, and relate them to more general I/O issues.
5.6 A Simple Network
Let's move on now to the networking chapter, and review Section 7.2 quickly. Make sure you understand the definitions and can work through the examples. Note particularly the impact of overhead on the achievable bandwidth, and relate the delivered bandwidth in Figure 7.7 to the data in Figure 7.8.
5.7 The Interconnect: Media and Switches
Section 7.3 of the book is a useful overview of physical media, though not very important for the focus of the course. Of more interest is Section 7.4, which covers switched vs. shared media. Do not spend too much time on the more exotic topologies. For practical purposes, it's useful to understand the differences between ATM and ethernet, and the potential for adding switches into ethernet. The more complex topologies were classically used to make exotic supercomputers which have since died. However, they may still have uses in specialized interconnects (e.g., a crossbar switch to implement a more scalable interconnect than a bus in a medium-scale shared-memory multiprocessor, or to implement the internals of an ATM switch). Switches have two major purposes:

• eliminating collisions—a loaded network with a single medium using collision detection, as ethernet does, loses an increasing fraction of its available bandwidth to collisions and retries as the workload scales up; by partitioning the medium and queueing traffic, switches prevent collisions
• partitioning the medium—if the traffic can be partitioned, the available bandwidth becomes higher (depending on how partitionable the traffic is): if n different routes can be simultaneously active, there is n times the bandwidth of a shared medium
Switches have two major drawbacks:
• they add cost—a single medium like coax cabling is cheap, whereas a switch, if it is not to introduce significant additional latency, is generally expensive
• workloads don't always partition—for example, if a computer lab boots off a single server and most network traffic is between the server and lab machines, the only useful way to partition the network is to put a much faster segment between the server and a switch, which splits the lab

Make sure you understand the definitions and issues raised in “Shared versus Switched Media” on pp 610-613. Be aware of the range of topologies available, but do not spend too much time on “Switch Topology”. Note that crossbar switches are starting to appear in commercial designs as a multiprocessor interconnect (previously they were only used in exotic supercomputers, but SGI now uses them in relatively low-end systems). Note the discussion of routing and congestion control; these are more issues for a networks course. However, cut-through and wormhole routing (p 615) are used in some parallel systems: you should have some idea of how they work; work through the CM-5 example on p 616.
5.8 Wireless Networking and Practical Issues for Networks
Read through the section on wireless networking (7.5). It is interesting but beyond the scope of this course. Also read the section on practical issues (7.6); there are some interesting points but they do not relate closely to the focus of the course.
5.9 Bandwidth vs. Latency
It is useful at this stage to revisit all areas of I/O and networks where latency and bandwidth interact. Generally, bandwidth is easy to scale up (subject to issues like cost and physical limitations, such as space in a single box): you just add more components in parallel. Latency, however, is hard to fix after the event if your basic design is wrong. For example, if you want a transaction-based system to support 10,000 transactions per second with a maximum response time of 1s, and you design your database so it can't be partitioned, you run into a hard limit of the access time of a disk (typically 10ms; some are faster but 7ms is about the fastest in common use at time of writing). The average time per transaction to fit the required number into 1s is 0.1ms. Clearly, something different has to be done to achieve the desired latency: a solution like RAID, for example, will not help.


This is one of the reasons that IBM was able to maintain a market for their mainframes when microprocessors exceeded mainframes' raw CPU speed in the early 1990s. While IBM made very large losses, they were able to recover, because no one had a disk system competitive with theirs. They lost the multiuser system and scientific computation markets, but were able to hold onto large-scale transaction processing. Here's some historical information, summarized from the first edition of Hennessy and Patterson, to illustrate the principles. The development of a fast disk subsystem was not part of the original IBM 360 design; this followed as a result of IBM's early recognition that disk performance was vital for large-scale commercial applications. The dominant philosophy was to choose latency over throughput wherever a trade-off had to be made. The reason for this is the view (often found to be valid in practice) that it's much easier to improve throughput by adding hardware than to reduce latency (see the discussion of disk arrays and the effect of adding a disk for an example of adding throughput in this way). Put another way: you can buy bandwidth, but you have to design for latency. The subsystem is divided into the following hierarchies:

• control—the hierarchy of controllers provides a range of alternative paths between memory and I/O devices, and controls the timing of the transfer
• data—the hierarchy of connections is the path over which data flows between memory and the I/O device

The hierarchy is designed to be highly scalable; each section of the hierarchy can contain up to 64 drives, and the IBM 3090/600 CPU can have up to 6 such sections, for a total of 384 drives, or a total capacity of over 6 trillion bytes (6TB, where a terabyte is 1024GB) using IBM 3390 disks. Compare this with a SCSI controller that can support up to 7 devices (with current technology, maybe 63GB). The channel is the best-known part of the hierarchy. It consists of 50 wires connecting 2 levels of the hierarchy. 18 are used for data (8 + 1 for parity in each direction), and the rest are for control information. Current machines' channels transfer at 4.5MB/s. (Contrast this with the maximum rate of SCSI disks: about 4MB/s, though this is in only one direction at a time.) Each channel can connect to more than one disk, and there are redundant paths to increase availability (if a channel goes down, performance drops, but the machine is still usable).


Goals of I/O systems include supporting the following:

• low cost
• a variety of I/O devices
• a large number of I/O devices at a time
• low latency

High expandability and low latency are hard to achieve simultaneously; IBM achieves both by using hierarchical paths to connect many devices. This, plus all the parallelism inherent in the hierarchy, allows simultaneous transfers, so high bandwidth is supported. At the same time, using many paths instead of large buffers to accommodate a high load minimizes latency. Channels and rotational position sensing help to ensure low latency as well. The key to good performance in the IBM system is a low rate of rotational position misses and low congestion on channel paths. This architecture was very successful, judging both from the performance it delivered and the duration of its dominance of the industry.
If we cross now to networks, ATM is having a hard time making headway against established technologies like ethernet, because the major gain is from switching, something that can be added on top of a classically shared-medium standard. ATM's fixed-size packet is meant to make routing easier, but the potential wins have to be measured against the fact that ATM cells are a poor fit to existing designs, which all assume ethernet-style variable-length packets. Possibly a system designed around ATM from scratch would perform better, but ATM starts from the poor position of having extra latency added up front: splitting larger logical packets into ATM cells, and recombining them at the other end.
5.10 Trends: Learning Curves and Paradigm Shifts
The internet provides an interesting exercise in mismatches in learning curves. Bandwidth is expanding at a rapid rate globally, at about 1.78 times per year, a doubling roughly every 14 months. At the same time, no one is paying particular attention to latency.
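As a quick check on the growth arithmetic (1.78 is the figure quoted in the text; the 5-year horizon is just for illustration):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double growth = 1.78;                       /* quoted annual growth factor */
        double doubling_years = log(2.0) / log(growth);
        printf("doubling time: %.1f years (about %.0f months)\n",
               doubling_years, doubling_years * 12.0);
        printf("growth over 5 years: about %.0fx\n", pow(growth, 5.0));
        return 0;   /* compile with -lm */
    }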


A fast personal computer today can draw a moderately complex web page too fast to see the redraw, but the end-to-end latency to the server may be in minutes. As a result, quality of service is a major inhibiting factor to serious growth of internet commerce. Using the internet to shop for air tickets, for example, is viable, given that the latency is competing with a trip to a travel agent (or being put on hold because the agent is busy). However, using the internet to deliver highly interactive services is problematic. Another problematic area is video on demand (VoD). The general VoD problem is to deliver any of hundreds of movies with minimal latency (at least as good as a VCR), with the same facilities as one would expect on a VCR: fast forward, rewind, pause, switch to another movie. Currently shipping designs for VoD are very complex, and require supercomputer-like sophistication to scale up. The key bottleneck in VoD designs unfortunately is latency, rather than bandwidth. Although faster computers can reduce overheads, the biggest latency bottlenecks are in the network and disk subsystems, the very areas where learning curves are not driving significant improvements. Disks, for example, have barely improved in access time in 5 years, while network latency, as noted before, is not seriously addressed in new designs. The next section illustrates some alternative approaches that can be used to hide the effects of high latency.
5.11 Alternative Schemes
5.11.1 Introduction
This section introduces a novel idea, Information Mass Transit (IMT), and applies it to two examples: VoD and transaction-based systems. The basic idea is that in many high-volume situations it is relatively easy to provide not only the required bandwidth but potentially several times it, yet latency goals are intractable if each transaction is handled separately. Handling each transaction separately is analogous to every commuter driving their own car. While this seems to be the most efficient method in terms of reducing delays waiting for fixed-schedule transport or fitting in with non-scheduled shared options like lift clubs, the overall effect when the medium becomes saturated is slower than a mass transit-based approach. To take the analogy further, the IMT idea proposes exploiting the relative ease of scaling up bandwidth to fake meeting latency requirements, by grouping requests together along with more data than is actually requested.


An individual request has to wait longer than the theoretical minimum latency of the architecture, but the aggregated effect is shorter delays, i.e., lower latency. To illustrate the principles, an approach to video on demand is presented, followed by an alternative model for disks in a transaction-based system.
5.11.2 Scalable Architecture for Video on Demand
A proposed scalable architecture for video on demand (SAVoD) is based on the IMT principle. Conventional VoD requires that every request be handled separately, which makes it hard to scale up. A full-scale VoD system has the complexity (and expense) of a high-end supercomputer. Various compromises have been proposed, like pre-booking a movie to reduce latency requirements, or clustering requests so that actions like fast forward and rewind have a latency based on skipping forward or backward to another cluster of requests. All of these approaches still require significant complexity in the server (and, as the number of users scales up, the network). The SAVoD approach is to stream multiple copies of each available movie a fixed time interval apart. For example, if the desired latency for VCR operations is at worst 1 minute, and the typical movie is 120 minutes long, 120 copies of the movie are simultaneously active, spaced 1 minute apart. VCR-like operations consist of skipping forward or backward between copies of a stream, or to the start of a new stream. There are a few complications in implementation, but the essential idea is very simple. A server consists of little more than an array of disks streaming into a network multiplexor, which has to time the alternative streams to their required separation before encoding them on the network. The main interconnect would be a high-capacity fibre-optic cable. At the user end, a local station would connect multiple users to the high-speed interconnect, with a low-speed link to the individual subscriber. This low-speed link would only have to carry the bandwidth required for one channel, as well as spare capacity for control signals.
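A rough estimate of the server bandwidth such a design implies; the per-stream rate is an assumed HDTV broadcast figure of about 19 Mbit/s, which is not given in the text:

    #include <stdio.h>

    int main(void) {
        double movies        = 1000.0;   /* titles on offer                       */
        double movie_minutes = 120.0;    /* typical length                        */
        double separation    = 1.0;      /* minutes between staggered copies      */
        double stream_mbit   = 19.0;     /* assumed HDTV broadcast rate, Mbit/s   */

        double streams    = movies * (movie_minutes / separation);
        double total_tbit = streams * stream_mbit / 1e6;
        printf("%.0f simultaneous streams, about %.1f Tbit/s in total\n",
               streams, total_tbit);
        /* Note: the total is independent of the number of subscribers. */
        return 0;
    }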


A SAVoD implementation with 1,000 120-minute movies at HDTV broadcast standard and 1-minute separation would require a 2Tbit/s bandwidth, but smaller configurations with fewer movies and lower-standard video could be implemented to get the standard started. A particular win for the SAVoD approach is that the bandwidth required depends on the video standard and number of movies, not the number of users. In fact, as the number of users scales up, an increasingly improved service can be offered, as the cost is spread over a larger client base. Latency is not an issue at the server and network levels; if a higher-quality VCR service is required, the current movie can be buffered on local disk either at the local access point, or at the subscriber.
5.11.3 Disk Delay Lines for Scalable Transaction-Based Systems
Transaction-based systems are another example of an application area where it is hard to meet latency goals the traditional way as the system scales up. Examples include airline reservation systems, large web sites (which increasingly overlap other kinds of transaction-based systems) and banking systems. Although there are many other factors impacting on latency, it's easy to see that disk performance is a big factor. If the basic access time is 10ms, the fastest transaction rate a disk can handle with one access per transaction, for a worst-case response time of 1s, is only 100 transactions per second (TPS). If disk time (probably more realistically) is taken as at worst half the transaction time, only 50 TPS can be supported. This is why systems with a very high TPS rating typically have 50-100 disks. How can the IMT idea help here? Let's consider where a disk's strength lies: streaming. Assume we have a high-end disk capable of streaming at 40Mbyte/s and a relatively small database of 1Gbyte. Can we persuade such a device to allow us to reach 100,000 TPS, with only 0.5s for disk access? Assume each transaction needs to access 128 bytes. Then at 40Mbyte/s, one transaction takes about 3µs. This looks promising. But if we simply sweep all the data out at 40Mbyte/s, the worst-case time before any one part of the database is seen is 25.6s, about 50 times too long. Simple solution: use 50 disks, synchronized to stream the database out at equally-spaced time intervals; the arithmetic is sketched below.
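A minimal sketch of the delay-line arithmetic, using the figures from the example above (1 Gbyte is taken as 1024 Mbyte, matching the 25.6s in the text):

    #include <stdio.h>

    int main(void) {
        double db_bytes     = 1024e6;    /* database size: 1 Gbyte (1024 Mbyte) */
        double stream_rate  = 40e6;      /* one disk streams at 40 Mbyte/s      */
        double disks        = 50.0;      /* staggered copies of the delay line  */
        double tps          = 100000.0;  /* target transactions per second      */

        double sweep_time = db_bytes / stream_rate;   /* one disk's full sweep     */
        double worst_case = sweep_time / disks;       /* delay seen by any request */
        double buffered   = tps * worst_case;         /* worst-case outstanding requests */

        printf("one-disk sweep time : %.1f s\n", sweep_time);
        printf("worst-case delay    : %.2f s with %.0f disks\n", worst_case, disks);
        printf("requests to buffer  : %.0f\n", buffered);
        return 0;
    }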


Now the worst-case delay is approximately 0.5s, where we want it to be. To be able to handle 100,000 TPS, we need to be able to buffer requests for the worst-case scenario where every request waits the maximum delay. Since the maximum delay is approximately 0.5s, we need to buffer 50,000 requests. Writing presents a few extra problems, but should not be too hard to add to the model. What this idea suggests is that disk manufacturers should give up the futile attempt at improving disk latency and concentrate instead on the relatively easy problem of improving bandwidth, for example by increasing the number of heads and the speed of the interconnect to the drive.
The general idea of a storage medium which streams continuously is not new. Some of the oldest computer systems used an ultrasonic mercury delay line, which was a tube filled with mercury. An ultrasonic sound, encoding the data, was inserted at one end of the tube. Since the speed of sound is relatively low even by the primitive standards of the computer world in the 1950s, several hundred bits could be stored in a relatively small tube, utilizing the delay between putting the data in one end and retrieving it as it came out the other end. The idea proposed here is a disk delay line. The only real innovation relative to the original idea is having multiple copies of the delay line in parallel to reduce latency (i.e., the delay before the required data is seen). The TPS in the calculation is not necessarily achievable, as the rest of the system would have to keep up: a high-end TPS rating in 1998 is only in the order of a few thousand, on a large-scale multiprocessor system with up to 100 disks. How much of the overall system is simply dedicated to the problem of hiding the high latency of disks in the conventional model is an important factor in assessing further whether the disk delay line model would work.
5.12 Further Reading
The internet learning curve is described by Lewis [1996b]. More can be found on the Information Mass-Transit ideas mentioned here on the Information Mass-Transit web pages [IMT 1997], including papers which have been submitted for publication. Consult the usual journals and conferences for issues like advances in RAID and network technology.


5.13 Exercises
5.1. Do questions 6.1, 6.2, 6.4, 6.5, 6.10, 7.1, 7.2, 7.3, 7.10.
5.2. Derive a general formula for the achievable TPS in the disk delay line approach, taking into account the following (assume one access per transaction):
• size of database

• transfer rate of the disk
• number of disks in parallel
• size of individual transaction
• size of buffer
• maximum disk time allowed per transaction

5.3. Discuss the use of network switches in the network and server configuration we have in our School (including equipment shared with other departments). What is the role of switches and routers here, in terms of bandwidth issues?
5.4. Note the impact of overhead on the achievable bandwidth, and relate the delivered bandwidth in Figure 7.7 to the data in Figure 7.8. Also consider the following figure (derived from one in an earlier edition of the book), which compares delivered throughput for 155Mbit/s ATM and 10Mbit/s ethernet (think about your answer to question 7.10):
[figure: delivered throughput vs. offered load, 155Mbit/s ATM compared with 10Mbit/s ethernet]

a. What does this tell you about network design?
b. If a new network design has a very high bandwidth but packets are complex to create, what kind of performance would you expect with a typical NFS workload?

c. Given the data in the figure above, if you had the choice between deploying ATM and partitioning traffic on a university network using switches in ethernet, which would you choose? Why?
5.5. Another possible application of Information Mass Transit is a variation on multicast, in which several alternative multicast streams are maintained simultaneously. If a receiver drops a packet from one stream, they drop back to a later stream. Discuss this alternative versus:
a. individual connections for every receiver
b. conventional multicast



Chapter 6 Interconnects and Multiprocessor Systems
Scaling up a single-processor system is a challenge: does adding more processors make things easier or harder? Certainly, it has long been predicted that the CPU improvement learning curve will hit a limit and that the only way forward will be multiprocessor systems. Each time, however, before the predicted limit has been reached, a paradigm shift has saved the day (for example, the shift to microprocessors in the 1980s). Still, multiprocessor systems fill an important niche. For some applications which can be parallelized, they provide the means to execute faster—before a new generation of uniprocessor arrives which is just as fast. In other cases, a multiprogramming job mix can be executed faster. The traditional multiuser system—historically a mainframe or minicomputer—is a classic example of a multiprogramming requirement, but a modern window-based personal computer also has many processes running at once, and even multithreaded applications, which can potentially benefit from multiple processors. One factor, though, is pushing multiprocessor systems into the mainstream: Intel is running out of ideas to speed up the IA-32 microarchitecture, so multiprocessor systems are becoming increasingly common. On the other hand, slow progress with IA-64 and competition from AMD is forcing Intel to squeeze more speed out of IA-32—so the long-predicted death of the uniprocessor remains exaggerated. Be that as it may, there is still growing interest in affordable multiprocessor systems. The interconnect is an important part of the scalability of multiprocessor systems. Scalability has historically meant that very big designs are supported. I argue that scalability should include very small designs; a system that only makes economic sense as a very large-scale design does not have a mass-market “tail” to drive issues like development of programming tools.

6.1 Introduction
This chapter contains a brief overview of Chapter 8 of the book, with some backward references to Chapter 7, since networks provide some background for multiprocessors. The authors of the book clearly agree, as they have added a section (7.9) on clusters. The major focus is on shared-memory multiprocessors, since this is the common model in relatively wide use. A variant on shared memory, distributed shared memory (DSM), is also covered, along with issues in implementation and efficiency. The order of topics in the remainder of the chapter is as follows. Section 6.2 covers classification of multiprocessor systems, while Section 6.3 classifies workloads. Section 6.4 introduces the shared-memory model, with the DSM model in Section 6.5. Memory consistency and synchronization, important issues in the efficiency of shared-memory and DSM computers, are handled together in Section 6.6. In Section 6.7, issues relating to a range of architectures including uniprocessor systems are covered. To conclude the chapter, there are sections on trends (6.8) and alternative models (6.9).
6.2 Types of Multiprocessor
Read through Section 8.1 of the book. Note the arguments for considering multiprocessor systems, and the taxonomy (a classification based on objective, as opposed to arbitrary, criteria) of parallel architectures into SISD, SIMD, MISD and MIMD. There is another argument for MIMD shared-memory machines not given with the taxonomy, but which appears later (p 679). Such a machine can be programmed in the style of a distributed-memory machine by implementing message-passing in shared memory. This means that whichever model of programming is more natural can be adopted on a shared-memory machine. On a message-passing machine, on the other hand, implementing shared memory is complex and likely to result in poor performance. The DSM model covered later is a hybrid, with operating system support for faking a logical (programmer-level) shared memory on a distributed system. Another common term which is introduced is symmetric multiprocessor (SMP). An SMP is a multiprocessor system on which all processors are conceptually equal, as opposed to some models where specific functionality (e.g. the OS, or work distribution) may be handled on specialized (or specific) processors.

Most shared-memory systems are SMP. Make sure you understand the differences between the models, the communication strategies as they relate to memory architecture, performance metrics, the advantages of the various communication models and the challenges imposed by Amdahl's Law. In particular, in terms of the focus of the course, think about the effect of buying a machine with a fixed bus bandwidth and memory design, with the intention of buying faster processors over the next 3 to 5 years. Work through the examples on pp 680-681 and pp 681-683.
6.3 Workload Types
Section 8.2 of the book lists several parallel applications. Make sure you understand how the different applications are characterized, and how they differ with respect to communication and computation behaviour. Note also the multiprogramming workload, and how it differs. A key difference is that the parallelism here is between separate programs, with only limited communication in the form of UNIX pipes. In this scenario, the operating system is likely to generate most synchronization activity, as it has to maintain internal data structures like the scheduler's queues across processors.
6.4 Shared-Memory Multiprocessors (Usually SMP)
Understand the basic definition of a shared-memory system at the start of Section 8.3 of the book. Note that data in a shared-memory system is usually maintained at the granularity of cache blocks, and a block is typically tagged as being in one of the following states (a small state-machine sketch follows the list):

• shared—in more than one processor's cache and can be read; as soon as it is written, it must be invalidated in other caches
• modified—in only one cache; if another cache reads or writes it, it is written back; if the other cache writes it, it becomes invalid in the original cache, otherwise shared
• exclusive—only in this cache: can become modified, shared or invalid depending on whether its processor writes it, another reads it, or another modifies it (after a write miss)
• invalid—the block contains no valid memory contents (either as a result of invalidation, or because nothing has yet been placed in that block)
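The list above amounts to a small state machine per cache block. The sketch below captures those transitions from the point of view of one cache's copy; the event names are ours, and a real protocol adds detail such as the bus transactions and writebacks involved:

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } BlockState;
    typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } Event;

    /* Next state of one cache's copy of a block, following the list above.
     * others_have_copy only matters when filling an invalid block on a read. */
    static BlockState next_state(BlockState s, Event e, int others_have_copy) {
        switch (s) {
        case INVALID:
            if (e == LOCAL_READ)  return others_have_copy ? SHARED : EXCLUSIVE;
            if (e == LOCAL_WRITE) return MODIFIED;   /* write miss: fetch and own it */
            return INVALID;                          /* remote traffic: no change    */
        case SHARED:
            if (e == LOCAL_WRITE)  return MODIFIED;  /* other copies are invalidated */
            if (e == REMOTE_WRITE) return INVALID;   /* our copy is invalidated      */
            return SHARED;
        case EXCLUSIVE:
            if (e == LOCAL_WRITE)  return MODIFIED;
            if (e == REMOTE_READ)  return SHARED;
            if (e == REMOTE_WRITE) return INVALID;
            return EXCLUSIVE;
        case MODIFIED:
            if (e == REMOTE_READ)  return SHARED;    /* after writing the block back */
            if (e == REMOTE_WRITE) return INVALID;   /* after writing the block back */
            return MODIFIED;
        }
        return s;
    }

    int main(void) {
        const char *name[] = { "invalid", "shared", "exclusive", "modified" };
        BlockState s = INVALID;
        Event trace[] = { LOCAL_READ, REMOTE_READ, LOCAL_WRITE, REMOTE_READ };
        for (int i = 0; i < 4; i++) {
            s = next_state(s, trace[i], /*others_have_copy=*/0);
            printf("after event %d: %s\n", i, name[s]);
        }
        return 0;
    }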

There are variations on this model, as we will see in Section 6.6. Also, the DSM model has some variations on management of sharing. The key issue to understand is how communication happens, and what some of the pitfalls are. If one process of a multiprocessor application communicates with another, it writes to a shared variable. There are several issues in ensuring such programs are correct, involving synchronization, which we won't go into in much detail (see Section 6.6 again); here, we will concern ourselves more with performance. First, understand what cache coherence is, and the difference between private and shared data. Go through the descriptions of approaches to coherence, including the performance issues. Note the differences in the issues raised for performance of multiprocessor applications and multiprocessor workloads. A few issues are important for achieving good performance in shared-memory multiprocessing applications:

• minimal sharing—although the shared-memory model makes it look cheap to share data, it is not. Invalidations are a substantial performance bottleneck in shared-memory applications, so writing shared variables extensively is not scalable; for example, if a shared counter logically has to be updated for every operation performed but is only tested infrequently, keep a local count in each process, and update the global count as infrequently as possible from the local count
• avoid false sharing—because memory consistency (or coherence) is managed on a cache-block basis, false sharing can occur: read-only data coincidentally in the same cache block as data which is modified can cause unnecessary invalidations and misses, since accessing non-modified data is not distinguished from accessing modified data in the same cache block; the solution is to waste memory by aligning data to cache block boundaries and padding it if necessary so only truly shared data is in a given block (see the sketch after this list)
• minimal synchronization—related to minimal sharing in that synchronization is a form of communication, but also, the less often a process has to wait for others, the less likely you are to run into a load imbalance (where one or more processors has to wait an extended time for others) as well
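A minimal sketch of the padding idea from the false-sharing item above; the 64-byte block size is an assumption, and a real program would take the block size from the target machine:

    #include <stddef.h>
    #include <stdio.h>

    #define CACHE_BLOCK 64  /* assumed cache block size in bytes */

    /* Two per-processor counters that share a cache block: every write by one
     * processor invalidates the other processor's copy of the block, even
     * though neither ever reads the other's counter (false sharing). */
    struct unpadded {
        long count[2];
    };

    /* Padding each counter out to a full block keeps data that is private to
     * each processor in separate blocks, at the cost of wasted memory. */
    struct padded {
        struct {
            long count;
            char pad[CACHE_BLOCK - sizeof(long)];
        } per_cpu[2];
    };

    int main(void) {
        printf("unpadded: counters %zu bytes apart (same cache block)\n",
               offsetof(struct unpadded, count[1]) - offsetof(struct unpadded, count[0]));
        printf("padded:   counters %zu bytes apart (separate cache blocks)\n",
               offsetof(struct padded, per_cpu[1]) - offsetof(struct padded, per_cpu[0]));
        return 0;
    }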

All these factors tend to combine to produce a less-than-linear speedup for most practical applications. See for example Figure 8.24, which illustrates how a program where data is not optimally allocated to fit cache blocks without false sharing has increased coherence misses as the block size increases.


Note also how high a fraction of data traffic can be due to coherence (i.e. communication). However, on the positive side, a multiprocessor application can in some circumstances achieve a better speedup than one might expect. If the dataset is much larger than the cache of any available uniprocessor system, partitioning the workload across multiple processors whose total cache is bigger than the dataset can give better than linear speedup—provided communication is low.
6.5 Distributed Shared-Memory (DSM)
A DSM architecture attempts to get the best of both worlds. A distributed-memory architecture generally has a more scalable interconnect than a shared-memory architecture, because the interconnect is often partitioned (see the arguments for switches in networks in Section 5.7 of these notes). However, the basic interconnect speed is slower, as it is usually a network like ethernet. The essential model looks like a network of uniprocessor (sometimes shared-memory multiprocessor) machines, in which a global shared memory is faked by the operating system—often using the paging mechanism. Since a page fault is a relatively expensive operation (network latency is typically of the same order as disk latency, and, except for higher-end networks, transfer rate tends to be slower: 100baseT, for example, can at peak achieve about 11 Mbyte/s), communication is an even more difficult issue in DSM than in ordinary shared memory. Typically, DSM models attempt to work around problems like false sharing in the large transfer units (page-sized, 4 Kbytes or more) by maintaining consistency on smaller units. Some caches also do this (for example, a 128-byte block may have consistency maintained at the 32-byte level). Read through the description of DSM and directory-based protocols. Directory-based protocols have also been used for large-scale shared-memory systems (snooping doesn't scale well). Note also the performance issues, and make sure you can do the examples. The key issues, though, are handled in the next section, where memory consistency is covered in more detail.


6.6 Synchronization and Memory Consistency
There is some link between synchronization and consistency. Often, synchronization is needed to ensure communication is done correctly, which implies writes and hence that memory consistency should be maintained. To address performance issues, it is useful to consider the two together. First, let's consider issues in synchronization. The primitives depend on the programming style of the parallel application. The simplest primitive is a lock, which is generally needed even if higher-level constructs are built out of locks. A lock controls access to a critical region, a section of code where serialization is important to ensure that a write operation is atomic (a situation where the result of a write depends on the order of processing is called a race condition). Locks can be implemented in various ways. The simplest is a spinlock: the code attempts to test a variable and set it in an atomic operation (it is usual to have an instruction designed to do this atomically); if the test fails, the spinlock loops. A spinlock is very expensive in memory operations, as every time it is set, it generates an invalidation to every process that is trying to read it, followed by a flurry of attempts at writing it: the one process that wins the race to write it succeeds in the test-and-set operation, while the others fail and loop again, and have to reload the value in their caches (at the same time converting it from exclusive to shared in the winner's cache, usually also resulting in a writeback to DRAM). Since it's important for the cache coherence mechanism to handle locks correctly, a natural next step is to incorporate the locking mechanism into the coherence mechanism. While various approaches are covered in the book, none go as far as some implemented by researchers. Work through the variations on locks, and make sure you can compute how much bus traffic each generates. Also note the alternatives of a queueing lock and exponential backoff. Two other primitives are semaphores and barriers. A semaphore is in essence a counter, which is decremented when gaining access to a resource, and incremented when giving it up. A semaphore can be used to control the number of simultaneous users of the resource (e.g. by setting the initial value to some n > 1, n processes can access the resource before accesses are blocked). Semaphores usually have a queue associated with them to keep sleeping processes in order.
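A minimal test-and-set spinlock along the lines described above, using C11 atomics. This is a sketch rather than a production lock; real implementations typically spin on an ordinary read before retrying the atomic operation (test-and-test-and-set), and add back-off, to reduce the coherence traffic just described:

    #include <stdatomic.h>
    #include <stdio.h>

    /* Minimal test-and-set spinlock using C11 atomics. Every failed
     * test-and-set is a write attempt, so contended use generates the
     * invalidation traffic described in the text. */
    typedef struct { atomic_flag flag; } spinlock_t;

    static void lock(spinlock_t *l) {
        while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
            ;   /* spin; a better version re-reads before retrying */
    }

    static void unlock(spinlock_t *l) {
        atomic_flag_clear_explicit(&l->flag, memory_order_release);
    }

    int main(void) {
        spinlock_t l = { ATOMIC_FLAG_INIT };
        static int shared_counter = 0;

        lock(&l);             /* critical region: the update is serialized */
        shared_counter++;
        unlock(&l);

        printf("counter = %d\n", shared_counter);
        return 0;   /* single-threaded here, purely to show the usage pattern */
    }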


A barrier is used to hold up one or more processes until another process completes an event. For example, in a time-stepped simulation, it is common to force all processes to synchronize at the end of a timestep to communicate. A barrier in its simplest form can be implemented using a semaphore, or equivalently a lock and a counter, but does not scale up well. Work through the rest of the section on synchronization, and make sure you can do the examples. Section 8.6 of the book goes on to memory consistency; note how the issues interact with synchronization. The most important principle exploited by designers of relaxed models of consistency is that programmers don't actually want race conditions, so any writes to shared variables are likely to be protected by synchronization primitives. Understand the general ideas, but spend more time on the performance issues than on specifics of each model.
6.7 Crosscutting Issues
It is interesting to note that the increased level of ILP and the growing CPU-DRAM speed gap are pushing more and more multiprocessor issues into the uniprocessor world. Some of these ideas include nonblocking caches and latency hiding. Other issues common to the two areas are inclusion and coherence, both useful in uniprocessor systems to deal with non-CPU memory accesses (typically DMA). A nonblocking cache allows the CPU to continue across a miss until the data or instruction reference is actually needed. In its original form, it was used with nonblocking prefetch instructions, instructions that did a reference purely to force a cache miss (if needed) ahead of time. In uniprocessor designs, the same idea has surfaced in the Intel IA-64 (though of course the processor could be used in multiprocessor systems if it ships), but it is also becoming increasingly common for ordinary misses to be nonblocking, especially in L1. If the miss penalty to L2 is not too high, there is some chance that a pipeline stall can be avoided by the time the processor is forced to execute the instruction that caused the miss (or complete the fetch, if it was an I-miss). Misses to DRAM are becoming too slow to make nonblocking L2 access much of a win. The rest of the chapter contains interesting material but we will not cover it in detail.


6.8 Trends: Learning Curves and Paradigm Shifts
The biggest paradigm shift in recent years has been the recognition that the biggest market is for relatively low-cost high-performance systems. The rate of advance of CPU speed makes a highly upgradable system largely a waste, as the cost of allowing for many processors, a large amount of memory, etc., is high. If a mostly-empty box is bought, a year or two later the expensive scalable infrastructure is too slow for a new CPU to make sense; it is better to replace the whole thing with a lower-end new system that's faster than the old components. Instead, the trend is towards smaller-scale parallel systems, with options of clustering multiple small units to get the effect of a larger one. SGI has been effective at this strategy, though they have lost ground in recent years through poor strategy in other areas. The tailing off in microarchitecture improvement in the IA-32 line is creating new impetus in the small-scale SMP market at the lower end, and it seems likely that the clustered hybrid shared-memory-DSM model will gain ground as a result. The clear loser in recent years has been a range of alternative models: MPP (massively parallel processor) systems, distributed-memory systems, and SIMD systems. All of these models have the problem that the focus of the designers was scaling up, not down. If a mass-market version of a system cannot be made, it has a high risk of failure for the following reasons:

• poor programming tools—there's nothing like a mass-market application base to stimulate development tool innovation (compare a Mac or PC-based environment with most UNIX-based tools: the UNIX-based tools only really score in being on a more robust platform, an advantage slowly eroding over time as mass-market operating systems discover the benefits of industrial-strength protection and multitasking)
• high marginal profit or loss—if a PC manufacturer loses a sale, it's one sale in thousands to millions; if a specialized supercomputer manufacturer loses a sale, it could be a large fraction of their year's revenues (Cray Research, when SGI bought the company, had annual sales of around $2-billion; at the typical cost of a supercomputer, that's about 200 machines per year)
• limited market—the more specialized designs not only were unsuited to wider markets like multiprogramming workloads, but also to a wide range of parallel applications: they were based on specialized programming models that did not suit all large-scale computation applications
• high cost of scalability—if conventional multiprocessor systems suffer from the problem that large-scale configurations are expensive, this is even more true of specialized designs, which often employed exotic interconnects; to make matters worse, some designs (like the Hypercube) could only be upgraded in increasingly large steps, to maintain their topology

6.9 Alternative Schemes
There are various alternative implementations of barriers, which reduce the amount of global communication. Another alternative is to have parts of the computation only do local communication, which adds up to global communication over the totality of the parts. For example, in a space-based simulation such as a wind tunnel simulation, local regions of space could communicate with nearest neighbours by spinning on the nearest neighbour's copy of the global clock. When the neighbour's clock matches the current unit of space's clock, it can go on to the next timestep. While distributed shared memory has problems with communication speed, there is increasing interest in hybrid schemes, which allow larger systems to be built up out of clusters of smaller multiprocessor systems. The basic building blocks are similar to a conventional shared-memory system, with a fast interconnect which may nonetheless be slower than a traditional bus. SGI, for example, builds such systems, based on the Stanford DASH project. Look for more examples in the book.
6.10 Further Reading
There has been some work on program structuring approaches which improve the cache behaviour of multiprocessor systems, and which attempt to avoid the need for communication primitives like barriers which scale poorly [Cheriton et al. 1993]. The ParaDiGM architecture [Cheriton et al. 1991] contains some interesting ideas about coherency-based locks as well as the notion of scaling up shared memory with a hierarchy of buses and caches, while the DASH project [Lenoski et al. 1990] was one of the earliest to introduce latency-hiding strategies (an issue now with uniprocessor systems). Tree-based barriers attempt to distribute the synchronization overhead, so a barrier does not become a hot-spot for global contention for locks [Mellor-Crumney and Scott 1991].


6.11 Exercises
6.1. You have two alternatives of similar price for buying a computer:
• a 4-processor system with 1 Gbyte of RAM and 20 Gbytes of disk, but which cannot be upgraded further
• a 2-processor system with 256 Mbytes of RAM and 10 Gbytes of disk, which can be expanded to 20 processors, 100 Gbytes of disk and 16 Gbytes of RAM
• adding one extra processor to the 2-processor system costs approximately the same as the 4-processor system, if you figure in a trade-in of last year's model

a. Discuss the economics of upgrading the 2-processor system over the next 3 years, versus replacing the 4-processor system every year by a faster one.
b. Discuss the impact of CPU learning curves on the usefulness of the remainder of the hardware on the 2-processor system, as new upgrades are bought in the future.
6.2. Do questions 7.6, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.10, 8.11, 8.13, 8.14, 8.16, 8.17, 8.20.


References
[August et al. 1998] D August, D Connors, S Mahlke, J Sias, K Crozier, B Cheng, P Eaton, Q Olaniran and W Hwu. Integrated Predication and Speculative Execution in the IMPACT EPIC Architecture, Proc. ISCA '25: 25th Int. Symp. on Computer Architecture, Barcelona, June-July 1998, pp 227-237.
[Cheriton et al. 1991] DR Cheriton, HA Goosen and PD Boyle. ParaDiGM: A Highly Scalable Shared-Memory Architecture, Computer, vol. 24 no. 2, February 1991, pp 33-46.
[Cheriton et al. 1993] DR Cheriton, HA Goosen, H Holbrook and P Machanick. Restructuring a Parallel Simulation to Improve Cache Behavior in a Shared-Memory Multiprocessor: The Value of Distributed Synchronization, Proc. 7th Workshop on Parallel and Distributed Simulation, San Diego, May 1993, pp 159-162.
[Crisp 1997] R Crisp. Direct Rambus Technology: The New Main Memory Standard, IEEE Micro, vol. 17 no. 6, November/December 1997, pp 18-28.
[Dulong 1998] C Dulong. The IA-64 Architecture at Work, Computer, vol. 31 no. 7, July 1998, pp 24-32.
[IMT 1997] Information Mass-Transit web site, continuously updated.
[Jacob and Mudge 1998] B Jacob and T Mudge. Virtual Memory: Issues of Implementation, Computer, vol. 31 no. 6, June 1998, pp 33-43.
[Johnson 1995] EE Johnson. Graffiti on the Memory Wall, Computer Architecture News, vol. 23 no. 4, September 1995, pp 7-8.
[Lenoski et al. 1990] D Lenoski, J Laudon, K Gharachorloo, A Gupta and J Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor, Proc. 17th Int. Symp. on Computer Architecture, Seattle, WA, May 1990, pp 148-159.
[Lewis 1996a] T Lewis. The Next 10,000² Years: Part I, Computer, vol. 29 no. 4, April 1996, pp 64-70.
[Lewis 1996b] T Lewis. The Next 10,000² Years: Part II, Computer, vol. 29 no. 5, May 1996, pp 78-86.


[Machanick 1996] P Machanick. The Case for SRAM Main Memory, Computer Architecture News, vol. 24 no. 5, December 1996, pp 23-30.
[Machanick 2000] P Machanick. Scalability of the RAMpage Memory Hierarchy, South African Computer Journal, no. 25, August 2000, pp 68-73 (longer version: Technical Report TR-Wits-CS-1999-3, May 1999).
[Machanick and Salverda 1998a] P Machanick and P Salverda. Preliminary Investigation of the RAMpage Memory Hierarchy, South African Computer Journal, no. 21, August 1998, pp 16-25.
[Machanick and Salverda 1998b] P Machanick and P Salverda. Implications of Emerging DRAM Technologies for the RAMpage Memory Hierarchy, Proc. SAICSIT '98, Gordon's Bay, November 1998, pp 27-40.
[Machanick et al. 1998] P Machanick, P Salverda and L Pompe. Hardware-Software Trade-Offs in a Direct Rambus Implementation of the RAMpage Memory Hierarchy, Proc. ASPLOS-VIII: Eighth Int. Conf. on Architectural Support for Programming Languages and Operating Systems, San Jose, October 1998, pp 105-114.
[Mellor-Crumney and Scott 1991] JM Mellor-Crumney and ML Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, ACM Trans. on Computer Systems, vol. 9 no. 1, February 1991, pp 21-65.
[Nagle et al. 1993] D Nagle, R Uhlig, T Stanley, S Sechrest, T Mudge and R Brown. Design Trade-Offs for Software Managed TLBs, Proc. Int. Symp. on Computer Architecture, May 1993, pp 27-38.
[Patterson and Ditzel 1980] DA Patterson and DR Ditzel. The case for the reduced instruction set computer, Computer Architecture News, vol. 8 no. 6, October 1980, pp 25-33.
[RAMpage 1997] RAMpage web site, continuously updated.
[Rivers et al. 1997] JA Rivers, GS Tyson, TM Austin and ES Davidson. On High-Bandwidth Data Cache Design for Multi-Issue Processors, Proc. 30th IEEE/ACM Int. Symp. on Microarchitecture, December 1997.
[Wall 1991] DW Wall. Limits of Instruction-Level Parallelism, Proc. 4th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, April 1991, pp 176-188.


[Wulf and McKee 1995] WA Wulf and SA McKee. Hitting the Memory Wall: Implications of the Obvious, Computer Architecture News, vol. 23 no. 1, March 1995, pp 20-24.

