PRISM: An Integrated Architecture for Scalable Shared Memory

Kattamuri Ekanadham, Beng-Hong Lim, Pratap Pattnaik and Marc Snir
IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598

{eknath, bhlim, pratap, snir}@watson.ibm.com

Abstract

This paper describes PRISM, a distributed shared-memory architecture that relies on a tightly integrated hardware and operating system design for scalable and reliable performance. PRISM's hardware provides mechanisms for flexible management and dynamic configuration of shared-memory pages with different behaviors. As an example, PRISM can configure individual shared-memory pages in both CC-NUMA and Simple-COMA styles, maintaining the advantages of both without incorporating any of their disadvantages. PRISM's operating system is structured as multiple independent kernels, where each kernel manages the resources on its local node. PRISM's system structure minimizes the amount of global coordination needed to manage shared memory: page faults do not involve global TLB invalidations, and pages can be replicated and migrated without requiring global coordination. The structure also provides natural fault containment boundaries around each node because physical addresses do not address remote memory directly. We simulate PRISM's hardware, cache coherence protocol, and memory management algorithms. Results from SPLASH applications on the simulated machine demonstrate a tradeoff between CC-NUMA and Simple-COMA styles of memory management. Adaptive, run-time policies that take advantage of PRISM's ability to dynamically configure shared-memory pages with different behaviors significantly outperform pure CC-NUMA or Simple-COMA configurations and are usually within 10% of optimal performance.

1 Introduction

The quest for scalability has led shared-memory multiprocessor architectures to evolve from centralized, bus-based systems to distributed, network-based systems [15, 1, 11, 17, 14]. These distributed shared-memory (DSM) systems are typically composed of workstation-class compute nodes, each with a coherence controller, connected via a high-speed, low-latency network. Together, the coherence controller and operating system manage the distributed memory as a single, globally-addressable shared memory. The operating system manages shared-memory allocation and mapping, and the coherence controller ensures that shared-memory cache lines remain globally coherent.

The memory management structure has a strong influence on the performance, scalability, and reliability characteristics of DSM systems. As an example, compare CC-NUMA [15] and Simple-COMA [22] (S-COMA) systems. In CC-NUMA systems, a physical address serves as a global address for naming shared data: it directly addresses a unique memory location in one of the nodes. In S-COMA systems, a physical address refers exclusively to its local node's memory, and a separate global address space exists for naming shared data. Space for shared data must be allocated in local memory before the data can be accessed. Neither structure is ideal. From a performance standpoint, CC-NUMA systems perform poorly when the working set exceeds processor cache sizes and data has to be repeatedly refetched from remote memory. S-COMA systems suffer from poor page utilization under sparse access patterns and from excessive paging when the working set exceeds the size of local memory. From a reliability standpoint, the use of physical addresses as global names makes CC-NUMA systems susceptible to wild writes from any faulty node in the system. S-COMA systems manage each node's physical memory independently from the other nodes and provide natural fault containment boundaries around each node.

This paper introduces PRISM, a scalable shared-memory architecture that relies on an integrated hardware and operating system structure for both scalable and reliable performance. PRISM uses off-the-shelf hardware and OS components as much as possible, and provides only features that are necessary for scalability and reliability. The hardware provides flexible mechanisms for page replication, page migration, global naming, and cache coherence. The design philosophy is to provide hardware mechanisms that are managed in software with minimal global coordination. PRISM's coherence controller can be implemented as a minor addition to conventional DSM hardware. The controller provides a facility to take different coherence actions on a per-page basis. PRISM's operating system is structured as multiple independent kernels, one per node, and each kernel can dynamically and independently determine the behavior of each local page. Each kernel may be an instance of a full operating system like Unix or Windows-NT, with minor modifications to the virtual memory manager. As a demonstration of its flexibility, PRISM allows each kernel to manage a shared page independently in either CC-NUMA or S-COMA style, providing the advantages of both styles and avoiding their disadvantages.

In addition to flexible management of globally shared memory, PRISM provides the following novel features that enhance scalable performance and fault containment:

- Localized memory allocation and translation. Each node's kernel manages a completely node-private translation between virtual and physical addresses. This allows a standard virtual memory manager to run on each node and reduces the need for global TLB invalidations.

- Lazy page migration. The home node of a page contains the directory information for the cache lines within that page. Non-home nodes are called client nodes. Lazy page migration allows home nodes to be migrated without invalidating address translations. Client nodes update their knowledge of the home node's location only when they access the page. This provides efficient page migration.

- Optional mapping of shared-memory pages at client nodes into local or remote memory. A page frame (we use the term page to refer to a virtual or global page, and the term page frame to refer to a physical page) that maps to local memory provides S-COMA behavior. A page frame that maps (indirectly via global addresses) to memory at a remote node provides CC-NUMA behavior.

- User-controlled granularity of virtual-to-global address binding. This allows the cost of global coordination due to global address binding to be amortized over large user-defined regions instead of small, fixed-size pages. Once bound, address translations are maintained locally.

Results from simulating PRISM on a set of eight SPLASH applications show a significant performance difference between pages mapped in CC-NUMA and S-COMA styles. Most of the difference can be attributed to capacity effects (due to finite cache sizes or limited associativity) in both the processor cache and the S-COMA memory-based page cache. The results also show that simple adaptive run-time algorithms for choosing page types allow PRISM to achieve the best possible performance.

The rest of this paper is organized as follows. Section 2 compares memory management structures in existing DSM systems. Section 3 describes PRISM's system architecture and how it provides the features listed above. Section 4 presents experimental data that demonstrate the advantages of PRISM's flexible hardware and OS structure. Section 5 describes related work, and Section 6 summarizes and concludes the paper.

2 DSM memory management

This section contrasts the different memory management structures in existing DSM systems and their tradeoffs. The comparison motivates the PRISM architecture and allows us to highlight the salient differences of PRISM from existing DSM architectures. We assume a generic DSM architecture composed of a network of compute nodes with a coherence controller attached between the memory bus and network interface at each node. The primary differences between the architectures lie in the coherence controller and the operating system's virtual memory management subsystem.

Figure 1: Coherence controller architecture and memory mappings in CC-NUMA multiprocessors.

2.1 Single, global physical address space

Both CC-NUMA [15] and COMA [11, 12] architectures provide the operating system with the abstraction of a single, global physical address space for shared memory. Consider a typical CC-NUMA system. Figure 1 illustrates a CC-NUMA coherence controller and the address translations. The controller's protocol dispatcher receives coherence traffic on the memory bus and the network, and dispatches protocol handlers to the finite state machine (FSM). The directory maintains the state and a list of nodes sharing each line of shared memory. An optional remote access cache provides an extra level of caching for remote data.

The coherence controllers collectively provide a physical address space that is mapped across all the nodes. A physical address encodes the location of the home node that contains the memory, as well as the directory for the cache line that it addresses. The physical address space is used as a global namespace for shared data. To share memory, two processors must use virtual addresses that translate to the same physical address.

In principle, a global physical address space allows a clean separation between the hardware and the OS in terms of memory management. One can run a CC-NUMA or COMA machine with a standard symmetric multiprocessor (SMP) operating system. However, scalability and fault containment problems constrain the system size. Maintaining a globally consistent mapping of virtual to physical addresses involves expensive global coordination whenever page mappings change on any node. Further, performance is sensitive to page placement [26]. Allowing physical addresses to directly access memory in any node makes the system a monolithic failure unit. Ultimately, the OS must be significantly restructured for scalability and fault containment [7, 25].

Figure 2: Coherence controller architecture and memory mappings in S-COMA multiprocessors.

Figure 3: Coherence controller architecture and memory mappings in PRISM.
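To make the addressing scheme concrete, the sketch below shows how a coherence controller might decode a CC-NUMA physical address into a home node and a directory index. The field widths, structure, and helper name are hypothetical, chosen only to illustrate the organization described above, not taken from any particular CC-NUMA implementation.

    #include <stdint.h>

    /* Illustrative CC-NUMA physical address layout (field widths are
     * hypothetical): the high-order bits name the home node; the rest
     * locate the line in that node's memory and index its directory. */
    #define NODE_SHIFT 36u                /* bits above 36 = home node id */
    #define LINE_SHIFT 7u                 /* 128-byte cache lines         */

    struct numa_decode {
        unsigned home_node;               /* node holding memory and directory */
        uint64_t line_index;              /* directory index at that node      */
    };

    /* Every cache miss to shared memory is routed by this decoding, so the
     * physical address itself is the global name for the data. */
    static struct numa_decode decode_ccnuma_paddr(uint64_t paddr)
    {
        struct numa_decode d;
        d.home_node  = (unsigned)(paddr >> NODE_SHIFT);
        d.line_index = (paddr & ((1ull << NODE_SHIFT) - 1)) >> LINE_SHIFT;
        return d;
    }

Because the home node is baked into every physical address, moving a page to a different node means changing virtual-to-physical translations on every node that maps it, which is one source of the global coordination discussed above.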

2.2 Multiple, local physical address spaces

S-COMA [22] and DVSM [16, 2] systems use multiple, locally-managed physical address spaces. Consider an S-COMA system. Figure 2 illustrates an S-COMA coherence controller and the address translations. Its structure is similar to that of a CC-NUMA controller, but with the addition of fine-grain access tags and a global page table. Part of local memory is managed as a page cache that provides an extra level of caching for shared data. The fine-grain tags maintain the state of each line in the page cache. The global page table translates between local physical addresses and global addresses.

The coherence controllers and OS kernels collectively maintain a global address space for naming shared data. To share memory, two processors must use virtual addresses that translate to the same global address. The OS binds virtual pages to global pages as page faults occur [21]. It allocates a local page frame and informs the coherence controller of the mapping between the frame and the faulting global page. The controller uses this mapping to translate physical addresses to global addresses when communicating with remote nodes.

S-COMA systems use a naturally distributed OS organization where a separate kernel on each node manages that node's physical memory resources. Node-private physical addresses permit local management of virtual-to-physical address translations (although global coordination is still required for virtual-to-global address translations). Because pages are backed by local frames, performance is less sensitive to page placement. The disadvantage of S-COMA over CC-NUMA is more coherence controller hardware and higher memory consumption, because local page frames must be allocated for accessing global data.
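The translation step added by the global page table can be illustrated with a small sketch. The structure and function names below are hypothetical; they merely mirror the S-COMA miss handling just described (local frame to global page to home node).

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical per-frame entry in an S-COMA controller's global page
     * table; field and function names are illustrative only. */
    struct gpt_entry {
        bool     bound;          /* set by the OS at page-fault time      */
        uint64_t global_page;    /* global name of the page in this frame */
        unsigned home_node;
    };

    extern struct gpt_entry global_page_table[];   /* indexed by frame number */
    void send_coherence_request(unsigned home_node, uint64_t global_addr);

    /* When the fine-grain tag says the local copy is not usable, the
     * controller translates the local physical address to a global address
     * and asks the home node for the line. */
    static void scoma_remote_fetch(uint64_t paddr, unsigned page_shift)
    {
        uint64_t frame  = paddr >> page_shift;
        uint64_t offset = paddr & ((1ull << page_shift) - 1);
        struct gpt_entry *e = &global_page_table[frame];

        if (!e->bound)
            return;   /* private local memory: the local bus protocol prevails */

        send_coherence_request(e->home_node,
                               (e->global_page << page_shift) | offset);
    }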

3 PRISM

This section describes PRISM's hardware and operating system structure, focusing on features that are new and different from existing systems. It demonstrates how these features allow PRISM to provide pages with both CC-NUMA and S-COMA behavior and avoid negative aspects of either by selecting page types appropriately. It describes the steps involved in mapping a region of globally shared memory and accessing that memory. It also describes how PRISM implements lazy page migration.

Figure 4: Page Frame Modes in PRISM and dispatching protocol handlers based on the mode.

3.1 High-level system overview

Figure 3 illustrates PRISM's coherence controller and address translations. It manages memory using multiple, local physical address spaces. As such, the coherence controller is similar to an S-COMA controller, but with the following additional features that allow it to avoid the negative aspects of S-COMA systems:

- It dispatches protocol handlers based on page frame addresses.

- It handles imaginary physical addresses that do not directly address any memory, local or remote.

- It avoids encoding home node locations in global addresses.

Each node runs an independent OS kernel. Upon a page fault, the faulting node's kernel allocates either a real or an imaginary page frame and informs its coherence controller of the mapping between the frame and the faulting global page. Upon a cache miss, the coherence controller takes different actions based on the physical address of the missing line. A miss to a real page frame proceeds as in an S-COMA system. A miss to an imaginary page frame causes the controller to translate the physical address to a global address and send a protocol message to the home node.

Clearly, PRISM can provide S-COMA style pages. The imaginary page frames additionally allow PRISM to map a page from a remote node without allocating local memory, thus emulating CC-NUMA behavior. We call such pages LA-NUMA (for Locally-Addressable NUMA) pages. The choice to use a real or imaginary page frame for a particular global page can be made completely locally and independently of the other nodes of the system, allowing PRISM to dynamically configure a page with CC-NUMA or S-COMA behavior. The policy for selecting page frame types can be controlled by the application or determined by the operating system, as sketched below.
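A minimal sketch of that per-node decision, under the assumption that the kernel tracks a local page-cache pool and notifies the controller through a simple binding call, follows; all identifiers are hypothetical.

    /* Hedged sketch of the per-node page-fault path suggested above: the
     * kernel picks a real (S-COMA) or imaginary (LA-NUMA) frame purely
     * locally and tells its coherence controller about the binding.
     * All identifiers are hypothetical. */
    typedef enum { FRAME_SCOMA, FRAME_LANUMA } frame_kind_t;

    struct fault_ctx {
        unsigned long global_page;   /* global page backing the faulting virtual page */
        int is_home;                 /* is this node the page's home?                  */
    };

    unsigned long alloc_real_frame(void);       /* from the local page-cache pool */
    unsigned long alloc_imaginary_frame(void);  /* no memory behind it            */
    int  page_cache_has_room(void);
    void controller_bind(unsigned long frame, unsigned long global_page,
                         frame_kind_t kind);

    static unsigned long handle_shared_fault(struct fault_ctx *f)
    {
        /* LA-NUMA frames are never used at the home node of a page. */
        frame_kind_t kind = (f->is_home || page_cache_has_room())
                                ? FRAME_SCOMA : FRAME_LANUMA;

        unsigned long frame = (kind == FRAME_SCOMA) ? alloc_real_frame()
                                                    : alloc_imaginary_frame();
        controller_bind(frame, f->global_page, kind);
        return frame;   /* entered into the node-private page table */
    }

The point of the sketch is that nothing in this path consults another node: the real/imaginary choice, the frame allocation, and the controller binding are all local.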

In general, PRISM supports multiple page frame types with different behaviors and can provide more than just S-COMA and CC-NUMA behavior. For example, a frame may be designated as a synchronization page that invokes a locking protocol for accesses to that page. This paper focuses on using PRISM to provide scalable CC-NUMA and S-COMA behavior as a concrete demonstration of its flexibility. The rest of this section describes implementation details of PRISM's hardware and operating system for supporting shared memory.

3.2 Hardware support

Page Frame Modes. A mode is associated with each page frame. The mode dictates how the controller handles accesses to that frame, as well as the coherence protocol to run. A simple implementation encodes the mode within the higher order bits of the physical address so that the coherence controller can determine the action to take as soon as the physical address appears on the memory bus. A more sophisticated implementation allows the mode of a frame to be set dynamically by storing the mode in the frame's Page Information Table (PIT) entry. The following modes are of primary interest. We assume a split-phased, fully-pipelined memory bus so that bus retries do not block further memory operations from proceeding.

- Local Mode frames correspond to private local memory that is accessible only to processes running on that node. The coherence controller takes no action on transactions with these addresses and lets the local bus protocol prevail.

- S-COMA Mode frames correspond to memory that is used as a page cache for globally shared pages. The controller maintains a two-bit tag for each cache line in a frame in this mode, as in [20, 22]. The tag encodes the following states, which cause the corresponding controller actions:

  T (Transit): The coherence controller asserts a bus retry for all bus transactions involving that cache line.
  E (Exclusive): The line is exclusive to this node and no other node has a copy. The coherence controller allows all accesses within the node to proceed under the local bus protocol.
  S (Shared): Other nodes may have a copy of the line. The coherence controller stalls any write access to the line by issuing either a bus intervention or a retry, and coordinates with other nodes to obtain an exclusive copy of the line.
  I (Invalid): The controller stalls any access to that line by issuing a bus intervention or a retry, and coordinates with other nodes to obtain a shared or exclusive copy of the line.

- LA-NUMA Mode frames are imaginary frames that do not directly address any memory. No fine-grain tags are needed since there is no need to inhibit local memory from providing data for these frames. The controller acts as the memory backing an LA-NUMA frame, and uses the translation and home information in the PIT to communicate with the home node whenever it must supply data or grant exclusive access to a local processor.

- Command Mode frames implement a memory-mapped command interface between the local processors and the coherence controller. The OS uses this command interface during paging activity to communicate page translation information. This command interface may also be used to provide a low-overhead message-passing interface to software.

The combination of S-COMA and LA-NUMA page frames allows PRISM to incorporate the positive features of S-COMA and CC-NUMA memory. To configure PRISM as a conventional CC-NUMA machine with a conventional single-image OS structure, PRISM's design may be extended with a true CC-NUMA page frame mode. Accesses to CC-NUMA page frames bypass the PIT, and physical addresses directly identify memory locations at the home node.

Page Information Table. The coherence controller translates between physical and global addresses using the Page Information Table, which contains an entry for each frame, indexed by frame number. As Figure 5 illustrates, each entry contains the following fields: a global page number, home node information, and the corresponding real page frame at the home node. Translating from physical to global addresses is a simple table lookup.

Reverse translation from global addresses to physical addresses is harder. We cannot use the global page number to index a table because that would require an excessively large table. Instead, we use standard OS techniques for implementing sparse address translations, such as tree-structured or hashed tables. To optimize reverse translation, we cache home node frame numbers in the PIT entries of client pages, and cache client node frame numbers in the directory. These cached frame numbers are updated when client and home nodes include their local frame numbers as part of normal paging and coherence messages. This allows coherence messages to include a guess of the destination frame number based on the most recent updates. When processing a coherence message, the coherence controller uses the guessed frame number to index the PIT and checks whether the entry's global page number matches the global page in the message. If so, reverse translation is complete. Otherwise, the controller uses a hash table to search the PIT for the correct translation (see the sketch at the end of this subsection).

Figure 5: Page Information Table and directory organization in PRISM. The home frame# field of the PIT entry at the home node of a page contains redundant information, and its space could be reused for part of the cache line directory.

The PIT may be used to implement a memory firewall between nodes of the system for S-COMA and LA-NUMA frames. This is a key feature for enabling fault containment on DSM systems [24]. Since all memory accesses from remote nodes have to be checked against the PIT, extending a PIT entry to include a capability list makes it possible to filter out wild writes to S-COMA and LA-NUMA pages from remote nodes.
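A hedged sketch of the PIT entry and of the reverse-translation fast path described above follows; the field layout and helper names are assumptions made for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical layout of a Page Information Table entry (Figure 5):
     * one entry per local frame, indexed by frame number. */
    typedef struct {
        uint64_t global_page;     /* global page bound to this frame        */
        uint16_t home_node;       /* home node information                  */
        uint32_t home_frame;      /* cached guess of the frame at the home  */
        bool     valid;
    } pit_entry_t;

    extern pit_entry_t pit[];                       /* indexed by frame number */
    uint32_t pit_hash_lookup(uint64_t global_page); /* slow path: hash search  */

    /* Reverse translation (global page -> local frame) for an incoming
     * coherence message: try the frame number guessed by the sender first,
     * then fall back to a hash search of the PIT. */
    static uint32_t reverse_translate(uint64_t global_page, uint32_t guessed_frame)
    {
        pit_entry_t *e = &pit[guessed_frame];
        if (e->valid && e->global_page == global_page)
            return guessed_frame;                   /* guess was up to date  */
        return pit_hash_lookup(global_page);        /* search the PIT        */
    }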

3.3 Operating system support

PRISM uses a distributed OS structure with multiple independent kernels, each of which manages the resources within a single node. The kernels cooperate to provide globally shared memory segments, but do not themselves rely on the shared memory. This structure promotes scalability and fault isolation between the nodes. If a node fails, the rest of the nodes may continue running, although applications using resources on the failed node may be terminated.

The OS cooperates with the coherence controller to manage and export shared memory. The novel features are: i) user-controlled binding of variable-sized virtual-address regions to global segments, and ii) the association of modes with shared-memory pages and the page frames backing them. These features are enabled and exported by a set of OS services that fall into three categories: Global Naming and Binding, Page Mode Binding, and External Paging.

Figure 6: Translating between virtual, physical, and global addresses.

Global Naming and Binding. Like S-COMA systems, PRISM provides a system-wide global address space as well as virtual and physical address spaces. Figure 6 illustrates the address formats for the address spaces and how they relate to each other. Virtual addresses are composed of a virtual segment identifier (VSID), a page number, and an offset. Physical addresses are composed of a frame number and an offset. Global addresses are composed of a global segment identifier (GSID), a page number, and an offset.

Global binding occurs whenever virtual addresses are attached to global addresses, and involves global coordination. In CC-NUMA systems, global binding occurs whenever a virtual page is mapped to a page frame at page-fault time, because physical addresses serve as global addresses. In existing S-COMA systems, global binding occurs whenever a virtual page is mapped to a global page, also at page-fault time [21]. In contrast, global binding occurs in PRISM when the application explicitly attaches a virtual segment to a global segment, at an arbitrary granularity. This allows software to dictate when global binding occurs, and to amortize the cost of global binding over large user-defined regions.

In PRISM, an application gains access to shared memory by making system calls for creating global segments and attaching virtual address regions to global segments. The system calls are globalized versions of System V shared memory [4], e.g., shmget and shmat. In order to allow multiple processes of a parallel application to live within a single globally-shared address space, the application loader may create global segments and attach the virtual address space of each of its processes to the global segments at identical virtual addresses. Processes may also share memory by explicitly calling the globalized System V shared-memory routines.

Page Mode Binding. Recall that the coherence controller takes different actions based on page frame modes. The OS maintains a pool of free page frames for each mode. During a page fault, the OS allocates a page frame from one of the pools, depending on the type of the faulting page. For virtual pages attached to node-private, non-global segments, the OS allocates local-mode frames. For virtual pages attached to global segments, the OS has a choice between S-COMA and LA-NUMA mode frames, with the restriction that LA-NUMA frames may not be used at the home node of a page.

The OS associates a page mode with each virtual page attached to a global segment. The page mode determines which free page frame pool to use at page-fault time. The best choice of page modes depends on application and system characteristics. For the choice between S-COMA and LA-NUMA modes for a shared page at a client node, the results of Section 4 suggest that a good run-time strategy is to allocate pages in S-COMA mode as long as there is space in the page cache. If the page cache overflows, one would allocate the most heavily used pages in S-COMA mode and the rest in LA-NUMA mode. The OS also provides a system call for the user to suggest the desired mode. The mode of a page may be changed dynamically and independently at each node by paging out the page and setting its mode to the desired value. A subsequent page fault on that page will then allocate a page frame based on the new mode.

External Paging. The OS needs to communicate with the coherence controller and remote nodes during page faults and page-outs. These external paging actions may be implemented outside of the kernel if the OS provides an external paging interface like Mach's external pager [28].

When a page fault occurs at the home node, the kernel allocates and initializes a page frame and informs the local coherence controller about the new binding between the page frame and the global page associated with the faulting page. The controller inserts the translation into the PIT and initializes the page's fine-grain tags to Exclusive.

When a page fault occurs at a client node, the kernel sends a message to the home node to ensure that the global page is paged in at the home. The home node maps in the page if necessary, and adds the client node to a list of clients for the page. It then returns the home frame number to the client. Upon receiving a response from the home node, the client node's kernel informs its controller about the new binding between the global page and the page frame, as well as the home node and home frame number for that page. The controller inserts the translation into the PIT and initializes the page's fine-grain tags to Invalid.

During a home node page-out, the kernel sends a message to all clients for the page requesting them to page out their corresponding client page frames and write back any modified data to the home. Once all clients have acknowledged the page-out request, the home node writes the data out to disk and removes the translation from the PIT. During a client node page-out, the client node's kernel writes back any modified data to the home node, informs the home node's kernel of the page-out, and removes the translation from the PIT.

We ensure that a page is paged in at the home when handling a client-node page fault in order to prevent a client-node cache miss from causing a page fault at a remote home node. Otherwise, a cache miss may take an arbitrary amount of time to complete, causing the memory bus to time out and generate an error. Deadlock is also possible, e.g., if the page fault at the home node in turn requires the client node to page out a page, but the processors at the client node are stalled on cache misses to the same page and unable to service the page-out request.

As an optimization to reduce the number of page-in request messages from client to home, after the initial page fault, client nodes can set a home-page-status flag that denotes whether a page is mapped in at the home. As long as the flag is set, subsequent page faults on the same page do not need to contact the home node. When the home node unmaps a page, it requests all client nodes to reset that page's status flag. The home may also amortize the cost of a page-in request message by paging in multiple pages at once and informing the client to set the flags for those pages.
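The client-node page-fault path described under External Paging can be summarized in a short sketch; the message format and helper names below are hypothetical.

    /* Hedged sketch of the client-node page-fault path described above;
     * message formats and helper names are hypothetical. */
    struct pagein_reply { unsigned long home_frame; };

    void send_pagein_request(unsigned home_node, unsigned long global_page,
                             struct pagein_reply *reply);
    unsigned long alloc_frame_for_mode(int mode);   /* S-COMA or LA-NUMA pool */
    void pit_insert(unsigned long frame, unsigned long global_page,
                    unsigned home_node, unsigned long home_frame);
    void set_fine_grain_tags_invalid(unsigned long frame);

    static unsigned long client_page_fault(unsigned long global_page,
                                           unsigned home_node, int mode)
    {
        struct pagein_reply reply;

        /* Make sure the page is mapped in at the home, so that a later cache
         * miss can never turn into a page fault at the remote home node.
         * (The home-page-status flag described above lets subsequent faults
         * on the same page skip this message.) */
        send_pagein_request(home_node, global_page, &reply);

        unsigned long frame = alloc_frame_for_mode(mode);
        pit_insert(frame, global_page, home_node, reply.home_frame);
        set_fine_grain_tags_invalid(frame);
        return frame;   /* the kernel maps the faulting virtual page to it */
    }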

3.4 Allocating and accessing shared memory

We now fit all the pieces together and summarize how shared memory is allocated, mapped, and accessed in PRISM. The three major steps are listed below; a user-level sketch of the allocation interface follows the list.

1. Allocating and attaching a segment. The user allocates a global segment by calling shmget() with a globally unique key. The local kernel sends a message to a global IPC server that checks whether the segment has been previously allocated. If not, the IPC server requests all home nodes for the segment to create the segment locally. Once the segment has been allocated at all the home nodes, the IPC server returns a global segment id (gsid) that identifies the segment. The user attaches and gains addressability to the segment by calling shmat(gsid, ...). The kernel sends a message to the global IPC server to increment the attach count for the segment, and sets up a local mapping between the calling process' virtual address space and the global segment.

2. Paging in a page. The first access to a newly attached page encounters a page fault. The page fault handler determines the global page and the home node for the faulting virtual page using information set up during shmat, and contacts the home node to map in the page. The home node records the faulting node as a client of the page. The handler allocates a local page frame and informs the local coherence controller to insert the page frame and global page information into its PIT and set the frame's fine-grain tags to Invalid. At this point, all the translations between the global, virtual, and physical addresses are set up. Memory accesses to the page can now occur at hardware speeds.

3. Caching a line of data. After the page fault, the processor re-executes the memory access and a cache miss occurs. The memory request is placed on the bus. The local coherence controller looks up the fine-grain tag for the cache line and finds it to be Invalid. It changes the tag to Transit and asserts a bus intervention to inhibit local memory from supplying the data. The local controller reads the PIT to determine the global address and home for the cache line, and sends a request for the line to the home using the global address. The home node's controller translates the global address to its local physical address, performs the necessary coherence actions, and returns the data to the requester. The local controller then completes the bus transaction; the data is cached at the requesting processor and also copied into local memory. The fine-grain tag for the line is changed to the appropriate state.
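From the application's point of view, the sequence above looks like ordinary System V shared memory. The example below is a minimal user-level sketch; the key and segment size are made up, and the assumption that PRISM's globalized calls keep the standard shmget/shmat entry points unchanged is ours, based on the text.

    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    /* User-level view of steps 1-3: create and attach a global segment, then
     * access it like ordinary memory. The key and size are hypothetical. */
    int main(void)
    {
        key_t key = 0x5152;                      /* hypothetical global key      */
        int   id  = shmget(key, 1 << 20, IPC_CREAT | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        double *a = shmat(id, NULL, 0);          /* attach: global binding point */
        if (a == (void *)-1) { perror("shmat"); return 1; }

        a[0] = 3.14;     /* first touch: page fault and PIT binding, after which
                            lines are fetched and cached at hardware speed       */
        printf("%f\n", a[0]);
        return 0;
    }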

3.5 Lazy page migration

So far, we have assumed that global pages have fixed homes where the directory information is kept. However, it is sometimes desirable to migrate the home of a page. If a home node is not accessing the page, keeping the page there increases the memory pressure at that node. Coherence traffic may also be reduced by moving the home to one of the nodes that is accessing the page frequently. If the home node needs to reclaim a page frame, it may be more efficient to migrate the page to another node instead of writing it to disk. In existing hardware DSM systems, page migration is a time-consuming global operation that involves unmapping the page and performing a global page table update and TLB invalidation.

Lazy page migration is a scheme for migrating a home node efficiently, without the need for global coordination. The main features that enable lazy page migration in PRISM are i) virtual-to-physical address mappings are private to a node, and ii) global addresses do not encode the location of the home node. These features allow the home to migrate without affecting address translations. PRISM's page migration scheme assigns a fixed static home and a migratable dynamic home to each page. The dynamic home keeps the directory entries and enforces coherence for cache lines of that page. The static home keeps track of the location of the dynamic home for the page and coordinates the migration of the dynamic home. The home node field of the PIT entry is extended to include a dynamic home as well as the static home identifier. To migrate a page, the static home coordinates only with the old and new dynamic homes to transfer the ownership of the page.

After migration, client PIT entries may still refer to the old dynamic home. Thus, coherence requests may use obsolete home node information and be directed to the wrong node. The misdirected request is simply forwarded to the static home node, which re-forwards it to the current dynamic home node. The final response to the requester identifies the current dynamic home. The requester updates its PIT entry so that subsequent requests will be sent to the correct dynamic home until it migrates again. To assist in determining when and where to migrate a home node, the coherence controller includes hardware counters for monitoring coherence traffic to each page. The SGI Origin2000 [14] maintains similar counters. Baylor et al. [5] provide more details on the lazy page migration algorithm and evaluate several home migration policies.
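A hedged sketch of how a node might route an incoming coherence request under lazy migration follows; all identifiers are hypothetical.

    /* Hedged sketch of request routing under lazy home migration, as
     * described above; all identifiers are hypothetical. */
    struct coh_request { unsigned long global_page; unsigned requester; };

    int  i_am_dynamic_home(unsigned long global_page);
    int  i_am_static_home(unsigned long global_page);
    unsigned static_home_of(unsigned long global_page);
    unsigned lookup_dynamic_home(unsigned long global_page); /* at the static home */
    void forward_request(unsigned node, struct coh_request *rq);
    void service_request(struct coh_request *rq);  /* reply names the current
                                                      dynamic home             */

    static void handle_coherence_request(struct coh_request *rq)
    {
        if (i_am_dynamic_home(rq->global_page)) {
            service_request(rq);          /* common case: directory is here */
        } else if (i_am_static_home(rq->global_page)) {
            /* The static home always knows where the dynamic home is. */
            forward_request(lookup_dynamic_home(rq->global_page), rq);
        } else {
            /* Misdirected request: the client used obsolete home information.
             * Bounce it via the static home; no translations are invalidated. */
            forward_request(static_home_of(rq->global_page), rq);
        }
    }

The reply identifies the current dynamic home, which the requester caches in its PIT entry, so later requests go directly to the right node until the page migrates again.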

4 Performance evaluation

This section presents experimental measurements from simulating a 32-processor PRISM system composed of eight SMP nodes with four PowerPC processors at each node. It demonstrates the tradeoffs between the different page modes and evaluates three simple run-time policies for selecting between them.

4.1 Experimental setup

We use a PowerPC version of Augmint [18], an execution-driven simulator, to construct architectural models for the L1 and L2 caches, the system memory, the PowerPC SMP memory bus and its coherence protocol, the coherence controller and the inter-node coherence protocol, and the network interface. The models also simulate the OS's memory management algorithms by directly incorporating code from the AIX virtual memory management subsystem. The page size is 4096 bytes.

Latency and contention are accounted for at all system resources except the processor internals and network switches. All cycle times are in terms of processor cycles. Model parameters are chosen to be representative of systems with 5–10 ns cycle times. The memory bus is a 16-byte-wide, fully-pipelined, split-transaction bus with separate address and data paths that operates at half the processor speed. The one-way, end-to-end network latency is 120 cycles. The coherence controller's directory is modeled as DRAM with an 8K-entry cache to speed up directory access. We assume a full-map directory. A directory cache hit takes 2 cycles, while a directory cache miss takes 22 cycles. The PIT is modeled as SRAM with a 2-cycle lookup time. PIT entries include home frame numbers, but the directory entries do not include client frame numbers. This optimizes reverse translation of global to physical addresses at home nodes, but not reverse translation for invalidation messages from the home to the client nodes. Reverse translation at client nodes uses a hash algorithm to search the PIT. Table 1 presents the uncontended cache miss latencies and paging overheads of the simulated system, as measured by a memory-latency microbenchmark.

Table 1: Cache miss latencies and page fault overheads.

    Memory Access Type                        Latency (cycles)
    L1 miss, L2 hit                           12
    Uncached, line in local memory            36
    Uncached, line in remote memory           573
    2-party read/write to a modified line     608
    3-party read/write to a modified line     866
    2-party write to shared line              608
    (3+n)-party write to shared line          1142+80n
    TLB miss                                  30
    In-core page fault, local home            2300
    In-core page fault, remote home           4400

Table 2: Application benchmark types and data sets.

    Application   Problem Description and Size
    Barnes        Hierarchical N-body; 8K particles, 4 iters
    FFT           FFT computation; 64K complex doubles
    LU            Blocked LU decomposition; 512x512 matrix, 16x16 blocks
    MP3D          Rarefied air flow simulation; 20,000 particles, 5 iters
    Ocean         Simulation of ocean currents; 258x258 ocean grid
    Radix         Radix sort; 1M integer keys, radix 1K
    Water-Nsq     O(n^2) water molecule simulation; 512 molecules, 3 iters
    Water-Spa     O(n) water molecule simulation; 512 molecules, 3 iters

4.2 Applications and machine configurations

To determine the tradeoffs between different page modes and the benefits of the flexibility of choosing page modes, we ran eight applications from the SPLASH-I and -II benchmark suites [27] under various page mode policies. The applications, listed in Table 2, are compiled using IBM's XLC compiler at optimization level -O2. The measurements are taken only during the parallel phase of these applications. Memory behavior is simulated for data accesses to both private and shared memory, and homes for shared-memory pages are assigned round-robin across the nodes.

The SPLASH benchmarks have small working set sizes. We find that with a 16-KB L1 cache and a 1-MB L2 cache, the working set fits within the L2 cache and the choice of page modes does not affect performance significantly. Communication-related coherence traffic (as opposed to capacity-related traffic) incurs the same cost in either S-COMA or LA-NUMA page modes and dominates under these conditions. To expose the effect of capacity-related misses that occur under more realistic problem sizes, we simulate an 8-KB L1 cache and a 32-KB L2 cache, as in [10].

We compare the application performance under the following page mode policies:

SCOMA: All shared pages are allocated in S-COMA mode and the page cache in memory is large enough to fit all the pages. This simulates an infinite page cache and represents an optimal configuration because there are no capacity misses to remote nodes.

LANUMA: All shared client pages are allocated in LA-NUMA mode. This results in the same performance characteristics as a CC-NUMA machine, except for the PIT lookup to translate between local and global addresses.

SCOMA-70: The page cache size is limited to 70% of the maximum number of client S-COMA page frames allocated on each node in the SCOMA configuration. Note that this is a static measure, and the number of client pages in active use may actually be smaller. Page-outs select the least-recently-used (LRU) S-COMA client page frame for replacement; the LRU order considers only accesses from local processors.

The following adaptive policies use a page cache size identical to SCOMA-70's, and allocate S-COMA client page frames until the page cache is full. Subsequent client page faults choose between allocating S-COMA and LA-NUMA frames (a sketch of the allocation path shared by the adaptive policies appears at the end of this subsection).

Dyn-FCFS: The OS maps new pages using LA-NUMA mode frames once the page cache is full.

Dyn-Util: The OS queries the local coherence controller for the S-COMA client page frame with the largest number of fine-grain tags in the Invalid state. Frames with fine-grain tags in the Transit state are skipped. The page currently mapped to that frame is unmapped and set to LA-NUMA mode so that future page faults on this page at this node use LA-NUMA frames. The newly-freed S-COMA frame is then reallocated to the faulting page. The policy targets pages that are either lightly utilized or used for communication data and changes them from S-COMA to LA-NUMA mode.

Dyn-LRU: As in SCOMA-70, the OS pages out the least-recently-used S-COMA client page frame. In addition, the LRU page is set to LA-NUMA mode so that future page faults on this page at this node use LA-NUMA frames. The newly-freed frame is then reallocated to the faulting page.

Figure 7: Execution time under different page modes, normalized to SCOMA execution time.

Table 3: Page consumption and utilization statistics.

    Application   Page Frames Allocated       Average Utilization
                  SCOMA      LANUMA           SCOMA      LANUMA
    Barnes        3376       616              0.478      0.576
    FFT           4888       976              0.276      0.829
    LU            2888       592              0.576      0.873
    MP3D          1520       304              0.198      0.677
    Ocean         8808       4056             0.732      0.956
    Radix         13352      2288             0.330      0.940
    Water-Nsq     1232       536              0.753      0.894
    Water-Spa     672        160              0.315      0.652
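The sketch below illustrates the client-side frame-selection logic shared by the three adaptive policies, under the assumption that the OS can query the controller for per-frame tag counts and track an LRU order over S-COMA client frames; all names are hypothetical.

    /* Hedged sketch of the client-node allocation decision shared by the
     * three adaptive policies: S-COMA frames are handed out until the page
     * cache fills, then each policy picks a victim (or none) differently.
     * Helper names are hypothetical. */
    typedef enum { POL_DYN_FCFS, POL_DYN_UTIL, POL_DYN_LRU } policy_t;

    int  page_cache_full(void);
    long frame_with_most_invalid_tags(void);   /* asks the controller; returns -1
                                                  if every candidate is in Transit */
    long least_recently_used_scoma_frame(void);
    void remap_page_to_lanuma(long frame);     /* unmap page, mark it LA-NUMA mode */
    long alloc_scoma_frame(void);
    long alloc_lanuma_frame(void);

    static long choose_client_frame(policy_t pol)
    {
        if (!page_cache_full())
            return alloc_scoma_frame();        /* all policies start this way */

        long victim = -1;
        switch (pol) {
        case POL_DYN_FCFS:                     /* never reclaims: the new page
                                                  simply becomes LA-NUMA       */
            return alloc_lanuma_frame();
        case POL_DYN_UTIL:
            victim = frame_with_most_invalid_tags();
            break;
        case POL_DYN_LRU:
            victim = least_recently_used_scoma_frame();
            break;
        }
        if (victim < 0)
            return alloc_lanuma_frame();
        remap_page_to_lanuma(victim);          /* evicted page will fault LA-NUMA */
        return victim;                         /* reuse the freed S-COMA frame    */
    }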

Table 4: The number of remote misses to shared memory that fetch data from a remote node in the static configurations, and the number of client page-outs in SCOMA-70.

    Application   Remote Misses                             Page-Outs
                  SCOMA       LANUMA       SCOMA-70         SCOMA-70
    Barnes        267651      3348808      295817           8457
    FFT           122338      186026       128850           11432
    LU            115433      991951       115441           510
    MP3D          279970      373081       289065           856
    Ocean         629986      8002014      1779388          22457
    Radix         254201      1394601      363404           15883
    Water-Nsq     111074      970560       521016           68290
    Water-Spa     40611       178713       69767            2949

4.3 Results and analysis

Figure 7 presents the execution time of each of the applications under different configurations, normalized to the execution time on SCOMA. SCOMA has the best performance since no page-outs are necessary and capacity-related misses are serviced by local memory. However, it also consumes the most memory. Table 3 shows the total number of page frames allocated and the average page frame utilization for both private and shared memory in the SCOMA and LANUMA configurations. Page frame utilization measures the fraction of cache lines within an allocated frame that is actually accessed. This is a static measure that does not take into account the frequency of accesses to each line. SCOMA uses significantly more pages and results in lower page utilization than LANUMA.

If there is significant memory pressure, SCOMA may suffer from the software analog of L1/L2 cache capacity misses: client pages have to be paged out. SCOMA-70 models the effect of paging activity due to SCOMA's higher memory consumption by restricting the page cache size to 70% of the number of client page frames allocated at each node in SCOMA. The difference between the number of page frames allocated by SCOMA and LANUMA in Table 3 yields the number of client frames.

Static configurations. SCOMA-70 significantly outperforms LANUMA in Barnes, LU, Ocean and Radix. The primary reason is the number of capacity-related cache misses. Table 4 shows the number of remote cache misses and client page-outs. Capacity-related misses account for the extra remote misses in LANUMA and SCOMA-70 over SCOMA. In these applications, LANUMA encounters a much larger number of capacity-related misses than SCOMA-70. SCOMA-70's page cache reduces the number of remote misses over LANUMA, but it comes at the price of increased paging activity. In Water-nsq, the numbers of remote misses in LANUMA and SCOMA-70 are comparable, but paging overhead tips the balance in favor of LANUMA.

Neither SCOMA-70 nor LANUMA is ideal when cache and memory resources are constrained. In LANUMA, evicted cache lines may need to be written back and then refetched from a remote home. In SCOMA-70, the same happens for cache lines in client pages that are paged out. The tradeoff depends on application access patterns, the L1/L2 cache size, and the S-COMA page cache size. Previous research [23, 29, 10] has also observed similar tradeoffs between CC-NUMA, COMA and S-COMA architectures.

In these experiments, SCOMA-70 generally performs better than LANUMA. These results are contrary to those in Falsafi and Wood [10], where CC-NUMA generally performs better than S-COMA. The reason for this difference lies in the size of the S-COMA page cache. We set the page cache size at 70% of the maximum number of client pages allocated by SCOMA, while Falsafi and Wood fix the page cache size at 320 KB. A 320-KB page cache would provide only 5%–25% of the necessary number of client pages for Barnes, FFT, LU, Ocean and Radix, and cause enough paging activity to favor LANUMA.

Table 5: The number of remote misses to shared memory that fetch data from a remote node, and the number of client page-outs, in the adaptive configurations. Page-outs do not occur in Dyn-FCFS.

    Application   Remote Misses                             Page-Outs
                  Dyn-FCFS    Dyn-Util     Dyn-LRU          Dyn-Util    Dyn-LRU
    Barnes        709684      1354715      807393           930         895
    FFT           122338      122364       124944           5558        5651
    LU            119378      116931       115441           509         509
    MP3D          280679      280413       283559           404         413
    Ocean         1253209     830618       3709983          1449        1464
    Radix         492143      495263       368294           3878        3883
    Water-Nsq     530448      814619       284861            855         873
    Water-Spa     81326       75038        102713            251         258

Adaptive configurations. In a number of applications, the static configurations of LANUMA and SCOMA-70 perform significantly worse than the ideal performance indicated by SCOMA. Recall that SCOMA does not encounter capacity-related misses. We have an opportunity to improve performance by blending page modes, either through explicit user/compiler selection or run-time adaptation. Here, we investigate the three adaptive run-time policies described above: Dyn-FCFS, Dyn-Util, and Dyn-LRU. All three policies initially service page faults by allocating S-COMA page frames, and differ only in the action they take once the page cache is full. Dyn-FCFS is implemented purely in the OS and does not require any hardware support. Dyn-LRU and Dyn-Util require hardware support in the coherence controller to monitor reference traffic to the local page cache. To avoid hardware support, the OS can also approximate Dyn-LRU by using standard pseudo-LRU algorithms developed for normal paging. Note that converting a page from S-COMA to LA-NUMA mode is a purely node-local decision; the run-time policy is invoked only at page faults and does not incur any overhead during normal operation.

Figure 7 shows that the adaptive configurations outperform the static configurations in most cases and perform quite well overall. Table 5 shows that the adaptive configurations significantly reduce both the number of remote misses relative to LANUMA and the number of client page-outs relative to SCOMA-70. Except for a few cases, they achieve within 10% of ideal performance.

However, Barnes on Dyn-Util and Dyn-LRU and Ocean on Dyn-Util perform significantly worse than on the static SCOMA-70 configuration. Table 5 reveals the cause of the poor performance. In these cases, the adaptive configurations experience significantly more remote misses than SCOMA-70. This occurs because reuse pages were converted to LA-NUMA mode, and cache capacity evictions caused the data on those pages to be repeatedly refetched from remote home nodes. The algorithms developed in [10] for determining when to switch from CC-NUMA to S-COMA may be used to convert such reuse pages back to S-COMA mode. In fact, we can combine the algorithms to implement an adaptive configuration that switches modes in both directions. Designing and evaluating more sophisticated algorithms is a subject for further research.

Impact of PIT translation overhead. Finally, we investigate the effect of implementing the Page Information Table in DRAM instead of SRAM. Recall that the PIT access time lies in the critical path of memory latency, and the cost of using SRAM may be a factor. We find that increasing the PIT access time from 2 cycles to 10 cycles increases execution time by less than 2% for LU, MP3D, Ocean, Radix, Water-nsq and Water-spa, by 5% for FFT, and by 16% for Barnes. The performance degradation on Barnes when using a DRAM PIT may be mitigated by including client frame numbers in the directory entries. This would reduce the number of PIT lookups necessary when servicing invalidation requests, albeit at the price of increased directory sizes.

In a pure CC-NUMA system, the physical address directly identifies the memory location, and the cache coherence protocol does not incur the overhead of accessing a PIT. However, the results above suggest that with a PIT implemented in SRAM, LA-NUMA pages will not significantly degrade application performance relative to CC-NUMA pages.

In summary, the results above demonstrate the performance benefits of PRISM's ability to dynamically and independently allocate a page in different modes at each node.

5 Related work

A customizable system that can adapt to application requirements has the potential to outperform fixed systems. Previous research has shown that the ability to customize cache coherence protocols [6, 13, 20] yields significant performance gains. More recent work on combining CC-NUMA and S-COMA systems [21, 10] considers the impact of the memory management structure on performance. However, their designs retain some negative aspects, e.g., global physical addresses and translations in CC-NUMA systems and increased memory consumption in S-COMA systems.

COMA systems suffer from memory consumption problems similar to S-COMA's. Dahlgren and Landin [9] reduce memory consumption in COMA systems by breaking cache inclusion, thus allowing data to be cached in a processor without requiring a copy to be present in its node's attraction memory. PRISM's LA-NUMA pages achieve the same effect and can be managed by software.

The Sun S3.mp system [19, 21] can be statically configured as either a CC-NUMA or an S-COMA machine. S3.mp has separate operating systems for its CC-NUMA and S-COMA configurations, and pages cannot be dynamically set to CC-NUMA or S-COMA behavior. Further, S-COMA on S3.mp directly encodes physical memory locations in its global addresses, so lazy page migration is not possible.

The MIT StarT-NG [8] is a CC-NUMA system that uses an OS structure with multiple, local physical address spaces. Inter-node communication uses virsical addresses [3] that encode the home node and virtual address at the home node. Virsical addresses directly encode physical memory locations, thus preventing lazy page migration. A recent redesign, StarT-Voyager, has added support for dynamically configuring pages as CC-NUMA or S-COMA and avoids encoding memory locations in virsical addresses.

Reactive-NUMA [10] is an extension to a CC-NUMA architecture to provide S-COMA pages. However, like CC-NUMA systems, R-NUMA uses physical addresses as global addresses. Thus, it requires globally consistent virtual-to-physical address translations and is susceptible to wild writes from any node in the system. It also does not permit lazy page migration. R-NUMA's adaptive policy requires the home node to monitor cache miss traffic in hardware counters, and involves coordination between the home node and client nodes when changing a page from CC-NUMA to S-COMA mode.

6 Summary and conclusions

This paper presents PRISM, a scalable shared-memory system that allows memory pages to be configured with different behaviors so that performance may be tailored to application needs. Each page frame has a mode that dictates the behavior to be provided for that page by the controller. The OS provides system calls, as well as run-time algorithms, to select the behavior of a page independently and dynamically at each node.

The architecture can provide shared memory in both S-COMA and CC-NUMA styles. We introduce the novel concept of Locally-Addressable NUMA (LA-NUMA) memory that allows the system to address remote memory with node-local instead of global physical addresses. LA-NUMA pages behave like CC-NUMA pages except for an additional layer of translation in the coherence controller. Avoiding global physical addresses enables localized memory management and lazy home migration, and enhances fault containment.

Simulations of PRISM on a set of SPLASH benchmarks demonstrate the benefits of PRISM's flexible choice of S-COMA and LA-NUMA behavior. The optimal choice depends on cache sizes, working set sizes and run-time communication patterns. There is no significant performance difference for working sets that fit within the L1/L2 caches. For working sets larger than the L1/L2 caches, S-COMA's page cache acts as a third-level cache and outperforms LA-NUMA. For working sets larger than the page cache, more paging occurs in S-COMA, and LA-NUMA performs better. By allowing pages to be allocated in either S-COMA or LA-NUMA mode, PRISM can match the performance of S-COMA or CC-NUMA systems. Further, because page modes can be independently and dynamically selected for each page at each node, PRISM outperforms both S-COMA and CC-NUMA when the optimal configuration is a mix of S-COMA and LA-NUMA pages.

We examined three simple but effective run-time policies to select page modes dynamically. The adaptive run-time policies significantly outperform static LA-NUMA and S-COMA configurations, and usually achieve within 10% of optimal performance. Further study is needed to examine more sophisticated and provably well-behaved algorithms for selecting page modes.

In conclusion, PRISM's integrated hardware and OS architecture incorporates the positive aspects of existing DSM architectures, from both cache performance and memory management perspectives. It has the potential to end the debate between NUMA and COMA architectures.

Acknowledgments We would like to thank our IBM colleagues, especially Sandra Baylor, Alan Benner, Joefon Jann, Yarsun Hsu, Dean Liberty, Jamshed Mirza, David Sadler, and Gautam Shah. Thanks also to Mark Giampapa for porting Augmint to PowerPC and to Maged Michael for help with Augmint models. We had fruitful discussions with members of the MIT Computation Structures Group.

References

[1] A. Agarwal, R. Bianchini, D. Chaiken, K. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung. The MIT Alewife Machine: Architecture and Performance. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 2–13, Santa Margherita, Italy, June 1995.
[2] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. IEEE Computer, 29(2):18–28, February 1996.
[3] B. S. Ang, D. Chiou, and Arvind. Issues in Building a Cache-Coherent Distributed Shared-Memory Machine using Commercial SMPs. CSG Memo 365, MIT Laboratory for Computer Science, Cambridge, MA 02139, February 1995.
[4] M. J. Bach. The Design of the UNIX Operating System. Prentice-Hall, Englewood Cliffs, NJ, 1986.
[5] S. Baylor, K. Ekanadham, J. Jann, B.-H. Lim, and P. Pattnaik. Lazy Home Migration for Distributed Shared Memory Systems. In Proceedings of the Fourth International Conference on High Performance Computing (HiPC), Bangalore, India, December 1997.
[6] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the Thirteenth Symposium on Operating System Principles. ACM, October 1991.
[7] J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, and A. Gupta. Hive: Fault Containment for Shared-Memory Multiprocessors. In Proceedings of the Fifteenth Symposium on Operating System Principles. ACM, December 1995.
[8] D. Chiou, B. S. Ang, Arvind, M. Beckerle, A. Boughton, R. Greiner, J. E. Hicks, and J. C. Hoe. StarT-NG: Delivering Seamless Parallel Computing. In Proceedings of EURO-PAR '95, Stockholm, Sweden, 1995.
[9] F. Dahlgren and A. Landin. Reducing the Replacement Overhead in Bus-Based COMA Multiprocessors. In Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture, pages 14–23, San Antonio, TX, February 1996.
[10] B. Falsafi and D. Wood. Reactive-NUMA: A Design for Unifying S-COMA and CC-NUMA. In Proceedings of the 24th Annual International Symposium on Computer Architecture, Denver, CO, June 1997.
[11] S. Frank, H. Burkhardt, and J. Rothnie. The KSR1: Bridging The Gap Between Shared Memory and MPPs. In Proceedings of IEEE COMPCON '93, pages 285–294, February 1993.
[12] E. Hagersten, A. Landin, and S. Haridi. DDM - A Cache-Only Memory Architecture. IEEE Computer, 25(9):44–54, September 1992.
[13] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH Multiprocessor. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 302–313, Chicago, IL, April 1994.
[14] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 241–250, Denver, CO, May 1997.
[15] D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH Multiprocessor. IEEE Computer, pages 63–79, March 1992.

[16] K. Li and P. Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321–359, November 1989.
[17] T. Lovett and R. Clapp. STiNG: A CC-NUMA Computer System for the Commercial Marketplace. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 308–317, Philadelphia, PA, May 1996.
[18] A.-T. Nguyen, M. Michael, A. Sharma, and J. Torrellas. The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures. In Proceedings of the 1996 IEEE International Conference on Computer Design (ICCD), October 1996.
[19] A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke, and S. Vishin. The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of the International Conference on Parallel Processing (ICPP), August 1995.
[20] S. K. Reinhardt, J. R. Larus, and D. A. Wood. Tempest and Typhoon: User-Level Shared Memory. In Proceedings of the 21st Annual International Symposium on Computer Architecture, Chicago, IL, April 1994.
[21] A. Saulsbury and A. Nowatzyk. Simple COMA on S3.mp. http://playground.sun.com/pub/S3.mp/simple-coma/isca-95/present.html, 1995.
[22] A. Saulsbury, T. Wilkinson, J. Carter, and A. Landin. An Argument for Simple COMA. In Proceedings of the First IEEE Symposium on High-Performance Computer Architecture, pages 276–285, January 1995.
[23] P. Stenstrom, T. Joe, and A. Gupta. Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 80–92, May 1992.
[24] D. Teodosiu, J. Baxter, K. Govil, J. Chapin, M. Rosenblum, and M. Horowitz. Hardware Fault Containment in Scalable Shared-Memory Multiprocessors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 73–84, Denver, CO, June 1997.
[25] R. Unrau, O. Krieger, B. Gamsa, and M. Stumm. Hierarchical Clustering: A Structure for Scalable Multiprocessor Operating System Design. Journal of Supercomputing, 9(1/2):105–134, March 1995.
[26] B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 279–289, Cambridge, MA, October 1996. ACM.
[27] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, June 1995.
[28] M. Young, A. Tevanian, R. Rashid, D. Golub, J. Eppinger, J. Chew, W. Bolosky, D. Black, and R. Baron. The Duality of Memory and Communication in the Implementation of a Multiprocessor Operating System. In Proceedings of the Eleventh Symposium on Operating System Principles, pages 63–76. ACM, November 1987.
[29] Z. Zhang and J. Torrellas. Reducing Remote Conflict Misses: NUMA with Remote Cache versus COMA. In Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture, pages 272–281, San Antonio, TX, February 1996.