Low Perturbation Address Trace Collection for Operating System, Multiprogrammed, and Parallel Workloads in Multiprocessors

Russell Daigle, Chun Xia, and Josep Torrellas

Center for Supercomputing Research and Development and Computer Science Department
University of Illinois at Urbana-Champaign, IL 61801

Abstract

While address trace analysis is a popular method to evaluate the memory system of computers, getting accurate traces of operating system, multiprogrammed, and parallel workloads in multiprocessors is hard. This is because the true behavior of these real-time loads can be easily altered by the tracing activity. To minimize perturbation, it is usually necessary to use trace-gathering hardware devices. Unfortunately, these devices usually gather limited information. For example, they often only capture physical addresses, can only collect references that miss in on-chip caches, and can only monitor a very small time window. In this paper, we improve the capability of such devices. We present a methodology to instrument the operating system and applications to transfer a wide variety of information to the trace-gathering hardware with very little perturbation. For instance, the information transferred includes the virtual-to-physical address mapping or the sequence of basic blocks executed. Each piece of information is transferred as cheaply and non-intrusively as one or more machine instructions and one or more cache misses. With this approach, instrumenting every basic block in the operating system and application code causes only a 1.5-2.8 times execution slowdown. This is in contrast to a 10-times slowdown for software-based instrumentation in uniprocessors. Furthermore, the behavior of the operating system is largely unaffected.

Keywords: Address tracing, hardware performance monitor, operating system, shared-memory multiprocessor, cache misses.

1 Introduction

Address trace analysis is widely used for performance evaluation of computer memory systems. By analyzing the references issued by the processor, trace-driven evaluations can give accurate insights into cache performance, synchronization activity, and memory system contention. However, the gathering of the traces must not significantly perturb the system being analyzed. Otherwise, the evaluation would be inaccurate. Such perturbation is usually not an issue when gathering traces of single applications in uniprocessors, since the references and their order will not change much. However, non-intrusively obtaining traces of operating system, multiprogrammed, and parallel workloads in multiprocessors is particularly hard due to the real-time nature of these workloads. In these cases, any interference due to gathering address traces could severely alter the references and their order. For example, if tracing significantly slows down the operating system, the relative timing of asynchronous events such as timer or input/output interrupts will change and the traces will be inaccurate.

In these workloads, the approach to trace collection that involves the least perturbation entails the use of trace-gathering hardware devices like data acquisition systems (DAS) or hardware performance monitors.

This work was supported in part by the National Science Foundation under grants NSF Young Investigator Award MIP 94-57436, RIA MIP 93-08098, MIP 93-07910, and MIP 89-20891; NASA Contract No. NAG 1 613; and grant 1-1-28028 from the Univ. of Illinois Research Board. Russell Daigle is currently with Tandem Computers Inc., Cupertino, California.


Unfortunately, while we would like such devices to gather different types of information, they are very limited. Indeed, they usually capture physical addresses only, which makes the data hard to analyze. Furthermore, they only capture the references that miss in on-chip caches or, if they are attached to the bus, only the references that access main memory. Finally, these systems have a limited trace storage capacity, which allows them to monitor only small time windows on the order of tenths of a second. As a result of all this, only limited performance evaluation studies can be performed.

For such common performance monitoring systems, this paper presents a general methodology to instrument the operating system and applications with very little perturbation. The method consists of augmenting the code with dummy load machine instructions. These instructions issue references that bypass the caches and are either TLB-unmapped or have a virtual-to-physical address translation that we can find out. The addresses accessed by these instructions are captured by the data acquisition system and interpreted by the trace postprocessing program. Using this approach, the processor can non-intrusively pass information to the trace postprocessing program with one or more cache misses. In this paper, we also describe two prototype implementations of this approach. Finally, we measure the perturbation caused by the instrumentation and show that, even with much instrumentation, the behavior of the operating system is largely unaffected.

The remainder of the paper is organized into four main sections: Section 2 discusses previous work; Section 3 describes the two system implementations; Section 4 presents our instrumentation methodology; and Section 6 evaluates the accuracy of our approach. As a support to other sections, Section 5 discusses the workloads used in the evaluation, and Section 7 shows examples of the data that we have gathered using our instrumentation methodology.

2 Previous Work

Address trace collection systems for operating system, multiprogrammed, and parallel workloads in multiprocessors can be classified into hardware-based and software-based. Hardware-based systems rely on a trace-gathering hardware device, either attached to the machine or built into the architecture, while software-based systems are purely based on software instrumentation of the code to be traced.

Hardware-based systems [23, 26, 27] can potentially gather very detailed information with practically no perturbation. However, they can be difficult to use. Indeed, they often collect physical addresses only, are prevented from recording all the references by one or more levels of caches, and have very limited trace storage capabilities. The earliest of these systems, ATUM-2 [23], gathered traces of a 4-processor VAX using a modification to the machine's microcode. The microcode was changed so that the addresses of all the memory locations accessed by the processors were written to memory. The system generated virtual addresses. Unfortunately, the extra work slowed down the machine by 20 times. Furthermore, the traces were collected in a buffer that filled in less than a second and had to be dumped to disk. While the buffer was being dumped, the references issued by the processors were lost. The other two systems [26, 27] are the ones described in this paper. They can generate continuous traces without perturbing the system much. They are described in the next section.

Other trace-collecting hardware devices have been used in uniprocessors. One example is ATUM [1], the uniprocessor predecessor of ATUM-2. Other examples are various schemes involving data acquisition systems, for example the one described by Alexander et al. [2]. This system uses a logic analyzer to gather address traces from a NS32016 microprocessor. It can gather complete address traces because the processor monitored does not have on-chip caches. In general, all these systems tend to work with samples because they have a limited trace storage capacity.

Finally, there are hardware monitors that simply count events. One example is Monster [20], a logic analyzer carefully programmed to count cache misses, TLB misses, write buffer stalls, and other events. Other examples are the counter-based systems that have been used on a VAX [11], DASH [17], Cedar [16], and the Alpha [12], among others. Counters, however, are less flexible than traces because they only allow the measurement of a very limited number of event types.

Software-based systems, on the other hand, are attractive because they do not need special hardware support, and hence are portable. However, the disadvantage is that they introduce perturbation into the system they are measuring. While the perturbation may be acceptable for single applications in uniprocessors, it may not be acceptable for operating system, multiprogrammed, and parallel workloads in multiprocessors. We know of no software-based system that can handle all these workloads. There are systems that run on multiprocessors and generate traces of a single parallel application. An example is [14]. There are, however, systems that handle these workloads for uniprocessors, like the ones described by Mogul and Borg [18] and Chen [10]. These two systems run on a DEC workstation and generate traces of a multiprogrammed load and the operating system, respectively. In these systems, the program is instrumented by inserting subroutine calls at the beginning of each basic block and at each load and store operation. The callee subroutine stores the addresses of the memory locations accessed by either data or instructions into an internal trace buffer that is managed by the operating system. Once the buffer is full, it is either dumped to disk or analyzed on the fly, with no traced code executing meanwhile. Overall, this approach induces some perturbation. Indeed, 8 out of 64 general-purpose registers are reserved for tracing in order to minimize memory references. Furthermore, tracing slows down execution by about 10 times.

A related software scheme that has been implemented in uniprocessors is called trap-driven simulation [21, 28]. In this approach, simulations are not driven by traces; they are driven by kernel traps. These traps allow the simulation of a cache as the kernel executes. Indeed, memory traps are set on addresses that are currently not in the simulated cache. When such an address is accessed, the kernel traps. After trapping, the kernel increments the count of simulated misses, clears the trap since the address is now considered to be in the simulated cache, and sets a trap for the address displaced from the simulated cache. TLBs can be simulated in a similar way. Overall, this approach has a speed advantage over trace-driven simulations since it runs at hardware speed when the accesses hit in the simulated cache. In addition, both operating system and application effects are considered. However, there is overhead involved in the extra instructions executed. Furthermore, this approach cannot generate as much information as trace-driven simulations. Examples of this approach are Uhlig et al.'s Tapeworm II [21, 28] and Talluri's [25] TLB simulator.

Finally, there are many software-based systems that only consider application traces and ignore the operating system. Many of them run on uniprocessors and may either trace a multiprocessor program (for example [13]) or a uniprocessor program (for example [9]). Other application-only schemes, instead, are based on traps (for example [22]).
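To make the trap-driven scheme above concrete, the following is a minimal sketch of the trap-handler logic it describes, assuming a direct-mapped simulated cache; it is not taken from Tapeworm II or any other cited system, and set_mem_trap/clear_mem_trap are hypothetical stand-ins for the kernel's memory-protection hooks, stubbed out here so the sketch compiles.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE 32u
    #define NUM_LINES 2048u                      /* 64-Kbyte simulated cache */

    static uintptr_t resident[NUM_LINES];        /* line currently "cached"  */
    static unsigned long sim_misses;

    static void set_mem_trap(uintptr_t line)   { (void)line; /* arm trap    */ }
    static void clear_mem_trap(uintptr_t line) { (void)line; /* disarm trap */ }

    /* Invoked by the kernel when a trapped address is touched, i.e. on a
     * reference that misses in the simulated cache. */
    static void on_memory_trap(uintptr_t addr)
    {
        uintptr_t line = addr & ~(uintptr_t)(LINE_SIZE - 1);
        unsigned  idx  = (unsigned)((line / LINE_SIZE) % NUM_LINES);

        sim_misses++;                            /* count the simulated miss */
        clear_mem_trap(line);                    /* line is now "cached"     */
        if (resident[idx] != 0)
            set_mem_trap(resident[idx]);         /* trap on the victim line  */
        resident[idx] = line;
    }

    int main(void)                               /* drive the handler by hand */
    {
        on_memory_trap(0x10000);
        on_memory_trap(0x10000 + NUM_LINES * LINE_SIZE);   /* conflict miss  */
        printf("simulated misses: %lu\n", sim_misses);
        return 0;
    }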

3 System Implementations

While the previous section has shown that there are many types of trace collection systems, in this paper we focus on those that are able to accurately collect traces of operating system, multiprogrammed, and parallel workloads in multiprocessors. These systems tend to have the following characteristics:

- They have special hardware that collects traces of the addresses referenced by the processor. This trace buffer allows the system to collect information quickly, without slowing down the workloads much. As a result, the workloads may still behave as if they were uninstrumented.







- Since the trace buffer fills quickly and we need to examine long traces, it is desirable for the system to stop automatically before the trace buffer gets full. Then, the trace buffers can be unloaded and the system automatically restarted.

- The system is often unable to gather all the references issued by the processor. This is because it is often physically impossible to connect the trace buffer between the processor and the primary cache. Therefore, we often can only gather references that miss in the caches. In these systems, then, we need special instructions that bypass the caches. These instructions transfer critical information from the processor to the trace buffer without being intercepted by the caches. In general, modern systems provide the capability of bypassing the caches.

- Finally, the system will likely gather physical, not virtual, addresses. This is because, in most systems, the virtual-to-physical address translation takes place on chip. Unfortunately, virtual addresses are helpful to analyze what part of the code or data originated a given reference or cache miss pattern. For this reason, it is useful for the trace generation systems considered to use some type of reverse mapping to provide the virtual addresses too. This usually involves somehow dumping the information in the TLBs or page tables into the trace.

In the course of the past few years, we have developed two trace generation systems that have these characteristics and are connected to two bus-based shared-memory multiprocessors. The first one, which we call the Silicon Graphics System [26], is connected to a Silicon Graphics multiprocessor, while the second one, called the Alliant System [27], is connected to an Alliant multiprocessor. In the remainder of this section, we describe these two systems. In the next section, we will analyze in detail the instrumentation methodology used.

3.1 The Silicon Graphics System

This trace generation system [26] uses a Silicon Graphics POWER Station 4D/340 [7]. The machine is a bus-based cache-coherent multiprocessor with four 33-MHz MIPS R3000 processors. Each processor has three physically-addressed, direct-mapped caches: a 64-Kbyte instruction cache, a 64-Kbyte first-level data cache, and a 256-Kbyte second-level data cache. All caches have 16-byte blocks. The caches are kept coherent with the Illinois cache coherence protocol. The machine has 32 Mbytes of main memory. The operating system running on the machine is release 3.2 of IRIX, a UNIX [6] System V based operating system. Each processor runs one thread of IRIX. All threads share all the operating system data. In addition, they can execute all operating system functions except for network functions, which have to run on CPU 1.

A hardware monitor connected to the bus of the machine records all bus activity without affecting the system. The functionality of the monitor is controlled by a software-programmable Xilinx gate array [29]. For each bus transaction, the monitor captures the 32 bits of the physical address, whether the transaction is a read or a write, the originator of the transaction (either one of the four processors for a cache miss or the I/O system for a DMA transaction), and the number of cycles elapsed since the last bus transaction. The latter gives the exact timing of the transaction. All this information is stored in a buffer with 2 million entries. The trace buffer is also capable of ignoring transactions with a certain combination of the previously mentioned bits.

Depending on the memory reference frequency and miss rate of the workload that is running, bus transactions fill up the trace buffer in 0.5 to 4 seconds. To avoid losing traces, the system works as follows. At the beginning, a master process forks off the workload programs, sets up an alarm for it to wake up at regular intervals, and goes to sleep. At regular intervals, the alarm wakes up the master process. Right after waking up, the master process checks the trace buffer.

If the fraction of the trace buffer that remains empty is more than a threshold value, then the master goes back to sleep. Otherwise, the master suspends all processes, dumps the traces in the buffer to disk, and restarts the processes that were being traced. In this system, the value of the threshold that determines whether or not the master goes back to sleep is chosen in conjunction with the alarm period so that there is never any risk of buffer overflow. Clearly, this setup allows for the gathering of uninterrupted address traces. However, it requires some extra support. First, the master process is given real-time priority to ensure that when it wants to run it is never held back by any other process. This guarantees that the state of the buffer is checked at the correct times, preventing any buffer overflow. Second, we modified the suspend signal so that it is processed immediately by all processes that receive it. Finally, instead of dumping the trace to a local disk, the master process sends the trace to a remotely-mounted disk. In the remote machine, another program postprocesses the trace in parallel while the next segment of the trace is being generated and transferred. This setup exploits parallelism and, in addition, the activity of the postprocessing program does not pollute the caches and memory of the system being measured.

In this system, accesses that hit in any of the three per-processor caches would be invisible to the monitor. However, we can force some accesses to bypass the two levels of caches by using a certain range of addresses. The capability to bypass caches was added by the chip designers to access memory-mapped registers. In practice, any memory access can be made uncachable by setting its four most significant bits to a certain value; these bits are otherwise not used to address memory.
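As an illustration of the master-process logic just described, here is a minimal user-level sketch in C. The helper routines (buffer_empty_fraction, suspend_workload, dump_trace_to_remote_disk, resume_workload) are hypothetical stand-ins for the system-specific pieces and are stubbed out so the sketch compiles; the real master also runs at real-time priority and forks off the traced workload.

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    #define ALARM_PERIOD_S  1        /* wake-up interval (illustrative)        */
    #define EMPTY_THRESHOLD 0.25     /* required empty fraction (illustrative) */

    static double buffer_empty_fraction(void) { return 0.10; }  /* stub        */
    static void   suspend_workload(void)  { puts("suspend traced processes"); }
    static void   dump_trace_to_remote_disk(void) { puts("dump trace buffer"); }
    static void   resume_workload(void)   { puts("resume traced processes"); }

    static void on_alarm(int sig) { (void)sig; }   /* just interrupt pause()   */

    int main(void)
    {
        signal(SIGALRM, on_alarm);
        /* ... fork off the workload programs here ... */
        for (int i = 0; i < 3; i++) {          /* a few iterations for the demo   */
            alarm(ALARM_PERIOD_S);
            pause();                           /* sleep until the alarm fires     */
            if (buffer_empty_fraction() > EMPTY_THRESHOLD)
                continue;                      /* enough room left: keep sleeping */
            suspend_workload();                /* stop all traced processes       */
            dump_trace_to_remote_disk();       /* postprocessed on a remote host  */
            resume_workload();                 /* restart the traced processes    */
        }
        return 0;
    }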

3.2 The Alliant System

This trace generation system [27] uses a 4-processor Alliant FX/8 [3] bus-based multiprocessor. In this machine, each processor has a 16-Kbyte on-board direct-mapped instruction cache. In addition, all processors share one 512-Kbyte 4-way interleaved cache. The line size of the cache is 32 bytes. This cache receives all data accesses plus the instruction accesses that miss in the primary caches. The machine has 32 Mbytes of main memory.

Between the primary caches and the shared one, we connected a hardware performance monitor [4] that can gather the address of all references that reach the cache. The monitor has 4 trace buffers connected to the four processors. Each trace buffer gathers the physical addresses of the memory locations referenced by the loads and stores from one processor. Each buffer contains enough RAM to store one million events. The information stored per reference includes 32 bits for the physical address accessed, 20 bits for a time-stamp, a read/write bit, and other miscellaneous bits.

The trace buffers typically fill in around seven hundred milliseconds. When one of the four buffers nears filling, it sends a non-maskable interrupt (NMI) to each processor. Upon receiving the interrupt, processors trap into an exception handler that we wrote and halt in less than ten machine instructions. Then, a workstation connected to the performance monitor dumps the traces in the buffers to disk. Alternatively, the data may be processed on the fly while being read from the buffer and then discarded. Once the buffers have been emptied, the NMI is cleared and the processors resume. Clearly, this technique allows for the gathering of uninterrupted address traces. Furthermore, this is done with negligible perturbation. Indeed, we coded the interrupt handler so that, upon receiving a NMI, each processor halts very quickly.

In this system, the performance monitor cannot gather the instruction accesses that hit in the primary cache. However, we bypass the primary cache by issuing a data access instead.

The multiprocessor operating system running in this system is Xylem [15], a derivative of Concentrix 3.0. Concentrix is Alliant's commercial operating system and is based on 4.2 BSD UNIX. Xylem is symmetric and all data is shared by all processors.

In both this system and the previous one, the halting of the processors has the effect of tolerating any outstanding I/O operation. This is because the outstanding I/O operation will be completed by the time the processors restart. However, this perturbation is not very significant since, given that the trace buffers run for over 3/4 of a second before filling up, only a small fraction of the total I/O will be outstanding when they fill up.

4 Instrumentation Methodology

Any technique used to instrument the workloads in the two trace generation systems described above must have two characteristics. First, given the real-time nature of the workloads, it must cause very little perturbation. Second, all the instrumentation information that we want to pass to the trace must be encoded as memory accesses to addresses that the postprocessing program can recognize and decode. To address these issues, we have designed an efficient encoding of events that allows us to transfer any instrumentation information from the workload to the trace buffer cheaply and non-intrusively with one or more assembly instructions and one or more cache misses. These instructions are load instructions that issue references that bypass the caches and are either TLB-unmapped or have a virtual-to-physical address translation that we can find out. In this section, we first describe the events that we instrument and then we discuss how we encode the instrumentation.

4.1 Workload Events that We Instrument

In the course of our experiments, we have monitored a variety of events:







- To separate application from operating system references, we need to detect every time we enter and exit the operating system. This requires instrumenting the exception handlers in the operating system. While monitoring operating system entries is easy, monitoring operating system exits is difficult. The reason is that, often, the operating system exit code returns to other operating system code instead of to the application. Furthermore, in general, the number of times that the entry code is executed and the number of times that the exit code is executed are different. Therefore, to determine if we really return to the application, we need to check the value of a status register before the return instruction.

- To generate the dynamic call graph of a program, we need to detect the sequence of subroutine invocations. This involves adding instrumentation at the beginning of each subroutine and, if the program that we are tracing is the operating system itself, right after a subroutine call instruction. The latter instrumentation is needed to determine, for the operating system, what subroutine is executing at any given point in time. This is because subroutine calls and returns in the operating system are not fully nested. For example, given two subroutines Caller1 and Caller2 that have calls to Callee, oftentimes Callee is called from Caller1 and returns to Caller2.

- To determine the exact sequence of instructions executed, it is sufficient to determine the sequence of basic blocks executed. This involves instrumenting the beginning of each basic block.

- To determine whether an application reference is an access to the instruction or to the data cache, we need to know its virtual address. This suffices because, for a given operating system, all instructions are in a range of virtual addresses. Another reason to obtain virtual addresses is to find out what instruction or data structure access is responsible for a given access or miss. Unfortunately, the trace buffer stores physical addresses only. Therefore, we have to somehow save in the trace buffer the virtual-to-physical page translations for all running processes (a sketch of the resulting reverse mapping follows this list). There are two ways of doing this. One way is to intercept the virtual-to-physical page translations as they are loaded in the TLBs and save them in the buffer. The second way is to intercept the virtual-to-physical page translations as they are assigned in hard page faults and save them in the buffer. Obviously, the first alternative involves saving more information as the program runs. In the Silicon Graphics System we used the first alternative because we also wanted to study the performance of the TLBs. We first wrote a system call to dump the TLBs when the workloads were initially started and every time they were restarted. Then, we instrumented the assembly routines that modify the TLBs to inform the trace buffer of the change made. In the Alliant System, instead, we monitored hard page faults. Every time a page fault for the process occurred, we dumped the new translation. In either scheme, however, we need to save the virtual page number, the physical page number, some page status bits (valid, dirty, etc.), and an identifier of the owner process. In the case of saving a change in the TLB, we also save the index of the changed TLB entry.

- To attribute the references and misses to the right process, we need to know what process is running on each processor at any time. This involves instrumenting the scheduler routine in the operating system. In our systems, at every context switch, we send to the tracer the PID of the new running process.

- Finally, we instrument many other events in the operating system. For instance, we instrument entries and exits to the code that handles interrupts, TLB faults, system calls, and the rest of the exceptions, to measure the time duration and number of misses that they involve. Similarly, we determine when the caches are invalidated by the operating system and how many cache lines are invalidated.
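As referenced in the list above, here is a minimal sketch of the reverse mapping that the saved translations enable: translation records recovered from the trace are used to map the physical addresses captured by the monitor back to (process, virtual address). The record layout, page size, and memory size are illustrative assumptions, not taken from the paper.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                    /* assume 4-Kbyte pages            */
    #define NUM_PPAGES (1u << 13)            /* 32 Mbytes of physical memory    */

    struct mapping { uint32_t vpn; uint32_t pid; int valid; };
    static struct mapping rev[NUM_PPAGES];   /* indexed by physical page number */

    /* Called for every translation record (escape) found in the trace. */
    static void note_translation(uint32_t vpn, uint32_t ppn, uint32_t pid)
    {
        if (ppn < NUM_PPAGES)
            rev[ppn] = (struct mapping){ .vpn = vpn, .pid = pid, .valid = 1 };
    }

    /* Convert a captured physical address back to a virtual one.  Returns 0
     * if no translation has been seen yet for that physical page. */
    static int to_virtual(uint32_t paddr, uint32_t *vaddr, uint32_t *pid)
    {
        uint32_t ppn = paddr >> PAGE_SHIFT;
        if (ppn >= NUM_PPAGES || !rev[ppn].valid)
            return 0;
        *vaddr = (rev[ppn].vpn << PAGE_SHIFT) | (paddr & ((1u << PAGE_SHIFT) - 1));
        *pid   = rev[ppn].pid;
        return 1;
    }

    int main(void)                           /* tiny usage example              */
    {
        uint32_t v, p;
        note_translation(0x00400, 0x0123, 42);      /* vpn, ppn, owner pid      */
        if (to_virtual(0x0123ABC, &v, &p))
            printf("phys 0x0123ABC -> pid %u, virt 0x%08X\n", p, v);
        return 0;
    }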

4.2 Encoding the Instrumentation

The trace buffer in the systems described stores addresses only. Therefore, we encode all the information that we want to pass to the trace as memory addresses accessed by dummy references. We call these accesses escape accesses. To make the system work, however, we need to solve two problems: handle the address distortion induced by the TLB translation and prevent the caches from intercepting the accesses.

Focusing first on the TLB translation problem, we note that the code of the operating system usually has a fixed virtual-to-physical mapping. Indeed, in both the Silicon Graphics System and the Alliant System, we issue accesses to virtual addresses in the range used by the operating system code and generate the expected physical addresses. In addition, in the Silicon Graphics System, if the four most significant bits of the virtual address are set to a certain value, the TLB is bypassed altogether. Therefore, in the Silicon Graphics System, the operating system can completely control the physical addresses issued. In the Alliant System, however, only the area used by the operating system code and the static data has a fixed mapping. If we want the operating system to issue escape accesses to other pages, it first needs to inform the tracer of the physical mapping of those pages. Similarly, if we want the application to issue escape accesses to its own code or data pages, we need the operating system to inform the tracer of the physical mapping of the application pages. We will see an example of how this is done later.

To solve the second problem, namely preventing the caches from intercepting the escape accesses, we rely on the fact that many systems provide cache-bypassing loads and stores. This is often done to allow processors to read memory-mapped registers. This is what occurs in the Silicon Graphics System: a range of addresses bypasses the two levels of caches. In practice, any memory access can be made TLB-unmapped and uncachable by setting its four most significant bits to 0xB; these bits are otherwise not used to address memory. The Alliant System has no such mechanism. However, all data accesses in the Alliant System bypass the primary cache and are therefore visible to the tracer. All our escape accesses are data accesses.

To show the approach to encoding used, we examine two examples, one used in the Silicon Graphics System (a RISC instruction set) and one used in the Alliant System (a CISC instruction set). We consider each system in turn.

Example of Encoding for the Silicon Graphics System

In this system, the escape accesses are loads, to avoid changing memory state. Of course, the data that we read is discarded. Furthermore, the reads must be distinguishable from the regular reads in the program. We ensure this by issuing the key escape accesses as byte reads to odd (not even) addresses in a range where only operating system code resides. Obviously, real instruction reads will access even locations only and, therefore, no confusion is possible. Some of the escape accesses are very simple. For example, entering the OS is encoded as follows:

    lb r0, 0xB0000001    ; load byte from address 0xB0000001 into register 0

This instruction bypasses the TLB and the caches to read virtual address 0xB0000001, which gets translated into physical address 0x1. Register 0 is not modified because it always contains the value zero. In practice, we need an extra register to store the constant 0xB0000001. For this reason, we need to save and restore one register.

Other escape accesses, however, are more complex. For instance, when we want to record that a new entry is added to the TLB, we have to send to the trace four pieces of information: the index of the TLB entry changed, the number of the virtual page added, the number of its corresponding physical page plus the page status bits (valid, dirty, etc.), and the identifier of the process that owns the page. To transfer this information, we first read the location that signals a TLB entry change: 0xB0000005. Then, one at a time, we take each of the four pieces of information to send, store it in a register, shift the register left one bit and set the least significant bit to 1 to make the data odd, set the four most significant bits to 0xB to make the address TLB-unmapped and uncachable, and then byte-read from it (Figure 1). Overall, with 5 escape accesses, we have encoded all this information in the trace. While the last four addresses read by the escape access sequence can be anywhere in the address space, the postprocessing program will easily identify and interpret each escape in the sequence. Indeed, after finding a TLB entry change access in the trace, the trace postprocessing program will look for the next four reads from odd addresses by this processor in the trace. Any other cache misses suffered by this processor while sending the escapes are necessarily instruction misses and hence access even addresses only. Therefore, they cannot be confused with escape accesses.

Finally, to ensure that the sequence of escape accesses appears on the bus as expected, we have to prevent compiler optimizations and interrupts. The optimizing compiler could potentially reorganize the assembly instructions to improve pipelining. To disable this effect, we surround the assembly instructions that we add with the directives .set noreorder and .set reorder. This prevents the compiler from making any changes to the sequence of instructions. Similarly, to guarantee that interrupts do not break up a sequence of escapes, we disable interrupts while dumping a sequence of escapes. Disabling and enabling interrupts is very cheap: it only requires 2 or 1 assembly instructions to save or restore the status register, respectively.
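As a sketch of the address arithmetic just described, the following C fragment forms and prints the five escape addresses for a TLB entry change; the field values are made-up sample data, and the real instrumentation issues these addresses as uncached lb instructions with interrupts disabled and compiler reordering suppressed.

    #include <stdint.h>
    #include <stdio.h>

    #define ESC_BASE       0xB0000000u  /* top nibble 0xB: TLB-unmapped, uncached  */
    #define ESC_TLB_CHANGE 0xB0000005u  /* odd code 0x5 signals a TLB entry change */

    /* Shift the payload left one bit, set the least significant bit (odd =>
     * escape, not an instruction read), and force the top four bits to 0xB. */
    static uint32_t escape_addr(uint32_t payload)
    {
        return ESC_BASE | (((payload << 1) | 1u) & 0x0FFFFFFFu);
    }

    int main(void)
    {
        const uint32_t fields[4] = {
            12,       /* index of the TLB entry changed            */
            0x00400,  /* virtual page number                       */
            0x0123,   /* physical page number plus status bits     */
            42        /* identifier of the process owning the page */
        };

        printf("lb 0x%08X    ; signal TLB entry change\n", ESC_TLB_CHANGE);
        for (int i = 0; i < 4; i++)
            printf("lb 0x%08X    ; carries payload %u\n",
                   escape_addr(fields[i]), fields[i]);
        return 0;
    }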

[Figure 1: From top to bottom, the sequence of 32-bit addresses that the operating system reads from to inform the trace buffer of a TLB entry change: the code that signals a TLB entry change (0x5), the TLB entry index, the virtual page number, the physical page number plus status bits, and the ID of the process that owns the page. Each payload is shifted left with its least significant bit set to 1, and the four most significant bits of each address are set to the mask 0xB.]
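On the postprocessing side, recognizing this sequence in the captured trace could look as follows: after a read of physical address 0x5, the next four odd-address reads by the same processor carry the payloads, since any intervening misses by that processor are instruction misses at even addresses. The trace record layout here is an assumption for illustration.

    #include <stdint.h>
    #include <stdio.h>

    struct bus_event { int cpu; int is_read; uint32_t paddr; };

    #define TLB_CHANGE_CODE 0x5u

    static void decode_tlb_escapes(const struct bus_event *ev, int n)
    {
        for (int i = 0; i < n; i++) {
            if (!ev[i].is_read || ev[i].paddr != TLB_CHANGE_CODE)
                continue;                              /* not the lead escape    */
            uint32_t field[4];
            int got = 0;
            for (int j = i + 1; j < n && got < 4; j++) {
                if (ev[j].cpu != ev[i].cpu || !ev[j].is_read)
                    continue;
                if ((ev[j].paddr & 1u) == 0)
                    continue;                          /* even: instruction miss */
                field[got++] = ev[j].paddr >> 1;       /* undo the shift-and-odd */
            }
            if (got == 4)
                printf("cpu %d: TLB[%u] <- vpn 0x%X, ppn+status 0x%X, pid %u\n",
                       ev[i].cpu, field[0], field[1], field[2], field[3]);
        }
    }

    int main(void)
    {
        const struct bus_event trace[] = {             /* toy trace segment      */
            { 0, 1, TLB_CHANGE_CODE },
            { 1, 1, 0x4000 },                          /* another CPU's miss     */
            { 0, 1, (12u << 1) | 1u },
            { 0, 1, (0x00400u << 1) | 1u },
            { 0, 1, (0x0123u << 1) | 1u },
            { 0, 1, (42u << 1) | 1u },
        };
        decode_tlb_escapes(trace, (int)(sizeof trace / sizeof trace[0]));
        return 0;
    }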

Example of Encoding for the Alliant System

Recall that, in this system, while the trace buffer can gather the addresses of all data accesses, it can only get the addresses of the instruction accesses that miss in the primary cache. As an example of our instrumentation, we show how we instrument every basic block of the operating system and applications to reconstruct the complete sequence of instructions executed.

For this experiment, we add one instruction at the beginning of each basic block. The instruction is add ea,an, which adds the source operand (specified as an effective address, ea) to the contents of the address register an, and stores the result in the address register an [3, 19]. We chose this instruction because it has low overhead (very close to that of a load instruction) and does not modify any condition codes. In addition, it can be designed so that it does not modify any register either: the source operand is an address belonging to a few pages that we set aside and initialize to zero. Therefore, the address register is not modified.

Focusing first on instrumenting the operating system code, we set aside a region of memory to issue the escape accesses. We choose a region in the area that has a predictable TLB mapping. Each basic block will read from a different address in this region. These addresses must be aligned to 16-bit boundaries. Given that there are 47,000 different basic blocks in the operating system, it would appear that we would have to set aside 47,000 x 2 bytes of memory to generate the escapes. However, one 4-Kbyte page is enough. This is because we generate two types of escape accesses: one when we enter a routine and one when we enter a basic block. The first one indicates what routine we are in, while the second one indicates the basic block number within a routine. With this technique, therefore, the space overhead can be kept as low as one page.

A similar technique is used to instrument the basic blocks of application code. In this case, the escape accesses read from a table in the data area of the application that we set aside and clear. Unfortunately, application addresses suffer a virtual-to-physical translation. For this reason, we dump to the trace buffer the virtual-to-physical translation of the table. This requires a simple instrumentation of the operating system page fault handler and allows us to reconstruct the virtual addresses. Overall, with this scheme, we can reconstruct the complete sequence of instructions executed by the application with low overhead. The only consideration is that, if several applications are traced concurrently, their tables should all have different virtual addresses. Otherwise, we could confuse their virtual-to-physical translations.

To summarize, this section has described a powerful yet inexpensive method to instrument the operating system and applications. In addition, it causes very little perturbation. To prove it, we will use as an example the instrumentation-intensive experiment just described for the Alliant System: instrumenting all basic blocks of the operating system and applications to reconstruct the complete sequence of instructions executed.
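As a sketch of how the escape stream produced by this experiment can be replayed into a basic-block sequence, the fragment below assumes, purely for illustration, that routine-entry escapes and basic-block escapes read from different halves of the reserved 4-Kbyte page; the paper does not specify the exact split, only that the two kinds of escapes together make one page sufficient.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u
    #define HALF      (PAGE_SIZE / 2)

    /* Replays one escape read, given its offset within the reserved page.
     * The split into halves is an illustrative assumption (see above). */
    static void replay_escape(uint32_t page_offset)
    {
        static uint32_t current_routine;

        if (page_offset < HALF)                       /* routine-entry escape */
            current_routine = page_offset / 2;        /* 16-bit aligned slots */
        else                                          /* basic-block escape   */
            printf("routine %u, basic block %u\n",
                   current_routine, (page_offset - HALF) / 2);
    }

    int main(void)
    {
        /* Example: enter routine 7, then execute its basic blocks 0, 1, and 4. */
        const uint32_t offsets[] = { 14, HALF + 0, HALF + 2, HALF + 8 };
        for (unsigned i = 0; i < sizeof offsets / sizeof offsets[0]; i++)
            replay_escape(offsets[i]);
        return 0;
    }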

In the following, we first present the workloads used for the experiment and then examine the perturbation incurred in the system.

5 Workloads Executed

For this experiment, we choose three representative operating system-intensive workloads. They exercise different parts of the operating system and have different average basic block sizes. The workloads are as follows:





- Ccom is a set of runs of the second phase of the C compiler, which generates assembly code given the preprocessed C code. Four instances of this program are run on four directories with 11 C files each. Each file is, on average, 49 Kbytes long. The code for this compiler pass consists of about 15,000 lines of C source code. This workload has both system-intensive and compute-intensive phases.

- TRFD [5] is a 4-process hand-parallelized version of the TRFD Perfect Club code [8]. The code models the computational aspects of a two-electron integral transformation, and is predominantly composed of matrix multiplies and data interchanges. During execution, the code regularly changes between parallel and serial regimes, while processes synchronize quite frequently. The code consists of about 700 lines of Fortran code and has very large basic blocks. The most important operating system activities incurred through this application are cross-processor interrupts, process synchronization, and gang-scheduling of processes.

- ARC2D+Fsck is a multiprogramming mix of one run of ARC2D and two runs of Fsck applied to two file systems. ARC2D is a 4-process hand-parallelized version of a Perfect Club code. It models a 2-D fluid dynamics code and primarily consists of sparse linear system and rapid elliptic problem solvers. Fsck is a file system consistency check and repair utility. The source code for ARC2D contains about 4,000 lines of Fortran code, while the code for Fsck contains about 4,500 lines of C code and has moderately small basic blocks. The operating system activity for this workload consists of TRFD-like activity for ARC2D, a wide variety of I/O-related code for Fsck, and scheduling activity to change between parallel and serial jobs.

The operating system is also a parallel workload in this study, since it is instrumented and its address traces are collected. Its execution characteristics depend upon the program that it is running, and are described with each workload.

6 Evaluating the Perturbation Incurred

In this section, we show that the method presented to instrument the operating system and applications causes very little perturbation. To prove it, we perform an instrumentation-intensive experiment for the Alliant System: we instrument all basic blocks of the operating system and applications to reconstruct the complete sequence of instructions executed. To measure the perturbation, we examine three issues, namely memory overhead, execution speed slowdown, and timing accuracy. We consider each issue in turn. In our experiments, we consider four environments, namely:

- NoInst: neither the operating system nor the applications are instrumented.
- OSInst: the operating system alone is instrumented.


 

- APInst: the applications alone are instrumented.
- OSAPInst: both the operating system and the applications are instrumented.

6.1 Memory Overhead

One measure of the perturbation incurred by the trace generation system is the increase in memory requirements. The increase in memory requirements is caused by the addition of escape accesses and the data table that the accesses are performed upon. In our system, we add one escape access per basic block. Using the encoding detailed in Section 4, this corresponds to 6 bytes per basic block. With this scheme, therefore, the average size of the basic block determines the relative increase in memory use. In particular, for scientific applications with large basic blocks the memory overhead will be relatively less important than for system-intensive workloads where basic blocks are small.

The static sizes of the codes before and after the instrumentation are shown in Table 1. The table is organized with the codes with the highest relative increase in size at the top. As the table shows, code expansion ranges from a mere 2.1% up to 48.3%. Applications with many branches, and therefore small basic blocks, suffer the largest increases. For example, Ccom and the kernel increase by 48.3% and 39.1%, respectively. On the other hand, regular scientific applications with large basic blocks increase the least. For example, TRFD increases by 2.1% only.

Table 1: Static increase in code size due to instrumentation.

  Workload    Size Before (Kbytes)    Size After (Kbytes)    Increase (%)
  Ccom                127.0                  188.4                48.3
  Kernel              930.8                 1295.1                39.1
  ARC2D               162.9                  214.4                31.6
  Fsck                 69.6                   90.1                29.5
  TRFD                 99.3                  101.4                 2.1
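As a rough cross-check (not stated in the paper), the 6 bytes added per basic block relate the expansion percentages in Table 1 to the average static basic-block size of each code:

    \[
      \text{expansion} \;\approx\; \frac{6~\text{bytes}}{\bar{B}}
      \quad\Longrightarrow\quad
      \bar{B}_{\mathrm{Ccom}} \approx \frac{6}{0.483} \approx 12~\text{bytes},
      \qquad
      \bar{B}_{\mathrm{TRFD}} \approx \frac{6}{0.021} \approx 290~\text{bytes},
    \]

which is consistent with Ccom having small basic blocks and TRFD very large ones.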

Overall, the memory overhead is small. In other instrumentation systems, the expansion induced is much larger. For example, the instrumentation in [14] causes a code expansion of 400%. As a result, our instrumentation does not appreciably change the rate of page faults recorded by the operating system. Furthermore, in a system with a RISC instruction set, the relative overhead could be even smaller. Indeed, in RISC systems, the basic blocks are typically larger because the RISC instructions are less dense than CISC ones [24].

6.2 Execution Speed Slowdown

A second measure of the perturbation incurred by the trace generation system is the slowdown caused in the execution speed of the workloads. The slowdown is caused by the execution of the instructions added to generate the escape references, plus any effects that these instructions induce in the hit rate of caches and TLBs and in the paging activity of the operating system. The increase in instructions executed cannot be deduced directly from the data in Table 1, since the table shows static data. Instead, we need to determine the change in the dynamic count of instructions. The applications that will suffer the worst slowdowns are those that have a small average basic block size over the common execution paths.

To measure the execution speed slowdown, we perform timing runs of uninstrumented and instrumented versions of the workloads and kernel. We start with the kernel instrumentation. The effect of the kernel instrumentation is shown in Table 2. The table shows the execution times for the environment without instrumentation (NoInst) and with kernel instrumentation only (OSInst). For each environment, the table shows the user and system time.

Starting with the NoInst environment, we see from Column 3 in the table that these are system-intensive workloads. Indeed, the system time accounts for a bit less than 50% of the total time in two workloads and about 10% in the other one. Once we move to OSInst, the system time increases noticeably (Column 6). This is the result of the extra instructions executed and their effects on the memory hierarchy. The ratio of system times after and before instrumentation is shown in Column 8. From the numbers, we see that the system slowdown ranges between 2.44 and 3.24. This is indeed a small number and is a good hint that the behavior of the operating system will not change much. Looking now at the user time (Columns 2 and 5), we see that, as expected, it does not change much for Ccom and TRFD. However, for ARC2D+Fsck, it increases significantly. Clearly, instrumenting the kernel affected the uninstrumented application. The reason is likely to be that the code expansion in the new kernel changed the cache mapping of the kernel code and increased the conflict with the applications. Overall, adding user and system time, the OSInst environment is only 1.13-2.12 times slower than NoInst (Column 9). This is a small slowdown indeed.

Table 2: User and system execution times for the NoInst and OSInst environments.

                          NoInst                         OSInst                  OSInst/NoInst
  Workload      User   System  User+System    User   System  User+System    System  User+System
                 (s)     (s)       (s)          (s)     (s)       (s)
  Ccom          47.5     4.5      52.0         47.7    11.0      58.7         2.44      1.13
  TRFD          22.9    21.4      44.3         24.6    69.4      94.0         3.24      2.12
  ARC2D+Fsck    61.6    47.3     108.9         88.3   126.1     214.4         2.67      1.97

To see the impact of application instrumentation, Table 3 examines an environment with application-only instrumentation (APInst) and application plus kernel instrumentation (OSAPInst). Comparing the user time for NoInst (Column 2 of Table 2) to APInst (Column 2 of Table 3), we see that the user time increases to a variable degree due to application instrumentation. Again, the system time stays approximately constant except for the ARC2D+Fsck workload, where conflicts occur (Column 3 in Tables 2 and 3). Overall, if we instrument applications only, the workloads run only 1.05-1.84 times slower (Column 6 in Table 3).

Table 3: User and system execution times for the APInst and OSAPInst environments.

                          APInst              APInst/NoInst               OSAPInst            OSAPInst/NoInst
  Workload      User   System  User+System   User  User+System    User   System  User+System    User+System
                 (s)     (s)       (s)                              (s)     (s)       (s)
  Ccom          68.4     4.7      73.1        1.44     1.41        69.4    11.4      80.8            1.55
  TRFD          26.0    20.4      46.4        1.14     1.05        28.8    94.8     123.6            2.79
  ARC2D+Fsck   125.6    75.2     200.8        2.04     1.84       137.3   155.7     293.0            2.69

Finally, Columns 7 and 8 of Table 3 show the user and system time for the OSAPInst environment. Comparing these times to the ones with only application or kernel instrumentation, we see that the only cases where the instrumentation causes interference are ARC2D+Fsck and the system time in TRFD. Overall, as shown in the last column of Table 3, under OSAPInst, the workloads run only 1.55-2.79 times slower than under the non-instrumented environment (NoInst). This slowdown is substantially smaller than the slowdowns of the other address tracing techniques discussed in Section 2. Indeed, some of those techniques involve slowdowns on the order of 10 times while tracing only applications. In addition, our technique will likely involve even smaller slowdowns with RISC processors, because our instrumentation is relatively sparser with RISC's less dense instruction encoding.

6.3 Timing Accuracy

Given the real-time nature of the operating system, even the 1.55-2.79 times slowdown caused by our system could change the timing of critical operating system functions in a way that makes our trace simulations inaccurate. For example, the slowdown could cause some operations to become relatively less frequent while some other operations become relatively more frequent. If this effect is significant, then our traces would contain invalid information and could not be used for performance evaluation studies.

To determine if this is the case, we measured the distribution of operating system routines invoked under the NoInst and OSAPInst environments running the same workload. We then compared the two. If the distributions are significantly different, then the OSAPInst traces are invalid; otherwise, we have reasonable confidence that the OSAPInst traces model the behavior of the uninstrumented code with enough accuracy.

Obtaining the distribution of operating system routines invoked in the OSAPInst environment is easy. Indeed, since we have instrumentation for all the basic blocks of the application and operating system, we can determine what routines are being called. Unfortunately, it is not possible to get the same information for the NoInst environment. This is because, without escape accesses, the primary cache intercepts an unknown number of references and the call graph cannot be regenerated from the traces. To solve this problem, we add one escape access for each routine. The resulting instrumentation is tiny, namely 1 instruction per routine invoked. Therefore, we claim that such a system is very close to NoInst.

Figure 2 shows the results of our measurements for the OSAPInst environment and the NoInst environment with one escape access per routine. The latter environment we call LowOSInst. In the figure, we have numbered the routines according to the order in which they are laid out in memory. The total number of operating system routines is about 2,500. Then, we count the number of times that each routine is invoked under the LowOSInst (Chart (a)) and OSAPInst (Chart (b)) environments. Finally, we have normalized all the data to 100 invocations. The data in the figure corresponds to the Ccom workload; the other two workloads behave similarly.

From the figure, we see that the two environments exhibit similar distributions of routine invocations. Indeed, the heights and locations of the peaks in the two charts are similar. If the 1.55-2.79 times slowdown induced by the instrumentation had perturbed the behavior, some peaks would have become significantly higher than before while others would have become significantly lower. This is not what we see in the plots. It is true that there are some small differences in the peaks marked as 1, 2, 3, and 4, which correspond to time-sensitive routines that handle locks and TLB accesses. However, these differences are small and, overall, the charts look remarkably similar. Therefore, we conclude that the two sets of traces are reasonably close to each other and that the traces with the OSAPInst instrumentation are accurate.
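The comparison in Figure 2 is visual; as a simple numeric cross-check on the same normalized data (not something the paper does), one could compute the largest per-routine difference between the two distributions:

    #include <stdio.h>

    /* Normalizes both invocation-count vectors to 100 total invocations and
     * returns the largest absolute per-routine difference, in percentage
     * points, between the two environments. */
    static double max_normalized_diff(const unsigned long *a,
                                      const unsigned long *b, int n)
    {
        unsigned long ta = 0, tb = 0;
        for (int i = 0; i < n; i++) { ta += a[i]; tb += b[i]; }
        double worst = 0.0;
        for (int i = 0; i < n; i++) {
            double d = 100.0 * (double)a[i] / (double)ta
                     - 100.0 * (double)b[i] / (double)tb;
            if (d < 0.0) d = -d;
            if (d > worst) worst = d;
        }
        return worst;
    }

    int main(void)                                  /* toy data, 4 "routines" */
    {
        const unsigned long low[]  = { 50, 30, 15, 5 };
        const unsigned long inst[] = { 48, 31, 16, 5 };
        printf("largest difference: %.1f%%\n", max_normalized_diff(low, inst, 4));
        return 0;
    }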

[Figure 2: Operating system routines invoked under an environment with a very lightly-instrumented operating system (LowOSInst, Chart (a)) and one where the kernel and applications have their basic blocks instrumented (OSAPInst, Chart (b)). The routines are numbered on the X-axes according to the order in which they are laid out in memory. The total number of routine invocations is normalized to 100. The data corresponds to the Ccom workload.]

7 Examples of Uses of the Systems

The trace generation systems described and evaluated in this paper can be used to gather highly-accurate information that few other systems can generate. In particular, they can reveal how a multiprocessor operating system uses the memory hierarchy of a shared-memory multiprocessor. Clearly, this information can be used to design faster operating system code and better hardware support. In this section, we give two examples of the issues that these systems can be used for.

One use of the systems is to study the performance of a given cache hierarchy. Indeed, with the Alliant System setup, we can simulate different cache configurations and determine the data and instruction cache miss rates, or what operating system operations or data structures cause the misses [26, 27]. With the SGI System, we can simulate large caches that include the caches in the real machine and study the misses that they would suffer. As an example of the use of the Alliant System, Figure 3 shows the number of references issued to operating system code as a function of the virtual address of the referenced instruction. The plot was generated while running the ARC2D+Fsck workload. Similarly, Figure 4 shows the number of cache misses on operating system code as a function of the virtual address of the referenced instruction. The data corresponds to 8-Kbyte direct-mapped caches with 32-byte cache lines, and was generated while running the ARC2D+Fsck workload. Chart (a) in the figure shows all operating system misses, which are then decomposed into misses caused by self-interference (OSOSMiss, shown in Chart (b)) and misses caused by interference with the application (OSApplMiss, shown in Chart (c)). Misses caused by first-time references are negligible.

A possible second use of the systems is to gather more detailed information on the structure of the operating system routines and basic blocks and then perform compiler optimizations on the operating system code or data. For example, we can determine the basic blocks that form the loops in the operating system. We can then generate data on the sizes of the loops and the average number of iterations present in loops. This information can be used to optimize the layout of the operating system code in memory to minimize cache conflicts [27]. As an example, Figure 5 shows the characteristics of the operating system loops that do not call any subroutine. The data was generated while the Ccom workload was running. The leftmost chart in the figure shows a distribution of these loops according to how many iterations are executed per loop invocation, while the rightmost chart shows a distribution of these loops according to their static size.
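As a minimal sketch of the first use described above (driving a cache simulation from the captured records), the fragment below simulates a direct-mapped 8-Kbyte cache with 32-byte lines over a simplified trace record; the record format is an illustrative assumption, and the real studies simulate the actual hierarchies and workloads described earlier.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE  32u
    #define NUM_LINES  256u                     /* 8-Kbyte direct-mapped cache */

    struct trace_rec { int cpu; int is_write; uint32_t paddr; };

    static uint32_t tag[NUM_LINES];
    static int      valid[NUM_LINES];
    static unsigned long refs, misses;

    static void simulate(const struct trace_rec *t, unsigned long n)
    {
        for (unsigned long i = 0; i < n; i++) {
            uint32_t line = t[i].paddr / LINE_SIZE;
            uint32_t idx  = line % NUM_LINES;
            refs++;
            if (!valid[idx] || tag[idx] != line) {   /* miss: fill the line */
                misses++;
                valid[idx] = 1;
                tag[idx] = line;
            }
        }
    }

    int main(void)
    {
        const struct trace_rec t[] = {              /* toy trace              */
            { 0, 0, 0x1000 }, { 0, 0, 0x1004 }, { 0, 1, 0x3000 }, { 0, 0, 0x1000 },
        };
        simulate(t, sizeof t / sizeof t[0]);
        printf("%lu references, %lu misses\n", refs, misses);
        return 0;
    }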

[Figure 3: Number of references to operating system code as a function of the virtual address of the referenced instruction. The plot was generated while running the ARC2D+Fsck workload.]

[Figure 4: Number of misses on operating system code as a function of the virtual address of the referenced instruction, for 8-Kbyte direct-mapped caches with 32-byte lines while running the ARC2D+Fsck workload. Chart (a) shows all operating system misses (OS), Chart (b) the misses caused by self-interference (OSOS), and Chart (c) the misses caused by interference with the application (OSAppl).]

[Figure 5: Characteristics of the operating system loops that do not call any subroutine, gathered while running the Ccom workload: distribution of the loops by iterations executed per invocation (left chart) and by static loop size (right chart).]