Real-time Scheduling on Multithreaded Processors

J. Kreuzinger, A. Schulz, M. Pfeffer, Th. Ungerer
Institute for Computer Design and Fault Tolerance
University of Karlsruhe
D-76128 Karlsruhe, Germany
[email protected]

Abstract

This paper investigates real-time scheduling algorithms on upcoming multithreaded processors. As evaluation testbed we introduce a multithreaded processor kernel which is specifically designed as the core processor of a microcontroller or system-on-a-chip. Handling of external real-time events is performed through multithreading. Real-time threads are used as interrupt service threads (ISTs) instead of interrupt service routines (ISRs). Our proposed microcontroller supports multiple ISTs with zero-cycle context switching overhead. We investigate the behavior of fixed priority preemptive (FPP), earliest deadline first (EDF), least laxity first (LLF), and guaranteed percentage (GP) scheduling with respect to multithreaded processors. Our finding is that GP and LLF result in a good blending of instructions of different threads, enabling a multithreaded processor to utilize latencies best. Assuming a zero-cycle context switch, LLF performs best; however, its implementation costs are prohibitive.

U. Brinkschulte, C. Krakowski
Institute for Process Control, Automation and Robotics
University of Karlsruhe
D-76128 Karlsruhe, Germany
[email protected]

1 Introduction

The target market of our project is the wide-spread market of embedded systems, in particular embedded real-time systems. In this area microcontrollers are typically preferred over general-purpose processors because of their on-chip integration of RAM and peripheral controllers, resulting in smaller and cheaper hardware. Execution performance is not the main criterion for microcontrollers; support for real-time event handling, rapid context switching, and small memory requirements are also essential. Rapid context switching is a basic feature of the multithreaded processor technique, which has been investigated for several years for its ability to utilize latencies. Recently several multithreaded processors were announced by industry. A multithreaded processor is able to pursue multiple threads of control in parallel within the processor pipeline. The functional units are multiplexed between the thread contexts. Most approaches store the thread contexts in different register sets on the processor chip. Latencies that arise from cache misses, long-running operations, or other pipeline hazards are masked by switching to another thread. Multithreaded processors are able to bridge these latencies efficiently if there are enough parallel executable threads as workload and if the time necessary for switching threads is very small. Recent industry announcements of high-performance processors include a 4-threaded Alpha processor by DEC/Compaq [1] and Sun's MAJC-5200, which features two 4-threaded processors on a single die [2]. Both are designed as high-performance processors and will not be suitable for low-cost embedded systems.

Contemporary microprocessors and microcontrollers activate interrupt service routines (ISRs) for event handling. Events of different priorities are handled by ISRs with appropriate priorities. The generally used priority scheme is fixed priority preemptive (FPP). This scheme has several disadvantages: it restricts the guaranteed processor utilization to less than 100%, with a worst case of only about 70% [3], and the handling of lower-priority events may be blocked for long periods by higher-priority events. This forces the programmer to keep ISRs as short as possible and to move work outside the ISR. If several concurrent time-critical events must be handled by ISRs, the resulting programs are complex and hard to test.

The fast context switch of multithreaded processors offers a solution to the problem of real-time event reactions in microcontrollers used in embedded systems.
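The roughly 70% figure cited above is the classical Liu and Layland utilization bound for n periodic tasks under rate-monotonic fixed-priority scheduling [3]. As a quick illustration (our own sketch, not part of the original paper), the bound U(n) = n * (2^(1/n) - 1) can be computed directly:

```python
# Liu & Layland utilization bound for n periodic tasks under
# rate-monotonic fixed-priority scheduling: U(n) = n * (2^(1/n) - 1).
# As n grows, the bound approaches ln 2 ~= 0.693, i.e. roughly 70%.
import math

def rm_bound(n: int) -> float:
    return n * (2 ** (1 / n) - 1)

print(rm_bound(1))   # 1.0 (a single task may use the full processor)
print(rm_bound(2))   # ~0.828
print(rm_bound(10))  # ~0.718
print(math.log(2))   # ~0.693, the limit for many tasks
```

Any task set whose total utilization stays below this bound is guaranteed schedulable under FPP with rate-monotonic priorities; above it, schedulability depends on the concrete periods.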
Requirements are a design with preferably zero-cycle context switch overhead and hardware support for priority schemes. These techniques are particularly effective if the hardware scheduler is embedded deeply in the processor pipeline and able to schedule on an instruction-per-instruction basis. Instruction scheduling is determined first by the priority scheme and second by latency bridging; i.e., if the priority scheme schedules an instruction that cannot be fed into the pipeline because of a control, data, or structural hazard arising from previously issued instructions, then an instruction of a lower-priority thread may be issued instead.

Our Komodo project [4] explores the suitability of multithreading techniques in embedded real-time systems. We propose multithreading as an event-handling mechanism that allows efficient handling of simultaneous overlapping events with hard real-time requirements. We design a microcontroller with a multithreaded processor core that triggers so-called interrupt service threads (ISTs) instead of ISRs for event handling. Our Komodo microcontroller features a zero-cycle context switch overhead and hardware support for priority schemes. Because of its application in embedded systems, the processor core of the Komodo microcontroller is kept at the hardware level of a simple microcontroller similar to the M68302. Our target architecture is a simple pipelined processor kernel able to issue one instruction per cycle.

Recently, multithreading has also been proposed for handling internal events ([5], [6], [7]) in future high-end processors, applying one or more threads for exception handling and executing these threads simultaneously with the main thread that caused the exception. However, the fast context switching ability of multithreading has rarely been explored in the context of microcontrollers for handling external hardware events. Besides our own approach, the EVENTS mechanism [8] proposes an FPGA-based processor-external hardware scheduler that triggers context switches in a single or in multiple multithreaded MSparc processors [9].
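The instruction-per-instruction selection rule described above can be sketched as follows. This is a minimal model of our own (the names and the priority encoding are assumptions, not taken from the Komodo implementation): each cycle, the scheduler walks the threads in priority order and issues from the first one that is ready, i.e. whose instruction window holds an instruction and whose previously issued instruction is not still causing a latency.

```python
# Sketch of priority-first, latency-bridging instruction selection.
# All names are hypothetical; lower priority value = more urgent (assumption).
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    priority: int        # lower value = higher priority
    window_filled: bool  # instruction window holds a decodable instruction
    latency_left: int    # cycles until this thread may issue again

def select_thread(threads):
    """Return the thread to issue from this cycle, or None (pipeline bubble)."""
    for t in sorted(threads, key=lambda t: t.priority):
        if t.window_filled and t.latency_left == 0:
            return t
    return None  # only when no thread is ready does a bubble occur

# The highest-priority thread is stalled on a 2-cycle load latency,
# so selection falls through to the next ready thread.
ts = [Thread(0, 0, True, 2), Thread(1, 1, True, 0), Thread(2, 2, True, 0)]
print(select_thread(ts).tid)  # 1
```

The point of the rule is that a latency slot of the urgent thread is never wasted as long as any lower-priority thread has an issuable instruction.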
This paper investigates real-time scheduling algorithms suitable for multithreaded processors and presents performance evaluations on our evaluation testbed, a multithreaded Java microcontroller called Komodo. Section 2 briefly describes the proposed Komodo microcontroller. Section 3 focuses on the evaluation of the different priority schemes assuming a hardware implementation in the Komodo core processor pipeline.

[Figure 1: block diagram showing instruction fetch (program counters PC1-PC4) and memory interface; instruction windows IW1-IW4 feeding the instruction decode unit with priority manager and micro-ops ROM; a data path with MEM and ALU units and the stack register sets; and a signal unit receiving external signals.]
2 The proposed Komodo microcontroller

The Komodo microcontroller [10] is a multithreaded Java microcontroller which supports multiple ISTs with zero-cycle context switching overhead and several priority schemes. Because of its application in embedded systems, the processor core of the Komodo microcontroller is kept at the hardware level of a simple scalar processor. As shown in Fig. 1, the four-stage pipelined processor core consists of an instruction fetch unit, a decode unit, a memory access unit (MEM), and an execution unit (ALU). Four stack register sets are provided on the processor chip. A signal unit triggers IST execution on the occurrence of external signals.

Figure 1. Block diagram of the Komodo microcontroller

The instruction fetch unit holds four program counters (PCs) with dedicated status bits (e.g. thread active/suspended); each PC is assigned to a different thread. Four-byte portions are fetched over the memory interface and put into the corresponding instruction window (IW). Several instructions may be contained in a fetch portion because of the average bytecode length of 1.8 bytes. Instructions are fetched depending on the fill levels of the IWs, which is sufficient as an instruction fetch strategy [11].

The instruction decode unit contains the above-mentioned IWs, dedicated status bits (e.g. priority), and counters. A priority manager decides, based on these bits and counters, from which IW the next instruction will be decoded. We define several priority schemes to handle real-time requirements; in detail, we implemented the fixed priority preemptive (FPP), earliest deadline first (EDF), least laxity first (LLF), and guaranteed percentage (GP) scheduling schemes. The priority manager applies one of the implemented thread priority schemes for IW selection. However, latencies may result from branches or memory accesses. To avoid pipeline stalls, instructions from threads other than the highest-priority thread can be fed into the pipeline: the decode unit predicts the latency after such an instruction and proceeds with instructions from other IWs. There is no overhead for such a context switch. No save/restore of registers or removal of instructions from the pipeline is needed, because each thread has its own stack register set. Thus a high-priority IST may execute at full speed, while lower-priority ISTs and non-real-time threads are restricted to the latency pipeline slots of the high-priority IST.

A bytecode instruction is decoded either to a single micro-op or a sequence of micro-ops, or a trap routine is called. Each opcode is propagated through the pipeline together with its thread id. Opcodes from multiple threads can be simultaneously present in the different pipeline stages. Instructions for memory access are executed by the MEM unit; all other instructions are executed by the ALU. Finally, the result is written back to the stack register set of the corresponding thread.

External signals are delivered to the signal unit from the peripheral components of the microcontroller core, e.g. timers, counters, or the serial interface. On the occurrence of such a signal the corresponding IST is activated. As soon as an IST activation ends, its assigned real-time thread is suspended and its status is stored. An external signal may activate the same thread again. In our current implementation, the Komodo microcontroller holds the contexts of up to four threads, which are directly mapped to hardware threads. Three threads may be real-time threads; all remaining threads must be non-real-time and are scheduled within the fourth hardware thread. To scale up to larger systems with more than three real-time threads, we propose parallel execution on several microcontrollers connected by a middleware platform called OSA+ [4].

Because of the unpredictability of cache accesses, non-cached memory access is preferred for real-time microcontrollers; therefore a cache is omitted from our Komodo microcontroller. The resulting load latencies are bridged by the priority manager scheduling instructions of other threads. The Komodo processor is software-simulated and hardware-implemented on a Xilinx FPGA, yielding chip-space requirements of about 55000 gates for a four-threaded processor kernel [12].
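The signal-to-IST activation cycle described above can be modeled in a few lines. This is a toy model of our own (class names and the signal mapping are hypothetical): an external signal marks the mapped hardware thread active; when the IST finishes, the thread suspends itself and waits for the next signal.

```python
# Toy model (ours, hypothetical names) of IST activation:
# external signal -> activate mapped hardware thread;
# IST completion -> suspend the thread until the next signal.

class HardwareThread:
    def __init__(self, tid):
        self.tid = tid
        self.active = False

class SignalUnit:
    def __init__(self, mapping):
        self.mapping = mapping  # external signal name -> HardwareThread

    def raise_signal(self, sig):
        self.mapping[sig].active = True  # activate the corresponding IST

def ist_done(thread):
    thread.active = False  # suspend; the same signal may reactivate it later

timer_thread = HardwareThread(0)
unit = SignalUnit({"timer": timer_thread})
unit.raise_signal("timer")
print(timer_thread.active)  # True
ist_done(timer_thread)
print(timer_thread.active)  # False
```

In the real hardware this activation costs no cycles, since the thread's stack register set is already resident on chip.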

3 Evaluation

In this section we evaluate the timing behavior and the latency slot use of real-time scheduling strategies on a multithreaded processor. We examine the four scheduling techniques earliest deadline first (EDF), least laxity first (LLF), fixed priority preemptive (FPP), and guaranteed percentage (GP). As benchmarks we choose real application programs that are typical for real-time systems. The first program is a simple impulse counter (IC) which reads data from an interface, scales it, and stores it in memory. The other two programs are a PID element (PID) and a rather costly fast Fourier transform (FFT). Our testbed is the Komodo microcontroller with four hardware threads and a zero-cycle context switch. Latencies from memory accesses and branches are bridged by instructions of threads other than the highest-priority thread. Table 1 specifies the size of the three measurement programs and the rate of memory accesses and branches causing latencies.

Table 1. The benchmark programs

  Program            Bytecodes   Load/Store   Branches
  Impulse counter            9        22.2%      11%
  PID element            4 445        11.3%       7.8%
  FFT                3 288 484        13.2%       6%

In the first part of the evaluation we executed four identical programs on the processor. In this first experiment, all four threads were given the same real-time parameters (deadline = period, starting processor utilization(1) of 0.25 for each thread). The common deadline is then shortened until the scheduler can no longer keep it. Under FPP all threads obtain the same priority; under GP each thread obtains 25% of the available computing time. The result is thus a performance indication of the scheduling technique. The results of the different schedulers are compared in Figure 2. The presentation is scaled to a non-multithreaded processor, i.e. a value of 1 corresponds to the performance of a processor that utilizes no latencies but also needs no additional clock cycles for a context switch.

(1) Processor utilization = execution time without latency utilization / deadline.

[Figure 2: bar chart of the speed-up of FPP, EDF, LLF, and GP over the benchmarks IC, PID, and FFT; speed-ups are about 1.67 for IC under all four schedulers and between 1.19 and 1.30 for PID and FFT.]

Figure 2. Speed-up of the computation times of different schedulers with the same threads
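The schemes compared in this section differ only in the key each one uses to rank the ready threads. A compact sketch (our own formulation; field names and the priority encoding are assumptions) of the three key-based schemes:

```python
# Selection keys of the key-based schemes evaluated here (our formulation).
# `now` is the current time in cycles; GP is interval-based and does not
# reduce to a single key, so it is not shown.
from dataclasses import dataclass

@dataclass
class RTThread:
    name: str
    fixed_priority: int  # for FPP: lower value = more urgent (assumption)
    deadline: int        # absolute deadline in cycles
    remaining: int       # remaining execution time in cycles

def pick_fpp(threads, now):
    return min(threads, key=lambda t: t.fixed_priority)

def pick_edf(threads, now):
    return min(threads, key=lambda t: t.deadline)

def pick_llf(threads, now):
    # laxity = time to deadline minus remaining work; least laxity first
    return min(threads, key=lambda t: (t.deadline - now) - t.remaining)

ts = [RTThread("A", 1, deadline=100, remaining=90),
      RTThread("B", 0, deadline=80,  remaining=10)]
print(pick_fpp(ts, 0).name)  # B (smallest fixed priority value)
print(pick_edf(ts, 0).name)  # B (earlier deadline)
print(pick_llf(ts, 0).name)  # A (laxity 10 vs. 70)
```

Note that LLF re-ranks the threads as time passes, since waiting shrinks a thread's laxity, whereas the FPP and EDF keys are constant for a given job; this difference drives the context-switch behavior discussed below.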

The multithreaded processor increases the speed-up for all benchmark programs and scheduling schemes and thus enhances the possible sample rates. All scheduling strategies provide the same speed-up for the impulse counter (IC). This is explained by the extreme shortness of the IC program, which does not allow the differences between the scheduling strategies to show. More interesting are the PID element and the FFT, which yield different speed-ups with respect to the scheduling strategies. The differences are caused by the following behavior, which is typical for multithreaded processors: the performance gain of a multithreaded processor arises from the utilization of instruction latencies by switching context to instructions of another thread. To be effective, a pool of executable instructions of different threads must be present. Techniques like FPP or EDF tend to lessen this pool, because first the most urgent thread is executed, then the second most urgent thread, and so on. The most urgent thread executes as on a non-multithreaded processor, which allows the worst-case execution time to be computed as usual.

Figure 3 depicts this behavior for an EDF (or FPP) scheduling of four threads with the same code. Let us assume all four threads start at time zero. Up to time t1 the most urgent thread is executed with highest priority and the other three threads are ready for execution. To utilize the instruction latencies that arise in the execution of the most urgent thread, the processor can switch to instructions of one of the other three threads. However, after time t1 there are just two, after t2 there is just one, and after t3 there is no thread left for the use of latencies arising from the last running thread.

[Figure 3: timeline of threads T1-T4 with deadlines d1-d4 under EDF; the number of ready threads drops from 4 to 1 at times t1, t2, t3.]

Figure 3. Four equal threads with EDF scheduling

LLF and GP perform better than FPP and EDF for the PID and FFT programs. Figure 4 shows for LLF scheduling that all threads keep executable instructions until all threads terminate simultaneously. However, a large number of context switches is induced by the equal deadlines and the permanently changing least laxities. This provides an instruction mix that keeps the threads alive for a maximum of time and so creates optimal conditions for the use of latency slots on the multithreaded processor. GP creates a similar behavior by its frequent context switches. Figure 5 shows the frequency of context switches caused by the different strategies.

[Figure 4: timeline of threads T1-T4 with deadlines d1-d4 under LLF; all four threads remain ready until the common end.]

Figure 4. Four equal threads with LLF scheduling

[Figure 5: number of context switches (% of cycles) for FPP, EDF, LLF, and GP on IC, PID, and FFT; FPP causes almost none (at most about 0.03%), EDF about 8%, while LLF and GP cause context switches in roughly 50-77% of the cycles.]

Figure 5. Context switches of different schedulers

From these considerations we conclude as requirements for an optimal real-time scheduler on a multithreaded processor that is able to utilize instruction latencies: the scheduler must sustain each thread as long as possible, i.e. up to its deadline. Under the condition of a zero-cycle context switching overhead, a high number of context switches is a quality factor for a good scheduler, because it creates an instruction mix which keeps the threads alive as long as possible.

The second experiment uses all three programs and an additional non-real-time thread. We assume that the deadlines equal the periods and a starting processor utilization of 0.3 for each of the real-time threads. We fix the deadlines for the impulse counter and the FFT and shorten the deadlines for the PID element until the first missed deadline occurs. The priorities for FPP are assigned according to rate-monotonic analysis. The implementation of GP on the Komodo microcontroller defines three priority classes: exact, minimal, and non-real-time. The class exact causes a thread to meet the requested percentage exactly, not more and not less. In the case of minimal, a thread gets at least the requested percentage, but it may get more as well. Therefore, for GP the impulse counter and the FFT belong to the class exact and the PID element is in the class minimal. The start conditions are 30% of execution time for every real-time thread and 10% for the non-real-time thread.

Figure 6 shows the results of our experiment. It can be seen that again all scheduling algorithms profit from the multithreaded processor. It is remarkable that in this experiment LLF does not perform better than FPP or EDF. This, too, can be explained by the mixture of the threads: due to the highly differing execution times of the threads and corresponding deadlines, the thread with the least laxity remains the same over a long period. This leads to a similar behavior of LLF and EDF, resulting in nearly the same number of context switches for LLF and EDF (see Figure 7).
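The block-wise draining under EDF versus the interleaving under LLF described above can be reproduced with a toy cycle-by-cycle simulation. This is our own idealized sketch (zero-cost switches, equal deadlines, EDF ties broken by thread id — all assumptions): EDF runs the four equal threads one after another, while LLF re-evaluates laxity every cycle and rotates through them, keeping all threads alive until the common end.

```python
# Toy simulation (ours, idealized) of four equal threads under EDF vs. LLF.
# With equal deadlines, EDF's tie-break keeps one thread running until it
# drains; LLF's laxity shrinks while a thread waits, so the lead rotates.

def simulate(policy, n_threads=4, work=8, deadline=40):
    remaining = [work] * n_threads
    last = None
    switches = 0
    ready_counts = []          # pool size seen at each cycle
    for now in range(n_threads * work):
        ready = [i for i in range(n_threads) if remaining[i] > 0]
        ready_counts.append(len(ready))
        if policy == "EDF":
            pick = min(ready)  # equal deadlines: tie-break by thread id
        else:                  # LLF: laxity = (deadline - now) - remaining
            pick = min(ready, key=lambda i: (deadline - now) - remaining[i])
        if last is not None and pick != last:
            switches += 1
        last = pick
        remaining[pick] -= 1
    return switches, ready_counts

edf_switches, edf_ready = simulate("EDF")
llf_switches, llf_ready = simulate("LLF")
print(edf_switches)                  # 3: one switch per drained thread
print(llf_switches > edf_switches)   # True: LLF interleaves heavily
```

In this run EDF leaves only a single ready thread for the whole last quarter of the schedule (no latency-bridging candidates), while LLF keeps the full pool of four until shortly before the common deadline.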

[Figure 6: speed-ups of the mixed workload; FPP, EDF, and LLF reach about 1.70, GP only about 1.32. The legend distinguishes exact (FFT), exact (IC), minimal (PID), and the non-real-time share.]

Figure 6. Speed-ups of the workload with mixed application programs

[Figure 7: context switches in % of cycles: about 4.15 for FPP, 4.15 for EDF, 4.17 for LLF, and 14.95 for GP.]

Figure 7. Number of context switches of the different schedulers

The behavior of the GP scheduler is unexpected. Actually GP should be an ideal scheduler, because the threads in the class exact are held active until the deadline arrives: a thread that needs 10 ms of execution and has a deadline of 40 ms terminates with a share of 25% exactly at the given deadline. The drawback of GP in this experiment lies in the current implementation. The scheduler distributes the shares for the threads in intervals of 100 cycles. In each interval, a priority order determines the next thread: first the threads of the class exact are scheduled, in order of the needed cycles; then the class minimal and the non-real-time threads are taken for execution. Threads that are blocked or in latencies are excluded from the schedule. Figure 8 shows the principal sequence of executing the four given threads. As can be seen, after the termination of the two exact threads (after about 60 cycles, depending on the usage of the latencies) only the non-real-time thread can utilize the latencies of the PID element. Therefore the non-real-time thread gets many more cycles than under LLF or EDF, and the performance for handling real-time events goes down. In this case the number of executable threads always decreases towards the end of each interval, and even though the number of context switches is high, the mixture of threads is poor. This observation leads to the conclusion that many context switches are only a hint of a good scheduling algorithm on a multithreaded processor, not a guarantee. As seen, our current implementation of the GP scheduler still has some drawbacks, and the selection of threads within an interval may be improved.

[Figure 8: execution order of the four threads over one 100-cycle interval; the two exact threads finish after about 60 cycles.]

Figure 8. Thread execution within an interval

Another essential point is the overhead introduced by the various scheduling techniques. To reach a zero-cycle context switch on a multithreaded processor, the scheduler must decide within a single processor cycle which instruction to issue next. The prototype implementation of the Komodo microcontroller in an FPGA showed that FPP generates by far the smallest implementation cost; GP and EDF follow with similar costs, and the highest implementation cost is introduced by LLF. GP and LLF profit from the ability of fast context switching, yielding good performance results when a zero-cycle context switching overhead is assumed. These strategies produce a high number of context switches, which allows an excellent blending of threads and therefore an optimal latency utilization. However, the performance of these strategies deteriorates quickly when context switching costs increase.
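The interval-based GP implementation described above can be sketched as follows. This is our reconstruction under stated assumptions: the "needed cycles" ordering within class exact is approximated here by the requested share, and leftover cycles in an interval go to the first non-exact thread; neither detail is specified in the text.

```python
# Sketch (our reconstruction) of the GP interval scheduler: shares are
# distributed in 100-cycle intervals; class "exact" threads are budgeted
# first, then "minimal", then non-real-time ("nrt") threads. Exact threads
# must not exceed their share; the others may absorb leftover cycles.
INTERVAL = 100

def plan_interval(threads):
    """threads: list of (name, cls, share_percent) with cls in
    {'exact', 'minimal', 'nrt'}; returns the cycle budget per thread."""
    order = {"exact": 0, "minimal": 1, "nrt": 2}
    budget = {}
    left = INTERVAL
    for name, cls, share in sorted(threads, key=lambda t: (order[t[1]], t[2])):
        give = min(share * INTERVAL // 100, left)
        budget[name] = give
        left -= give
    # minimal and non-real-time threads may consume leftover cycles
    for name, cls, share in threads:
        if cls != "exact" and left > 0:
            budget[name] += left
            left = 0
    return budget

# The experiment's start conditions: 30% per real-time thread, 10% nrt.
b = plan_interval([("IC", "exact", 30), ("FFT", "exact", 30),
                   ("PID", "minimal", 30), ("bg", "nrt", 10)])
print(b)  # {'IC': 30, 'FFT': 30, 'PID': 30, 'bg': 10}
```

The drawback discussed above shows up in this structure: once the exact threads have consumed their budgets early in the interval, only the remaining threads are left to fill latency slots, so the pool of executable threads shrinks towards the end of every interval.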

4 Conclusions

Multithreaded processors with the ability of very fast context switching offer a new challenge to real-time scheduling policies. First, scheduling strategies like EDF, LLF, and GP may be implemented without thread switching overhead. Second, multithreaded processors may switch the context to another thread to increase performance by utilizing latencies caused by memory accesses or branch instructions. Latency utilization in a multithreaded processor can increase processor performance by over 100% compared to a non-multithreaded processor. However, latency utilization is an additional performance gain that cannot be guaranteed for hard real-time event handling.

To utilize latencies efficiently, a pool of executable instructions of different threads is needed. Classical real-time scheduling policies like EDF or FPP tend to thin out this pool by executing instructions of a thread block-wise: the most urgent thread first, then the second most urgent, and so on. This produces a minimal number of context switches, which is a good choice on conventional processors. On a multithreaded processor, however, not enough instructions of ready threads may remain to bridge occurring latencies. A real-time scheduling policy that is optimal in the sense of bridging latencies on multithreaded processors must keep a thread alive as long as possible; that is, the execution time of a thread must be extended to its deadline. So scheduling policies which produce more context switches, like LLF or GP, behave better. In fact, on a multithreaded processor a high number of context switches can be considered a sign of quality for a scheduling policy, because a thread mix is produced which keeps the threads alive as long as possible. But LLF and GP are still not optimal: LLF thins out the thread pool like EDF or FPP in the case of strongly differing deadlines, and GP may be a candidate, but its current implementation produces the same problem. So an optimal policy still has to be found.

The work described in this paper is considered a basis for further research on real-time scheduling on the upcoming generation of multithreaded processors. By modifying the well-known real-time scheduling policies, the architectural features of these processors can be used more efficiently.

References

[1] J. Emer. Simultaneous Multithreading: Multiplying Alpha's Performance. Microprocessor Forum 1999, San Jose, Ca., Oct. 1999.
[2] L. Gwennap. MAJC Gives VLIW a New Twist. Microprocessor Report, Vol. 13, No. 12, pp. 12–15, September 1999.
[3] C. L. Liu and J. W. Layland. Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. JACM, 20(1):46–61, 1973.
[4] U. Brinkschulte, C. Krakowski, J. Kreuzinger, R. Marston, and T. Ungerer. The Komodo Project: Thread-Based Event Handling Supported by a Multithreaded Java Microcontroller. 25th EUROMICRO Conference, Milano, September 1999.
[5] R. S. Chappell, J. Stark, S. P. Kim, S. K. Reinhardt, Y. N. Patt. Simultaneous Subordinate Microthreading (SSMT). ISCA-26 Proceedings, Atlanta, Georgia, Vol. 27, No. 2, pp. 186–195, May 1999.
[6] S. W. Keckler, A. Chang, W. S. Lee, W. J. Dally. Concurrent Event Handling through Multithreading. IEEE Transactions on Computers, Vol. 48, No. 9, pp. 903–916, September 1999.
[7] C. B. Zilles, J. S. Emer, G. S. Sohi. The Use of Multithreading for Exception Handling. MICRO-32, Haifa, November 1999, pp. 219–229.
[8] K. Lüth, A. Metzner, T. Peikenkamp, J. Risau. The EVENTS Approach to Rapid Prototyping for Embedded Control Systems. Zielarchitekturen eingebetteter Systeme, 14. ITG/GI Fachtagung Architektur von Rechnersystemen, Rostock, 1997.
[9] W. Damm, A. Mikschl. MSPARC: A Multithreaded SPARC. Euro-Par'96 Parallel Processing: Second International Euro-Par Conference, Vol. II, LNCS 1124, Springer Verlag, 1996.
[10] U. Brinkschulte, C. Krakowski, J. Kreuzinger, T. Ungerer. A Multithreaded Java Microcontroller for Thread-Oriented Real-Time Event Handling. 1999 International Conference on Parallel Architectures and Compilation Techniques (PACT '99), Newport Beach, Ca., pp. 34–39, October 1999.
[11] J. Kreuzinger, M. Pfeffer, A. Schulz, T. Ungerer, U. Brinkschulte, C. Krakowski. Performance Evaluations of a Multithreaded Java Microcontroller. PDPTA'00, Las Vegas, Nevada, USA, Vol. 1, pp. 95–99, June 2000.
[12] J. Kreuzinger, R. Zulauf, A. Schulz, T. Ungerer, M. Pfeffer, U. Brinkschulte, C. Krakowski. Performance Evaluations and Chip-Space Requirements of a Multithreaded Java Microcontroller. To be published.
