Adaptive Group Scheduling Mechanism using Mobile Agents in Peer-to-Peer Grid Computing Environment

SungJin Choi, MaengSoon Baik, ChongSun Hwang
Dept. of Computer Science & Engineering, Korea University
5-1 Anam-dong, Sungbuk-gu, Seoul 136-701, Republic of Korea
{lotieye, msbak, hwang}@disys.korea.ac.kr
FAX : +82-2-953-0771

JoonMin Gil
Supercomputing Center, Korea Institute of Science and Technology Information (KISTI)
[email protected]

SoonYoung Jung
Dept. of Computer Science Education, Korea University
[email protected]

Abstract

Peer to peer grid computing is an attractive computing paradigm for high throughput applications. However, both the volatility caused by the autonomy of volunteers (i.e., resource providers) and the heterogeneous properties of volunteers make scheduling challenging. It is therefore necessary to develop a scheduling mechanism that adapts to a dynamic peer to peer grid computing environment. In this paper, we propose a Mobile Agent based Adaptive Scheduling Mechanism (MAASM). The MAASM classifies volunteers and constructs volunteer groups so that scheduling is performed according to volunteer properties such as volunteer autonomy failures, volunteer availability, and volunteering service time. In addition, the MAASM exploits mobile agent technology to adaptively apply the scheduling, fault tolerance, and replication algorithms suited to each volunteer group. Furthermore, we demonstrate that the MAASM improves performance by evaluating it in Korea@Home.

Keywords Adaptive scheduling, volunteer group, mobile agent, peer to peer grid computing


1. Introduction

A grid computing system is a platform that provides access to various computing resources owned by institutions by forming virtual organizations [5, 6]. A peer to peer grid computing system¹, on the other hand, is a platform that achieves high throughput computing by harvesting a large number of idle desktop computers owned by individuals (called volunteers) at the edge of the Internet by means of peer to peer computing technologies [3-7, 9, 12, 14-16]. Peer to peer grid computing systems usually support embarrassingly parallel applications, which consist of many instances of the same computation, each with its own data. These applications usually address scientific problems that need large amounts of processing capacity over long periods of time. In recent years, interest in peer to peer grid computing systems has grown rapidly because of the success of popular examples such as SETI@Home [1] and distributed.net [2]. Several studies have addressed peer to peer grid computing systems that provide an underlying platform: Bayanihan [7], XtremWeb [9], Javelin [12], BOINC [15], Entropia [16], Condor [17], Korea@Home [36], and so on.

A peer to peer grid computing environment mainly consists of clients, volunteers, and volunteer servers, as shown in Fig. 1. A client is a parallel job submitter. A volunteer is a resource provider that donates its computing resources during idle time. A volunteer server is a central manager that controls submitted jobs and volunteers. A client submits a parallel job to a volunteer server. The job is divided into sub-jobs, each with its own input data. A sub-job is called a task in this paper; a task consists of parallel code and data. The volunteer server allocates the tasks to volunteers by using scheduling mechanisms. Each volunteer executes its task during idle time while continuously requesting data from the volunteer server.
When each volunteer finishes its task, it returns the result to the volunteer server. Finally, the volunteer server returns the final result of the job to the client.

[Figure 1. Peer to Peer Grid Computing Environment. Components shown: volunteers V0 ... Vn-1, the Internet, a volunteer server, a storage server, and clients. Labels in the figure: management of parallel tasks and volunteers; parallel code and data distribution (Data 0 ... Data m-1); applications; results of each volunteer. Numbered steps: (1) Registration, (2) Job submission, (3) Task allocation, (4) Task execution, (5) Task result return, (6) Job result return.]

Peer to peer grid computing is complicated by heterogeneous capabilities, failures, volatility (i.e., intermittent presence), and lack of trust [3-7, 9, 15, 16, 26] because it is based on desktop computers (i.e., volunteers) at the edge of the Internet. Volunteers have various capabilities (i.e., CPU, memory, network bandwidth, and latency) and are exposed to link and crash failures. In particular, volunteers are voluntary participants who receive no reward for donating their resources. As a result, they can freely join and leave in the middle of an execution without any constraints. Accordingly, they have various volunteering times (i.e., times of donation), and a public execution (i.e., the execution of a task as a volunteer) can be stopped arbitrarily by an unexpected leave. Moreover, public executions are temporarily suspended by private executions (i.e., the execution of a private job as a personal user) because volunteers are not dedicated solely to public executions. In this paper, we regard these unstable situations as volunteer autonomy failures because they lead to the delay and blocking of task execution and even to the partial or entire loss of executions. Each volunteer has a different occurrence rate of volunteer autonomy failures according to its execution behavior. Finally, some malicious volunteers tamper with the computation and then return corrupted results. These distinct features make it difficult for a volunteer server to schedule tasks and to manage the allocated tasks and volunteers.

¹ Peer to peer grid computing is also called volunteer computing [7], global computing [9, 12, 19], desktop grid computing [16], or public resource computing [15].

In order to improve the reliability of computation and performance in a peer to peer grid computing environment, a scheduling mechanism must adapt to the distinct features that result from the heterogeneous properties and volatility of volunteers. It is required to classify volunteers into groups with similar properties (especially volunteer autonomy failures) and then dynamically apply a different scheduling mechanism to each group. Existing peer to peer grid computing systems [7, 9, 12, 15-17], however, do not provide scheduling on a per group basis. In addition, scheduling is performed only by the volunteer server in a centralized way. As a result, existing mechanisms impose a high overhead on the computation and on the volunteer server, which degrades performance.

In this paper, we propose a Mobile Agent based Adaptive Scheduling Mechanism (MAASM) that adapts to a dynamic peer to peer grid computing environment. The MAASM classifies volunteers and constructs volunteer groups so that scheduling can be adapted to volunteer properties such as volunteer autonomy failures, volunteer availability, and volunteering service time. In addition, the MAASM exploits mobile agent technology to adaptively apply different scheduling, fault tolerance, and replication algorithms to volunteer groups. In the MAASM, mobile agents are distributed to volunteer groups. They then conduct the scheduling, fault tolerance, and replication algorithms in a distributed way without the direct control of a volunteer server. Consequently, the volunteer group based scheduling, fault tolerance, and replication algorithms can reduce the overhead, improve performance, and guarantee reliable computation.

Finally, we have evaluated the MAASM on the basis of Korea@Home [36-39] and the ODDUGI mobile agent system [42, 43]. The evaluation results show that the MAASM completes more tasks and reduces the computation and replication overhead compared with the existing scheduling mechanism.

The rest of the paper is structured as follows. Section 2 presents an overview of a peer to peer grid computing system and explains why mobile agents are used. Section 3 describes the execution model of the mobile agent based peer to peer grid computing system as well as the failure model. Section 4 describes the MAASM in detail. Section 5 discusses some issues. Section 6 presents the implementation issues and experimental results. Section 7 reviews related work in this area. Section 8 concludes the paper.

2. Background and Motivation

2.1. Existing Peer to Peer Grid Computing Model

The execution model of peer to peer grid computing consists of six phases, as shown in Fig. 2: registration, job submission, task allocation, task execution, task result return, and job result return.

• Registration phase: Volunteers register their information with a volunteer server.
• Job submission phase: A client consigns a job to a volunteer server.
• Task allocation phase: A volunteer server distributes tasks to the registered volunteers by means of a scheduling mechanism.
• Task execution phase: The volunteers execute their tasks.
• Task result return phase: Each volunteer returns the result of its task to the volunteer server.
• Job result return phase: The volunteer server returns the final result of the job to the client.

In Fig. 2, volunteers Vi (0 ≤ i ≤ n) register their volunteering information Ωi (i.e., computing resource properties) with a volunteer server and participate in the execution of tasks. When a client consigns a job Γ to a volunteer server, the volunteer server allocates the tasks Γm to volunteers. A volunteer Vi executes a task Γm and then returns the result Rm of that execution to its volunteer server. The volunteer server returns the final result R of the consigned job Γ to the client.
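The six phases can be illustrated with a minimal sketch. It is not the scheduling mechanism of any particular system: the class name `VolunteerServer`, the method names, and the round-robin allocation are assumptions made only for illustration, and volunteers are simulated in-process.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class VolunteerServer:
    """Hypothetical central manager used only to illustrate the six phases."""
    volunteers: List[str] = field(default_factory=list)

    def register(self, volunteer_id: str) -> None:
        # (1) Registration phase: a volunteer registers with the server.
        self.volunteers.append(volunteer_id)

    def allocate(self, task_count: int) -> Dict[int, str]:
        # (3) Task allocation phase. Round-robin is an assumption of this
        # sketch; the paper does not prescribe it.
        if not self.volunteers:
            raise RuntimeError("no registered volunteers")
        return {m: self.volunteers[m % len(self.volunteers)]
                for m in range(task_count)}

    def run_job(self, code: Callable[[int], int], data: List[int]) -> List[int]:
        # (2) Job submission phase: the job arrives as parallel code plus data.
        allocation = self.allocate(len(data))
        # (4) Task execution and (5) task result return, simulated locally:
        # each task m applies the same code to its own data item.
        results = {m: code(data[m]) for m in allocation}
        # (6) Job result return phase: results ordered by task index.
        return [results[m] for m in range(len(data))]


server = VolunteerServer()
for name in ("V0", "V1", "V2"):
    server.register(name)
final_result = server.run_job(lambda x: x * x, [1, 2, 3, 4])
```

In this sketch a single object makes every allocation decision, which is the centralized arrangement the following sections argue against.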

[Figure 2. Existing Peer to Peer Grid Computing Model. Time-sequence diagram among volunteers V1 ... Vn, the volunteer server, and the client. Phases shown along the time axis: participation phase (volunteer information Ω1 ... Ωn), job submission phase (job Γ), tasks allocation phase (tasks Γ1 ... Γm), tasks execution phase, results collection phase (task results R1 ... Rm), and job return phase (final result R).]

2.2. Why Mobile Agent?

Peer to peer grid computing is complicated by heterogeneous capabilities, failures, volatility, and lack of trust. Therefore, a scheduling mechanism must adapt to a dynamic environment in order to improve performance and the reliability of computation. Existing peer to peer grid computing systems, however, do not provide various scheduling, fault tolerance, and replication algorithms on a per group basis. In addition, their scheduling mechanisms are performed only by the volunteer server in a centralized way. As a result, existing systems suffer from high overhead and degraded performance. To solve these problems, we make use of mobile agent technology.

A mobile agent is a software program that migrates from one node to another while performing tasks on behalf of a user [40-43]. Mobile agents have the following benefits [40-43].

1) A mobile agent can reduce network load and latency: agents carrying the required services or data are dispatched to remote nodes, where the services or data are then executed locally.
2) A mobile agent can cope with frequent and intermittent disconnection. Once dispatched to a destination node, it no longer requires a direct connection with the user. It therefore acts on behalf of the user asynchronously and autonomously, even when the user (i.e., a mobile device) is disconnected from the network.
3) A mobile agent enables dynamic service customization and software deployment because it encapsulates services or protocols in a mobile entity.
4) A mobile agent can adapt to heterogeneous environments and dynamic changes because it is computer- and transport-independent and reacts autonomously to its current execution environment.

In this paper, mobile agent technology is exploited to make our scheduling mechanism adaptive to a dynamic peer to peer grid computing environment. Using mobile agents in such an environment has the following advantages.

1) Various scheduling mechanisms can be performed at the same time according to the properties of volunteers. For example, scheduling mechanisms can be implemented as mobile agents (i.e., scheduling mobile agents). After volunteers are classified into volunteer groups, the scheduling mobile agent most suitable for a volunteer group is assigned to that group according to its properties. An existing peer to peer grid computing system, however, cannot apply various scheduling mechanisms at the same time because only one scheduling mechanism is performed by a volunteer server in a centralized way.
2) A mobile agent can decrease the overhead of the volunteer server by performing scheduling, fault tolerance, and replication algorithms in a decentralized way. The scheduling mobile agents are distributed to volunteer groups, where they autonomously conduct these algorithms without the direct control of a volunteer server. Accordingly, the volunteer server no longer bears this overhead.
3) A mobile agent can adapt to a dynamic peer to peer grid computing environment. In such an environment, volunteers can join and leave at any time. In addition, they have heterogeneous properties such as capabilities (i.e., CPU, storage, or network bandwidth), location, availability, credibility, and so on, and these properties change over time. A mobile agent can run asynchronously and autonomously while coping with the changes. It can also tolerate volunteer autonomy failures by using the migration and replication functionalities that a mobile agent itself provides.

3. System Model

We describe a new execution model for mobile agent based peer to peer grid computing and a failure model for volunteer autonomy failures.

3.1. Mobile Agent based Peer to Peer Grid Computing Model

As mentioned in the previous section, a mobile agent can adapt to dynamic environmental changes as well as to the various properties of volunteers. In addition, since mobile agents are executed in a distributed way, they can decrease the overhead of the volunteer server. Therefore, we propose an overall execution model in which mobile agents are applied to peer to peer grid computing. Mobile agent based peer to peer grid computing works like the execution model of existing peer to peer grid computing, but several phases work differently, as shown in Fig. 3.

[Figure 3. Mobile agent based Peer to Peer Grid Computing Model. Time-sequence diagram among the volunteers of two volunteer groups (each with a deputy), the volunteer server, and the client. Phases shown along the time axis: registration phase, job submission phase (job Γ), tasks allocation phase (distribution of S-MA and T-MA), tasks execution phase, task result return phase, and job result return phase (final result R).]

In the registration phase, volunteers register basic properties such as CPU, memory, and OS type as well as additional properties such as volunteering time, volunteering service time, volunteer availability, volunteer autonomy failures, volunteer credibility, and so on. Since the additional properties relate to dynamic computation and execution, they are more important than the basic properties.

In the job submission phase, the submitted job is divided into a number of tasks. The tasks are implemented as mobile agents (i.e., task mobile agents: T-MA).

In the task allocation phase, the volunteer server no longer performs the entire scheduling mechanism. Instead, it helps scheduling mobile agents (S-MA) to perform the scheduling procedure. First, the volunteer server classifies volunteers and constructs the volunteer groups according to properties such as location, volunteer autonomy failures, volunteering service time, and volunteer availability. It then distributes scheduling mobile agents to the volunteer groups according to their properties. Finally, each scheduling mobile agent distributes task mobile agents to the members of its volunteer group.

In the task execution phase, a task mobile agent is executed in cooperation with its scheduling mobile agent, migrating to another volunteer or replicating itself in the presence of failures.

In the task result return phase, each task mobile agent returns its result to its scheduling mobile agent. When all task mobile agents have returned their results, the scheduling mobile agent aggregates the results and returns them to the volunteer server. In order to tolerate erroneous results, a majority voting or spot-checking mechanism is conducted in cooperation with the volunteer server.

In the job result return phase, the volunteer server returns the final result to the client once it has received all the results from the scheduling mobile agents.

In brief, the main differences between the existing execution model and the new one are as follows. 1) The new mobile agent based peer to peer grid computing model uses scheduling and task mobile agents. 2) It uses volunteer groups, which are constructed according to dynamic properties of volunteers such as volunteer autonomy failures, volunteering service time, availability, and credibility. 3) Various scheduling, fault tolerance, and replication algorithms are performed simultaneously in a decentralized way.

3.2. Failure Model

In a peer to peer grid computing environment, volunteers are connected through the Internet, so they are exposed to crash and link failures. In addition, since peer to peer grid computing is based on voluntary participants, it respects the autonomy of volunteers. In other words, volunteers can leave arbitrarily in the middle of a public execution, and they are allowed to start a private execution at any time, interrupting the public execution. Volunteer autonomy failures occur much more frequently than crash and link failures in a peer to peer grid computing environment. Therefore, they should be handled specially and distinguished from traditional failures. Moreover, volunteers differ in the occurrence rate and form of their volunteer autonomy failures. Since this heterogeneity affects the computation directly, a scheduling mechanism must take it into account in order to obtain better performance and guarantee reliable computation.

To this end, we first define volunteer autonomy failures conceptually, using the notation in Table 1. First, we categorize the join and leave patterns of a volunteer into expected join (EJ), expected leave (EL), unexpected join (UJ), and unexpected leave (UL).

Table 1. Notations

Symbol : Meaning
Vi : A volunteer (0 ≤ i ≤ n)
Γm : A task performed by a volunteer
ξi : Public execution of a task Γm at Vi
Iξi : Time interval of public execution ξi
Υ : Volunteering time, the period when a volunteer is supposed to provide its resources
Υst : The start time when a volunteer Vi is supposed to provide its resources
Υtt : The termination time when a volunteer Vi is supposed to provide its resources
Vi ↷ ξi : The join event in which a volunteer Vi participates in public execution ξi
Vi ↶ ξi : The leave event in which a volunteer Vi leaves public execution ξi
T[Vi ↷ ξi] : The time when Vi ↷ ξi happens
Πi : An individual job performed by a personal user at Vi
πi : Private execution of an individual job Πi
≜ : The symbol means "occurs when"

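\[
\begin{aligned}
EJ &\triangleq (T[V_i \curvearrowright \xi_i] = V_i.\Upsilon_{st}) \\
EL &\triangleq (T[V_i \curvearrowleft \xi_i] = V_i.\Upsilon_{tt}) \\
UJ &\triangleq (T[V_i \curvearrowright \xi_i] \neq V_i.\Upsilon_{st}) \\
UL &\triangleq (T[V_i \curvearrowleft \xi_i] \neq V_i.\Upsilon_{tt})
\end{aligned}
\]

Unexpected join UJ is categorized into before-unexpected-join UJ^b, middle-unexpected-join UJ^m, and after-unexpected-join UJ^a. Likewise, unexpected leave UL is categorized into before-unexpected-leave UL^b, middle-unexpected-leave UL^m, and after-unexpected-leave UL^a.

\[
\begin{aligned}
UJ &= \{UJ^{b}, UJ^{m}, UJ^{a}\} \\
UJ^{b} &\triangleq (T[V_i \curvearrowright \xi_i] < V_i.\Upsilon_{st}) \\
UJ^{m} &\triangleq (V_i.\Upsilon_{st} < T[V_i \curvearrowright \xi_i] < V_i.\Upsilon_{tt}) \\
UJ^{a} &\triangleq (V_i.\Upsilon_{tt} < T[V_i \curvearrowright \xi_i]) \\
UL &= \{UL^{b}, UL^{m}, UL^{a}\} \\
UL^{b} &\triangleq (T[V_i \curvearrowleft \xi_i] < V_i.\Upsilon_{st}) \\
UL^{m} &\triangleq (V_i.\Upsilon_{st} < T[V_i \curvearrowleft \xi_i] < V_i.\Upsilon_{tt}) \\
UL^{a} &\triangleq (V_i.\Upsilon_{tt} < T[V_i \curvearrowleft \xi_i])
\end{aligned}
\]

Volunteer autonomy failures (Λ) are classified into volunteer volatility failure (Φ) and volunteer interference failure (Ψ). They are defined as follows.

\[
\Lambda = \{\Phi, \Psi\}
\]

Definition 1 (Volunteer volatility failure) Volunteer volatility failure Φ is the abortion of a public execution, caused by a volunteer freely leaving the public execution ξi of a task Γi.

\[
\Phi \triangleq T[V_i \curvearrowleft \xi_i] \in I_{\xi_i}
\]

The volunteer volatility failure is categorized as follows: unexpected-before Φ^b, unexpected-middle Φ^m, expected Φ^e, and unexpected-after Φ^a.

\[
\begin{aligned}
\Phi &= \{\Phi^{b}, \Phi^{m}, \Phi^{e}, \Phi^{a}\} \\
\Phi^{b} &\triangleq (T[V_i \curvearrowleft \xi_i] \in I_{\xi_i}) \vee (T[V_i \curvearrowleft \xi_i] < V_i.\Upsilon_{st}) \\
\Phi^{m} &\triangleq (T[V_i \curvearrowleft \xi_i] \in I_{\xi_i}) \vee (V_i.\Upsilon_{st} < T[V_i \curvearrowleft \xi_i] < V_i.\Upsilon_{tt}) \\
\Phi^{e} &\triangleq (T[V_i \curvearrowleft \xi_i] \in I_{\xi_i}) \vee (T[V_i \curvearrowleft \xi_i] = V_i.\Upsilon_{st}) \\
\Phi^{a} &\triangleq (T[V_i \curvearrowleft \xi_i] \in I_{\xi_i}) \vee (V_i.\Upsilon_{tt} < T[V_i \curvearrowleft \xi_i])
\end{aligned}
\]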
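The join and leave categories above amount to comparing an event time with the volunteering window [Υst, Υtt]. The following is a minimal sketch of that comparison. The function names, the plain numeric timestamps, and the closed execution interval are assumptions of this sketch. The boundary cases that the strict inequalities leave open (a join exactly at Υtt, a leave exactly at Υst) are returned as the unsubdivided labels "UJ" and "UL".

```python
def classify_join(t_join: float, start: float, end: float) -> str:
    """Classify a join event against the volunteering window [start, end]."""
    if t_join == start:
        return "EJ"    # expected join
    if t_join < start:
        return "UJ_b"  # before-unexpected-join
    if t_join < end:
        return "UJ_m"  # middle-unexpected-join: start < t_join < end
    if t_join > end:
        return "UJ_a"  # after-unexpected-join
    return "UJ"        # t_join == end: unexpected, no subcategory defined


def classify_leave(t_leave: float, start: float, end: float) -> str:
    """Classify a leave event against the volunteering window [start, end]."""
    if t_leave == end:
        return "EL"    # expected leave
    if t_leave < start:
        return "UL_b"  # before-unexpected-leave
    if t_leave > end:
        return "UL_a"  # after-unexpected-leave
    if t_leave > start:
        return "UL_m"  # middle-unexpected-leave: start < t_leave < end
    return "UL"        # t_leave == start: unexpected, no subcategory defined


def is_volatility_failure(t_leave: float, exec_start: float, exec_end: float) -> bool:
    """A leave inside the public-execution interval aborts the execution.

    Treating the interval as closed is an assumption of this sketch.
    """
    return exec_start <= t_leave <= exec_end
```

For example, with a volunteering window from hour 9 to hour 18, a volunteer that leaves at hour 12 while a task runs from hour 10 to hour 14 produces a middle-unexpected-leave that is also a volatility failure.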

Definition 2 (Volunteer interference failure) Volunteer interference failure Ψ is the temporary suspension of a public execution ξi caused by the private execution πi of an individual job Πi.

\[
\Psi \triangleq (T[\pi_i] \in I_{\xi_i})
\]

Volunteer interference failure Ψ is categorized into expected Ψei and unexpected Ψui. Ψei occurs when a private execution interferes with the public execution regularly (e.g., scheduled virus checking), whereas Ψui occurs when a private execution triggered by keyboard or mouse activity interferes with the public execution irregularly (e.g., briefly checking email).

Φ and Ψ differ from crash failure in that the operating system stays alive in the presence of Φ and Ψ, whereas it shuts down in the presence of a crash failure [33, 34, 38]. Φ also differs from crash failure in that Φ occurs by the will of the volunteer [33, 34, 38]. Ψ differs from Φ in that the peer to peer grid computing system stays alive in the presence of Ψ, whereas it is not operating in the case of Φ.

Φ is related to the completion of public execution. For example, if a leave event happens arbitrarily in the middle of a public execution, the execution is stopped (or aborted) and is therefore not completed. That is, Φ hinders the completion of execution. Ψ is related to the continuity of public execution. For example, if a personal user frequently performs private executions in the middle of a public execution, the public execution is temporarily suspended and cannot proceed continuously. That is, Ψ obstructs the continuity of execution.

4. Mobile Agent based Adaptive Scheduling Mechanism

The MAASM provides a scheduling mechanism on the basis of volunteer groups. It also exploits mobile agents to adaptively apply different scheduling, fault tolerance, and replication algorithms to each volunteer group. In this section, we first illustrate how to construct volunteer groups according to the properties of volunteers. Then, we introduce how to apply scheduling, fault tolerance, and replication algorithms to volunteer groups by means of mobile agents. Finally, we illustrate how to manage volunteer groups in case of failures.

4.1. Constructing Volunteer Groups

A volunteer group is a set of volunteers that have similar properties such as volunteer autonomy failures, volunteer availability, and volunteering service time. In order to apply the scheduling mechanism suited to the properties of volunteers, volunteers must be gathered into homogeneous groups. We first classify volunteers according to their properties and then classify and construct volunteer groups.

4.1.1 Classifying Volunteers

When volunteers are classified, their CPU, memory, storage, and network capacities are important factors. The most important factors, however, are location, volunteering time, volunteer autonomy failures, volunteer availability, and volunteer credibility, as shown in Fig. 4. The completion and continuity of computation depend closely on volunteering time and volunteer availability, which result from the volatility of volunteers, and the reliability of results depends on the credibility of volunteers. In a peer to peer grid computing environment, the capacities of desktop computers are similar, whereas volunteering service time, availability, and credibility vary widely [8, 26, 36]. In this paper, we concentrate on volunteering service time, volunteer autonomy failures, and volunteer availability when we classify volunteers. This paper is not concerned with credibility, which relates to result certification for detecting and tolerating erroneous results.

[Figure 4. The classification criteria of volunteers. Three axes: Volunteer Availability, Volunteering Service Time, and Volunteer Credibility. Labels in the figure: devoted, selfish, busy, idle, malicious, trustworthy.]

The volunteering time and volunteer availability are defined as follows.

Definition 3 (Volunteering time) Volunteering time (Υ) is the period during which a volunteer is supposed to donate its resources.

\[
\Upsilon = \Upsilon_R + \Upsilon_S
\]

Here, the reserved volunteering time (ΥR) is the reserved time during which a volunteer provides its computing resources. A volunteer mostly performs public execution during ΥR and rarely private execution. The selfish volunteering time (ΥS), on the other hand, is unexpected volunteering time. A volunteer usually performs private execution during ΥS and only sometimes public execution.

Definition 4 (Volunteer availability) Volunteer availability (αv) is the probability that a volunteer will operate correctly and be able to deliver the volunteer services during the volunteering time Υ.

\[
\alpha_v = \frac{\mathit{MTTVAF}}{\mathit{MTTVAF} + \mathit{MTTR}}
\]

Here, MTTVAF means "mean time to volunteer autonomy failures", i.e., the average time before a volunteer autonomy failure happens, and MTTR means "mean time to rejoin", i.e., the mean duration of volunteer autonomy failures. αv reflects the degree of volunteer autonomy failures, whereas traditional availability in distributed systems is mainly related to crash failures.

MTTVAF and MTTR are recalculated dynamically whenever a volunteer detects Φ or Ψ. In the update rules below, MVT represents "mean volunteering time", the symbol ⋈ represents the combination of two events, and the symbol ⊎ represents the union of time intervals. The parameter µ is a weight constant. When a volunteer starts executing a task, µ is set to 1. It is increased whenever Φ or Ψ happens and reset to 1 when the volunteer finishes its task.
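As a worked illustration of Definition 4, the availability ratio can be computed as below. The function name and the choice of hours as the time unit are assumptions of this sketch; any consistent unit works because the ratio is dimensionless.

```python
def volunteer_availability(mttvaf: float, mttr: float) -> float:
    """Return alpha_v = MTTVAF / (MTTVAF + MTTR).

    mttvaf: mean time to volunteer autonomy failures.
    mttr:   mean time to rejoin (mean duration of autonomy failures).
    """
    total = mttvaf + mttr
    if total <= 0:
        raise ValueError("MTTVAF + MTTR must be positive")
    return mttvaf / total


# A volunteer that runs 6 hours on average before an autonomy failure
# and needs 2 hours on average to rejoin.
alpha_v = volunteer_availability(6.0, 2.0)
```

With these illustrative numbers the volunteer is expected to be usable for three quarters of its volunteering time.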

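Case 1: UJ^b, Φ^b, or Φ^a

\[
\begin{aligned}
\mathit{MTTVAF} &= \mathit{MTTVAF} + \mu \times \frac{\{I(UJ^{b} \bowtie EJ) \uplus I(UJ^{b} \bowtie \Phi^{b}) \uplus I(EL \bowtie \Phi^{a})\}}{\mathit{MVT}} \\
\mathit{MTTR} &= \mathit{MTTR} - \mu \times \frac{\{I(UJ^{b} \bowtie EJ) \uplus I(UJ^{b} \bowtie \Phi^{b}) \uplus I(EL \bowtie \Phi^{a})\}}{\mathit{MVT}} \\
\mathit{MVT} &= \mathit{MVT} + \mu \times \frac{\{I(UJ^{b} \bowtie EJ) \uplus I(UJ^{b} \bowtie \Phi^{b}) \uplus I(EL \bowtie \Phi^{a})\}}{\mathit{MVT}}
\end{aligned}
\]

Case 2: UJ^m or Φ^m

\[
\begin{aligned}
\mathit{MTTVAF} &= \mathit{MTTVAF} - \mu \times \frac{\{I(EJ \bowtie UJ^{m}) \uplus I(\Phi^{m} \bowtie EL)\}}{\mathit{MVT}} \\
\mathit{MTTR} &= \mathit{MTTR} + \mu \times \frac{\{I(\Phi^{m} \bowtie UJ^{m})\}}{\mathit{MVT}} \\
\mathit{MVT} &= \mathit{MVT} - \mu \times \frac{\{I(EJ \bowtie UJ^{m}) \uplus I(\Phi^{m} \bowtie EL)\}}{\mathit{MVT}}
\end{aligned}
\]

Case 3: Ψ^{ei} or Ψ^{ui}

\[
\begin{aligned}
\mathit{MTTVAF} &= \mathit{MTTVAF} - \mu \times \frac{\{I_{\Psi^{ei}} \uplus I_{\Psi^{ui}}\}}{\mathit{MVT}} \\
\mathit{MTTR} &= \mathit{MTTR} + \mu \times \frac{\{I_{\Psi^{ei}} \uplus I_{\Psi^{ui}}\}}{\mathit{MVT}} \\
\mathit{MVT} &= \mathit{MVT} - \mu \times \frac{\{I_{\Psi^{ei}} \uplus I_{\Psi^{ui}}\}}{\mathit{MVT}}
\end{aligned}
\]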
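The three update rules above share one shape: a weighted interval length divided by MVT is added to or subtracted from each statistic. The sketch below applies them literally. The names `AvailabilityStats` and `update_stats` are hypothetical, the caller is assumed to have already reduced the union of intervals to a single length, and no clamping is applied because the rules specify none. In Case 2 the MTTR rule uses a different interval, which is passed as `mttr_interval`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AvailabilityStats:
    mttvaf: float  # mean time to volunteer autonomy failures
    mttr: float    # mean time to rejoin
    mvt: float     # mean volunteering time

    @property
    def availability(self) -> float:
        return self.mttvaf / (self.mttvaf + self.mttr)


def update_stats(stats: AvailabilityStats, case: int, interval: float,
                 mu: float = 1.0,
                 mttr_interval: Optional[float] = None) -> AvailabilityStats:
    """Apply the Case 1, 2, or 3 update with weight mu.

    interval:      total length of the interval union used for MTTVAF and MVT.
    mttr_interval: length used for MTTR; defaults to `interval`
                   (it differs only in Case 2).
    """
    if case not in (1, 2, 3):
        raise ValueError("case must be 1, 2, or 3")
    if mttr_interval is None:
        mttr_interval = interval
    # Case 1 adds unexpected volunteering time; Cases 2 and 3 remove time.
    sign = 1.0 if case == 1 else -1.0
    step = mu * interval / stats.mvt
    mttr_step = mu * mttr_interval / stats.mvt
    return AvailabilityStats(
        mttvaf=stats.mttvaf + sign * step,
        mttr=stats.mttr - sign * mttr_step,
        mvt=stats.mvt + sign * step,
    )


before = AvailabilityStats(mttvaf=6.0, mttr=2.0, mvt=8.0)
gained = update_stats(before, case=1, interval=4.0)          # early join
lost = update_stats(before, case=3, interval=4.0, mu=2.0)    # interference
```

In this example availability rises from 0.75 to 0.8125 after the Case 1 update and falls to 0.625 after the Case 3 update with a doubled weight, matching the direction the rules are meant to produce.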

Cases 1 and 2 describe how to calculate volunteer availability in the case of volunteer volatility failure and unexpected join. Case 3 describes how to calculate volunteer availability when a volunteer interference failure occurs. The parameter µ is used to reflect the rate of volunteer autonomy failures in volunteer availability. For example, if volunteer autonomy failures occur repeatedly and frequently, the volunteer availability drops rapidly. Moreover, the mean volunteering time affects the volunteer availability. For example, if the mean volunteering time is short, the volunteer availability is considerably affected by volunteer autonomy failures. In Case 1, volunteer availability increases because unexpected volunteering time is provided. In Cases 2 and 3, on the other hand, volunteer availability decreases because of volunteer autonomy failures.

Volunteers are categorized into region volunteers or home volunteers according to their location. Home volunteers are resource donors at home. Region volunteers are resource donors that are generally affiliated with organizations such as universities and institutions. Region volunteers are connected through a LAN or an intranet, whereas home volunteers are connected through the Internet.

Volunteers are categorized into four classes according to Υ and αv, as shown in Fig. 5. Class A is the set of volunteers that have long Υ and high αv. Class B is the set of volunteers that have short Υ and high αv. Class C is the set of volunteers that have long Υ and low αv. Class D is the set of volunteers that have short Υ and low αv.

[Figure 5. The classification of volunteers. Quadrant chart with Υ on the horizontal axis and αv on the vertical axis: A (high quality), B (intermediate quality), C (intermediate quality), D (low quality).]
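The four-way split by Υ and αv can be sketched as a pair of threshold tests. The source does not state where "long" and "high" begin, so both thresholds below are explicit parameters and the values in the example are purely illustrative assumptions, as is the function name.

```python
def volunteer_class(volunteering_time: float, availability: float,
                    long_time_threshold: float,
                    high_availability_threshold: float) -> str:
    """Return "A", "B", "C", or "D" from volunteering time and availability.

    Both thresholds are assumptions supplied by the caller; the
    classification itself only says long/short and high/low.
    """
    is_long = volunteering_time >= long_time_threshold
    is_high = availability >= high_availability_threshold
    if is_high:
        return "A" if is_long else "B"
    return "C" if is_long else "D"


# Illustrative thresholds: 8 hours of volunteering time, availability 0.7.
example_classes = [
    volunteer_class(10.0, 0.9, 8.0, 0.7),
    volunteer_class(4.0, 0.9, 8.0, 0.7),
    volunteer_class(12.0, 0.5, 8.0, 0.7),
    volunteer_class(3.0, 0.4, 8.0, 0.7),
]
```

Keeping the thresholds as parameters lets the home and region populations be classified with different cut-offs if their distributions differ.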

4.1.2 Classifying and Making Volunteer Groups

A volunteer server selects volunteers as volunteer group members according to volunteer properties such as location, volunteer availability, and volunteering service time. Volunteering service time is defined as follows.

Definition 5 (Volunteering service time) Volunteering service time (Θ) is the expected service time during which a volunteer participates in public execution within Υ.

\[
\Theta = \Upsilon \times \alpha_v
\]

In a scheduling procedure, Θ is more appropriate than Υ because Θ represents the time during which a volunteer actually executes tasks in the presence of volunteer autonomy failures Λ. Therefore, volunteer groups are constructed according to Θ, not Υ. When volunteer groups are constructed on the basis of location, region volunteers are placed in the same group and home volunteers are placed in the same group in order to reduce the communication cost between members. When both αv and Θ are considered in grouping the volunteers, the volunteer groups are categorized into four classes, as shown in Fig. 6. Here, ∆ is the expected computation time of a task.

[Figure 6. The classification of volunteer groups. Quadrant chart with Θ on the horizontal axis (divided at ∆) and αv on the vertical axis: A’ (high quality), B’ (low-intermediate quality), C’ (high-intermediate quality), D’ (low quality).]

Volunteers are assigned to four volunteer groups: A’, B’, C’, and D’. Volunteers with high αv and Θ ≥ ∆ belong to group A’. Volunteers with high αv and Θ < ∆ belong to group B’. Volunteers with low αv and Θ ≥ ∆ belong to group C’. Volunteers with low αv and Θ < ∆ belong to group D’.

Volunteer groups are constructed by the volunteer group construction algorithm shown in Fig. 7. 1) The registered volunteers are classified into home or region volunteers by their location. 2) The home and region volunteers are each classified into classes A, B, C, and D by volunteering time and volunteer availability. 3) The volunteer groups are constructed according to volunteering service time and volunteer availability.

// Classify the registered volunteers (V) into home or region volunteers
ClassifyVolunteersByLocation(V);
// Classify the home and region volunteers into A, B, C, D classes, respectively
ClassifyVolunteers(V);
// Construct volunteer groups
if (Vi.Θ ≥ ∆) then                    // Vi: one of the classified volunteers
    if (Vi ∈ VA || Vi ∈ VB) then      // VA: A class, VB: B class
        Vi → VG_A’;                   // →: assign, VG_A’: A’ volunteer group
    else                              // Vi ∈ VC || Vi ∈ VD; VC: C class, VD: D class
        Vi → VG_C’;                   // VG_C’: C’ volunteer group
    fi;
else                                  // Vi.Θ < ∆
    if (Vi ∈ VA || Vi ∈ VB) then
        Vi → VG_B’;                   // VG_B’: B’ volunteer group
    else
        Vi → VG_D’;                   // VG_D’: D’ volunteer group
    fi;
fi;

Figure 7. Algorithm of volunteer group construction
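The classification rule can be illustrated with a minimal Python sketch. It follows the prose rule (high or low αv combined with Θ ≥ ∆ or Θ < ∆) and Definition 5. The names `Volunteer`, `classify`, and the cut-off `alpha_threshold` are hypothetical; the paper does not state which value separates high from low availability, so 0.8 is only an assumed placeholder.

```python
from dataclasses import dataclass


@dataclass
class Volunteer:
    name: str
    availability: float        # alpha_v, a value in [0, 1]
    volunteering_time: float   # Upsilon, in minutes

    @property
    def service_time(self) -> float:
        # Definition 5: Theta = Upsilon * alpha_v
        return self.volunteering_time * self.availability


def classify(v: Volunteer, delta: float, alpha_threshold: float = 0.8) -> str:
    """Return the volunteer group (A', B', C' or D') for one volunteer.

    alpha_threshold is an assumed cut-off for "high" availability;
    it is not a value given in the paper.
    """
    high_alpha = v.availability >= alpha_threshold
    enough_time = v.service_time >= delta
    if high_alpha:
        return "A'" if enough_time else "B'"
    return "C'" if enough_time else "D'"
```

For instance, with ∆ = 16 minutes, a volunteer with αv = 0.95 and Υ = 40 minutes has Θ = 38 minutes and falls into A’, while the same availability with Υ = 10 minutes gives Θ = 9.5 minutes and falls into B’.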

The volunteer groups have the following properties. The A’ volunteer group has Θ and αv high enough to execute tasks reliably. Its members are used as deputy volunteers that host the scheduling mobile agents. The B’ volunteer group has high αv, but low Θ. Its members cannot complete their tasks because of a lack of computation time. The C’ volunteer group has high Θ, but low αv. Its members have enough time to execute tasks. However, volunteer autonomy failures occur frequently in the middle of execution. Therefore, this group requires a fault tolerance mechanism to execute tasks reliably. The D’ volunteer group has low Θ and low αv. Its members do not have enough time to execute tasks. Moreover, volunteer autonomy failures occur frequently in the middle of execution.

Among the volunteer groups, the A’ and C’ volunteer groups mainly execute tasks because they have sufficient time. If task migration is allowed, the B’ volunteer group can be used as a migration target when A’ and C’ volunteers suffer from failures. Otherwise, the B’ volunteer group is not appropriate for task distribution because its volunteering service time is too short to complete a task. In this case, it executes tasks only for test, that is, to measure its properties. The D’ volunteer group gives rise to a high management cost because of its lack of time as well as its low volunteer availability. The D’ volunteer group also executes tasks only for test. If checkpointing is used, the B’ and D’ volunteer groups can be used to execute applications that are not time-critical.

4.1.3 Maintaining Volunteer Groups

The volunteer groups are maintained by means of three modes: task-based, time-based and count-based. In the task-based mode, volunteer groups are rebuilt whenever a task is completed. The time-based mode rebuilds volunteer groups at regular intervals as long as tasks remain to be scheduled. The count-based mode constructs volunteer groups when the number of participating volunteers is larger than or equal to a predefined number k. The value of k depends on the size of volunteer groups or the number of redundancy. The size of a volunteer group is mainly related to the maintenance cost (i.e., the scheduling and management cost of task mobile agents, fault tolerance, replication, etc.). The volunteer groups are kept until the scheduling agent cannot distribute tasks to members anymore. For example, if no member has enough time to execute a task, the volunteer groups are dismissed. The members of volunteer groups are partially replaced by others if a volunteer fails (the details are illustrated in subsection 4.3).

Table 2. The combination of volunteer groups

Combination | The number of allocated tasks | αv compensation | Θ compensation | Description
A’D’ & C’B’ | A’D’ ≃ C’B’ or A’D’ ≥ C’B’ | ○ | ○ | The tasks are distributed to each scheduling group. A’ compensates for D’, and C’ compensates for B’.
A’B’ & C’D’ | A’B’ ≃ C’D’ or A’B’ ≥ C’D’ | × | ○ | The tasks are distributed to each scheduling group. Both C’ and D’ have low αv, so they do not compensate αv.
A’C’ & B’D’ | A’C’ ≫ B’D’ | ○ | × | Tasks are mainly distributed to A’C’. Most tasks are completed in A’C’. Both B’ and D’ do not compensate Θ.

4.2. Allocating Scheduling Mobile Agents to Scheduling Groups

After constructing volunteer groups, a volunteer server allocates the scheduling mobile agents (S-MA) to volunteer groups. However, it is not practical to allocate S-MAs directly to the volunteer groups in a scheduling procedure because some volunteer groups cannot finish tasks reliably on their own. Therefore, it is necessary to build new scheduling groups by combining the volunteer groups with each other, as shown in Table 2. In Table 2, the first two combinations are more appropriate than the last one because the tasks are distributed to each scheduling group in the first two combinations, whereas the tasks are mainly distributed to the A’C’ scheduling group in the last combination. Besides, in the last combination, even if tasks are allocated to the B’D’ scheduling group, they are not completed because of insufficient time. Between the first two combinations, the first is more appropriate than the second because the B’ volunteer group is able to compensate for the C’ volunteer group with regard to availability in the first combination, whereas the C’ volunteer group does not compensate for the D’ volunteer group in the second combination.² Therefore, this paper focuses on the first combination in a scheduling procedure.

The S-MA is executed at a deputy volunteer. The deputy volunteer is selected by the algorithm shown in Fig. 8. The candidate deputy volunteers are ordered by volunteer availability and volunteering service time, as well as hard disk capacity and network bandwidth. Then, the deputy volunteers for scheduling groups are selected sequentially. After that, each S-MA is sent to the selected deputy volunteer.

² In the A’D’ or A’B’ scheduling group, since the A’ volunteer group has high availability and enough Θ, the A’ volunteer group compensates for the D’ volunteer group as well as the B’ volunteer group.


// DVS: deputy volunteer set
// CDVS: candidate deputy volunteer set
// TDVS: temporal deputy volunteer set
// CDVS ⊂ VG_A’
TDVS = OrderedBy(CDVS.αv);
// HC: hard disk capacity, NB: network bandwidth
DVS = OrderedBy(TDVS.(Θ + HC + NB));
// Pop the best deputy volunteer from DVS
DV = PopBestDV(DVS)

Figure 8. Algorithm of deputy volunteers selection
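As an illustration of the deputy selection in Fig. 8, the following Python sketch orders candidates and pops the best one. Two points are assumptions rather than statements from the paper: the two ordering passes are combined into one sort with αv as the primary key and Θ + HC + NB as the tie-breaker, and HC and NB are treated as scores already normalized to a scale comparable with Θ. The names `Candidate`, `order_deputies`, and `pop_best_deputy` are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    availability: float   # alpha_v
    service_time: float   # Theta
    disk: float           # HC, assumed to be a normalized score
    bandwidth: float      # NB, assumed to be a normalized score


def order_deputies(candidates: List[Candidate]) -> List[Candidate]:
    # Best candidate first: highest alpha_v, then highest Theta + HC + NB.
    return sorted(
        candidates,
        key=lambda c: (c.availability, c.service_time + c.disk + c.bandwidth),
        reverse=True,
    )


def pop_best_deputy(deputies: List[Candidate]) -> Candidate:
    # Remove and return the head of the ordered deputy volunteer set.
    return deputies.pop(0)
```

Calling `pop_best_deputy` repeatedly on the ordered list yields one deputy per scheduling group, which mirrors the sequential selection described in the text.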

4.3. Distributing Task Mobile Agents to Group Members

After the S-MAs are allocated to the scheduling groups, each S-MA distributes the task mobile agents (T-MA), which consist of parallel code and data, to the members of its scheduling group. Unlike existing peer to peer grid computing systems, the S-MAs perform different scheduling, fault tolerance, and replication algorithms according to the type of volunteer group.

The S-MA of the A’D’ scheduling group performs the scheduling algorithm as follows.
1) Order the A’ volunteer group by αv and then by Θ.
2) Distribute T-MAs to the ordered members of the A’ volunteer group.
3) If a T-MA fails, replicate the failed task to a new volunteer selected in the A’ volunteer group by means of the replication algorithm, or migrate it to a volunteer selected in the A’ or B’ volunteer group if task migration is allowed.

The S-MA of the C’B’ scheduling group performs the scheduling algorithm as follows.
1) Order the C’ and B’ volunteer groups by αv and then by Θ.
2) Distribute T-MAs to the ordered members of the C’ volunteer group.
3) If a T-MA fails, replicate the failed task to a new volunteer selected in the ordered C’ volunteer group, or migrate it to a volunteer selected in the B’ or C’ volunteer group.

Tasks are distributed first to the A’D’ scheduling group and then to the C’B’ scheduling group. Within a group, tasks are distributed first to the volunteers which have high αv and long Θ. In the scheduling algorithm, if checkpointing is not used, tasks are not allocated to the B’ and D’ volunteer groups, because they do not have enough time to finish a task reliably. In this case, the B’ and D’ volunteer groups execute tasks for test, that is, to measure their properties. For example, the tasks executed in the A’ and C’ volunteer groups are redistributed to the D’ and B’ volunteer groups, respectively. However, the B’ volunteer group can be used to assist the main volunteer groups (i.e., A’ or C’) if task migration is allowed.
For example, in the C’B’ scheduling group, the B’ volunteer group can be used to compensate for the C’ volunteer group with regard to volunteer availability. Suppose that a volunteer in the C’ volunteer group suffers from volunteer autonomy failures. If the volunteering time of a volunteer in the B’ volunteer group covers the duration of the volunteer autonomy failures at the failed volunteer, the suspended task can migrate to that volunteer in the B’ volunteer group.

If replication is used, an S-MA calculates the number of redundancy and then selects replicas (i.e., volunteers to execute the replicated computation). After that, the S-MA distributes the T-MAs to the selected replicas. In case of failures, the S-MA replicates or migrates the failed T-MA to a new volunteer. The replication and fault tolerance algorithms are described in detail in subsections 4.4 and 4.5, respectively.
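The ordering and distribution steps of this subsection can be sketched in Python as follows. This is only an illustration under simplifying assumptions: each member receives at most one T-MA, members are plain `(name, alpha_v, theta)` tuples, and a failed task is simply handed to the next idle member of the same ordered group. The function names `order_members`, `distribute`, and `handle_failure` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Member = Tuple[str, float, float]  # (name, alpha_v, theta)


def order_members(members: List[Member]) -> List[Member]:
    # Step 1: order by alpha_v, then by Theta, best first.
    return sorted(members, key=lambda m: (m[1], m[2]), reverse=True)


def distribute(tasks: List[str], members: List[Member]):
    # Step 2: hand one T-MA to each ordered member until tasks run out.
    ordered = order_members(members)
    assignment: Dict[str, str] = {t: m[0] for t, m in zip(tasks, ordered)}
    idle = [m[0] for m in ordered[len(assignment):]]
    return assignment, idle


def handle_failure(task: str, assignment: Dict[str, str],
                   idle: List[str]) -> Optional[str]:
    # Step 3 (simplified): move the failed task to the next idle member.
    if not idle:
        return None
    assignment[task] = idle.pop(0)
    return assignment[task]
```

In a fuller model the idle list for a C’ task would also include B’ members when migration is allowed, as step 3 of the C’B’ algorithm describes.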

4.4. Applying Adaptive Replication Algorithm

Replication is a well-known technique to improve reliability and performance in distributed systems [33, 34]. In a peer to peer grid computing environment, replication is mainly used for reliability to tolerate failures or for result certification to detect and tolerate erroneous results [8, 11, 15, 28-32]. This paper focuses on replication for reliability to tolerate volunteer autonomy failures. Our adaptive replication algorithm automatically adjusts the number of replicas and selects appropriate replicas according to the properties of each volunteer group.

4.4.1 How to calculate the number of redundancy

If replication is used, each S-MA calculates the number of redundancy for its own volunteer group. It exploits volunteer autonomy failures, volunteer availability, and volunteering service time simultaneously when calculating the number of redundancy. In a peer to peer grid computing environment, volunteer autonomy failures occur much more frequently than crash and link failures. In addition, volunteers exhibit various rates and forms of volunteer autonomy failures. Therefore, we must calculate the number of redundancy on the basis of volunteer groups which have similar rates and forms of volunteer autonomy failures in order to reduce the overhead of replication. However, existing replication algorithms [8, 11, 15, 28-32] do not consider volunteer group based replication. Our adaptive replication algorithm makes use of volunteer autonomy failures, volunteer availability, and volunteering service time as follows.

The number of redundancy r for reliability is calculated by Eq. 1. Here, we assume that the lifetime of a system is exponentially distributed [33-35]. τ represents the MTTVAF of a volunteer, and τ′ represents the MTTVAF of a volunteer group.

$$1 - \left(1 - e^{-\Delta/\tau'}\right)^{r} \ge \gamma \qquad (1)$$

The parameter γ is the reliability threshold.

$$\tau' = (V_0.\tau + V_1.\tau + \cdots + V_n.\tau)/n$$

Here, n is the total number of volunteers within a volunteer group, and $V_n.\tau$ is the τ of a volunteer $V_n$.

In Eq. 1, the expression $e^{-\Delta/\tau'}$ (see footnote 3) represents the reliability of each volunteer group, that is, the probability of completing the tasks within ∆. It reflects the volunteer autonomy failures. The term $(1 - e^{-\Delta/\tau'})^{r}$ is the probability that all replicas fail to complete the replicated tasks. The value of r is calculated from Eq. 1 once the required reliability γ is given. Each volunteer group has a different number of redundancy. For example, the A’ and C’ volunteer groups have a smaller number of redundancy than the B’ volunteer group.

³ If the lifetime of a volunteer is exponentially distributed, then the reliability of the volunteer R(t) is $R(t) = e^{-\lambda' t}$ [32-34], where λ′ is the failure-rate parameter.
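Solving Eq. 1 for r gives $r \ge \ln(1-\gamma) / \ln(1 - e^{-\Delta/\tau'})$, so the smallest admissible r is the ceiling of that ratio. The Python sketch below computes it. The function names are hypothetical, the group MTTVAF is taken as the plain mean of the member values (an assumption about how the averaging is carried out), and the inputs are assumed to satisfy ∆ > 0, τ′ > 0 and 0 < γ < 1.

```python
import math
from typing import Sequence


def group_mttvaf(taus: Sequence[float]) -> float:
    # tau' as the mean MTTVAF over the members of one volunteer group.
    return sum(taus) / len(taus)


def redundancy(delta: float, tau_group: float, gamma: float) -> int:
    """Smallest integer r with 1 - (1 - exp(-delta/tau'))**r >= gamma."""
    if delta <= 0 or tau_group <= 0 or not 0 < gamma < 1:
        raise ValueError("expected delta > 0, tau' > 0 and 0 < gamma < 1")
    p_fail = 1.0 - math.exp(-delta / tau_group)   # one replica misses delta
    r = max(1, math.ceil(math.log(1.0 - gamma) / math.log(p_fail)))
    # Guard against floating-point rounding at the boundary.
    while 1.0 - p_fail ** r < gamma:
        r += 1
    return r
```

As a worked case with assumed values ∆ = 16 minutes and γ = 0.8: a group with τ′ = 50 minutes needs r = 2, whereas a group with τ′ = 20 minutes needs r = 3, which illustrates why groups with more frequent volunteer autonomy failures require more replicas.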

4.4.2 How to distribute T-MAs to replicas

The methods to distribute tasks to replicas are categorized into two approaches: parallel distribution and sequential distribution, as shown in Fig. 9.

(a) Parallel distribution
V0: Tm, Tm+1, Tm+2
V1: Tm, Tm+1, Tm+2
V2: Tm, Tm+1, Tm+2

(b) Sequential distribution
V0: Tm, Tm+1, Tm+2
V1: Tm+1, Tm+2, Tm
V2: Tm+2, Tm, Tm+1

Figure 9. Parallel and sequential distribution

In Fig. 9, the replicas consist of volunteers V0, V1 and V2 (that is, r = 3). In the parallel distribution, the task (Tm) is distributed to all members at the same time, as in Fig. 9 (a), and then executed simultaneously. In the sequential distribution, the task (Tm) is distributed and then executed sequentially, as in Fig. 9 (b). In the case of the A’ volunteer group, sequential distribution is more appropriate than parallel distribution because the former can complete more tasks. For example, in Fig. 9 (b), if V0 completes the task Tm, there is no need to execute it at V1 and V2. The A’ volunteer group has a high possibility of executing a task reliably without failures (especially volunteer autonomy failures) because of its high volunteer availability. However, if the A’ volunteer group performs parallel distribution as in Fig. 9 (a), it incurs replication overhead in the sense that the volunteers execute only the same task even though they are able to execute other tasks. In contrast to the A’ volunteer group, in the case of the C’ volunteer group, parallel distribution is more appropriate than sequential distribution because the C’ volunteer group frequently suffers from volunteer autonomy failures owing to low αv.
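The two distribution patterns of Fig. 9 can be written down as per-replica execution orders. In this minimal Python sketch (function names are hypothetical), parallel distribution gives every replica the same order, and sequential distribution rotates the task list by one position per replica so that each replica starts with a different task.

```python
from typing import List


def parallel_schedule(tasks: List[str], r: int) -> List[List[str]]:
    # Every replica executes the tasks in the same order.
    return [list(tasks) for _ in range(r)]


def sequential_schedule(tasks: List[str], r: int) -> List[List[str]]:
    # Replica i starts at task i and wraps around (rotation by i).
    n = len(tasks)
    return [[tasks[(i + k) % n] for k in range(n)] for i in range(r)]
```

With three tasks and r = 3, `sequential_schedule` reproduces the layout of Fig. 9 (b): in the first time slot the three replicas work on three different tasks, so a task that V0 completes never needs to be started on V1 or V2.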

4.5. Handling Failures

Volunteer autonomy failures lead to the delay and blocking of the execution of tasks. They occur much more frequently than crash and link failures in a peer to peer grid computing environment. Moreover, volunteers exhibit various occurrence rates and forms of volunteer autonomy failures. A peer to peer grid system is required to conduct different fault tolerance algorithms in scheduling procedures according to the occurrence rate and form. To do so, we apply different fault tolerance algorithms according to the property of each volunteer group while distinguishing volunteer autonomy failures from the traditional failures. This subsection describes how the scheduling mobile agent and the task mobile agent work in the presence of failures.

The volunteer autonomy failures Λ are different from crash failure in that the operating system is alive in spite of volunteer volatility failure Φ and volunteer interference failure Ψ, whereas it shuts down in the presence of crash failure [33, 34, 38]. Φ is different from crash failure in that Φ occurs by the will of volunteers [33, 34, 38]. Ψ is different from Φ in that a peer to peer grid computing system is alive in spite of Ψ, whereas it is not operating in case of Φ.

The volunteer server detects the crash failure of an S-MA by using a timeout. Similarly, the S-MA detects the crash failure of a T-MA. To do so, the S-MA sends alive messages to its volunteer server, and the T-MA sends alive messages to the S-MA. The T-MAs in the D’ volunteer group do not send alive messages in order to reduce the management overhead. A volunteer can detect volunteer autonomy failures by itself because its operating system does not shut down. If a T-MA or an S-MA detects autonomy failures, it notifies its S-MA or volunteer server of them, respectively.

³ (continued) In this paper, the parameter λ′ represents the rate of volunteer autonomy failures. If we calculate the probability that a task is completed within the time interval ∆, we get $e^{-\Delta/\tau'}$ because $1/\lambda' = \tau'$.

4.5.1 Failure of S-MA

An S-MA rarely suffers from volunteer autonomy failures because it is executed at a deputy volunteer selected from the A’ volunteer group. The S-MA stores information such as scheduling group lists, the scheduling table, and task results in stable storage. If the S-MA fails, the information is sent to a new deputy volunteer. Fig. 10 shows the fault tolerant algorithm of the S-MA.

If a volunteer server detects the crash failure of an S-MA, a new deputy volunteer is selected by the algorithm of deputy volunteers selection in Fig. 8. After that, the S-MA and the scheduling information are sent to the newly selected deputy volunteer.

If an S-MA suffers from the volunteer volatility failure, it sends a VolatilityFailure message to the volunteer server. If the S-MA joins again during the volunteering time, it sends a Rejoin message to its volunteer server. If the volunteer server does not receive a Rejoin message within the interval after receiving the VolatilityFailure message, it sends the S-MA to a new deputy volunteer.

If an S-MA is at the edge of its reserved volunteering time, it sends an InAdvanceVolatilityFailure message to its volunteer server. In that case, the volunteer server responds with a candidate deputy volunteer, and the S-MA migrates to the candidate deputy volunteer.

In case of volunteer interference failure, an S-MA does not take any action because it can still perform the scheduling procedure, in the sense that the peer to peer grid system is alive.


[ If the volunteer server (VS) detects the crash failure of S-MA ]
Vm = SelectDeputyVolunteer(A’);
SendS-MA’(Vm);   // send S-MA’ (the latest checkpointed S-MA) to Vm

[ If Φ occurs ]
// At the S-MA side
S-MA’ = Checkpoint(S-MA);
Send VolatilityFailure message to VS;
Send S-MA’ to VS;
// At the volunteer server side
if (VS is informed of the volunteer volatility failure) then
    if (VS does not receive the Rejoin message within the interval) then
        Vm = SelectDeputyVolunteer(A’);
        SendS-MA’(Vm);
    fi;
fi;

[ At the edge of volunteering time ]
// At the S-MA side
S-MA’ = Checkpoint(S-MA);
Send InAdvanceVolatilityFailure message to VS;
if (S-MA receives the candidate deputy volunteer Vm) then
    MigrateS-MA’(Vm);
fi;
// At the VS side
if (VS receives InAdvanceVolatilityFailure message) then
    Vm = SelectDeputyVolunteer(A’);
    SendCandidateDeputyVolunteer(Vm);
fi;

Figure 10. Fault tolerant algorithm in presence of failures of S-MA
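The VolatilityFailure and Rejoin handling on the volunteer server side amounts to a per-agent timer. The following Python sketch shows one way to keep track of it. The class `RejoinTracker`, its method names, and the use of explicit timestamps passed by the caller are illustrative assumptions and are not part of the paper's implementation.

```python
from typing import Dict, List


class RejoinTracker:
    """Tracks agents that reported a volatility failure and may rejoin."""

    def __init__(self, interval: float) -> None:
        self.interval = interval            # allowed time to rejoin
        self.pending: Dict[str, float] = {}  # agent id -> report time

    def on_volatility_failure(self, agent_id: str, now: float) -> None:
        # VolatilityFailure message received: start the timer.
        self.pending[agent_id] = now

    def on_rejoin(self, agent_id: str) -> bool:
        # Rejoin message received: cancel the timer if one was running.
        return self.pending.pop(agent_id, None) is not None

    def expired(self, now: float) -> List[str]:
        # Agents that did not rejoin within the interval; the caller would
        # select a new deputy volunteer and send the checkpointed agent.
        late = sorted(a for a, t in self.pending.items()
                      if now - t > self.interval)
        for a in late:
            del self.pending[a]
        return late
```

The same bookkeeping applies one level down, where an S-MA waits for Rejoin messages from its T-MAs before rescheduling them.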

4.5.2 Failure of T-MA

A T-MA suffers from volunteer autonomy failures more frequently than an S-MA because its host has relatively low availability. The T-MA checkpoints the execution state at the rate of MTTVAF if checkpointing is used. Fig. 11, 12 and 13 show the fault tolerant algorithm of the T-MA.

If an S-MA detects the crash failure of a T-MA, it selects a new volunteer. If checkpointing is used, the S-MA sends the latest checkpointed T-MA’ to it. Otherwise, the S-MA redistributes the T-MA to the new volunteer. Each S-MA redistributes the T-MA within the number of redundancy r.

If a T-MA is at the edge of its reserved volunteering time, it sends an InAdvanceVolatilityFailure message to its S-MA. After receiving a candidate volunteer, it migrates to the candidate volunteer or is replicated.

If a T-MA suffers from volunteer volatility failure Φ, it takes a checkpoint of the execution of the task and then notifies its S-MA of Φ by means of a VolatilityFailure message. After that, if the S-MA does not receive any Rejoin message from the failed volunteer within a predefined time interval, it reschedules the T-MA. If checkpointing and migration are used, the S-MA migrates the T-MA’ to a new volunteer. Otherwise, the S-MA replicates the T-MA by the number of redundancy r.

If a T-MA suffers from volunteer interference failure Ψ, it takes a checkpoint of the execution. Then, if the execution is not restarted within the interval, the volunteer sends an InterferenceFailure message to its S-MA. After receiving a candidate volunteer, the T-MA migrates to the candidate volunteer or is replicated.

In the algorithm, there is no fault tolerance mechanism for the D’ volunteer group in the presence of failures during execution, in order to reduce management overhead. The D’ volunteer group executes tasks for test, for example, for the purpose of recalculating volunteer autonomy failures, volunteer availability, and volunteering service time.

[ If S-MA detects the crash failure of T-MA ]
Vm = SelectNewVolunteer();
if (checkpointing is used) then
    SendT-MA’(Vm);          // send T-MA’ (the latest checkpointed T-MA) to Vm
else
    RedistributeT-MA(Vm);   // redistribute T-MA to Vm
fi;

[ At the edge of reserved volunteering time ]
// At the T-MA side
if (the task is not finished) then
    if (T-MA ∈ A’ || T-MA ∈ C’ || T-MA ∈ B’) then
        Send InAdvanceVolatilityFailure message to S-MA;
    fi;
    if (T-MA receives the candidate volunteer Vm) then
        if (checkpointing is used) then
            MigrateT-MA’(Vm);    // migrate T-MA’ to Vm
        else
            ReplicateT-MA(Vm);   // replicate T-MA to Vm
        fi;
    fi;
fi;
// At the S-MA side
if (S-MA receives InAdvanceVolatilityFailure message) then
    Vm = SelectNewVolunteer();
    SendCandidateVolunteer(Vm);
fi;

[ Function : SelectNewVolunteer() ]
if (T-MA ∈ A’ volunteer group) then
    Vm = SelectNewVolunteer(A’);
else if (T-MA ∈ C’ volunteer group) then
    Vm = SelectNewVolunteer(C’, B’);
else if (T-MA ∈ B’ volunteer group) then
    Vm = SelectNewVolunteer(B’);
    SendT-MA(Vm);
fi;

Figure 11. Fault tolerant algorithm in presence of failures of T-MA (1)
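The group-dependent choice made by SelectNewVolunteer() in Fig. 11 can be captured by a small lookup table. In the Python sketch below, the pools per group follow Fig. 11 (A’ from A’; C’ from C’ or B’; B’ from B’; none for D’). Searching the pools in the listed order and taking the first idle volunteer are assumptions, as are the names `CANDIDATE_POOLS` and `select_new_volunteer`.

```python
from typing import Dict, List, Optional

# Groups searched for a replacement, per group of the failed T-MA.
CANDIDATE_POOLS: Dict[str, List[str]] = {
    "A'": ["A'"],
    "C'": ["C'", "B'"],
    "B'": ["B'"],
    "D'": [],          # no fault tolerance for D' (test execution only)
}


def select_new_volunteer(failed_group: str,
                         idle_by_group: Dict[str, List[str]]) -> Optional[str]:
    """Pick the first idle volunteer from the pools allowed for this group.

    idle_by_group maps a group name to its idle volunteers, assumed to be
    already ordered by alpha_v and then by Theta.
    """
    for group in CANDIDATE_POOLS[failed_group]:
        idle = idle_by_group.get(group, [])
        if idle:
            return idle.pop(0)
    return None
```

Returning `None` for D’ reflects that a failed test task in that group is simply not rescheduled.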

[ If Φ occurs ]
// At the T-MA side
T-MA’ = Checkpoint(T-MA);
if (T-MA ∈ A’ || T-MA ∈ C’ || T-MA ∈ B’) then
    Send VolatilityFailure message to S-MA;
    Send T-MA’ to S-MA;
fi;
// At the S-MA side
if (S-MA receives VolatilityFailure message) then
    if (S-MA does not receive the Rejoin message within the interval) then
        Vm = SelectNewVolunteer();
        if (checkpointing is used) then
            SendT-MA’(Vm);
        else
            ReplicateT-MA(Vm);
        fi;
    fi;
fi;

Figure 12. Fault tolerant algorithm in presence of failures of T-MA (2)

5. Discussion

Our MAASM adapts to a dynamic peer to peer grid computing environment. It performs different scheduling, fault tolerance, and replication algorithms on the basis of volunteer groups. It gains better performance than existing scheduling mechanisms. The advantages are described as follows.

1) The MAASM is based on volunteer groups which are constructed according to volunteer autonomy failures, volunteering service time, and volunteer availability, whereas existing scheduling mechanisms are based on individuals. It can apply various scheduling algorithms to the volunteer groups at the same time according to their properties.

2) The MAASM newly classifies and defines volunteer autonomy failures conceptually. It also provides a fault tolerance algorithm on a per group basis while distinguishing volunteer autonomy failures from crash failure. Moreover, it directly reflects the occurrence rate and form of volunteer autonomy failures in a scheduling procedure by constructing volunteer groups according to volunteer availability and volunteering service time. In contrast, existing peer to peer grid computing systems do not provide a group based fault tolerance algorithm.

3) The MAASM focuses on the adaptive properties of volunteers such as volunteer autonomy failures, volunteer availability, and volunteering service time in a scheduling procedure. On the other hand, existing scheduling mechanisms mainly focus on basic properties such as CPU, memory, and network capacities.

4) The MAASM exploits mobile agent technology in a distributed way without direct control by the volunteer server. In particular, the scheduling mobile agents apply different scheduling, fault tolerance, and replication algorithms to cope with environmental changes. As a result, the MAASM can adapt to a dynamic peer to peer grid computing environment and reduce the overhead which occurs in existing scheduling mechanisms.

[ If Ψ occurs ]
// At the T-MA side
T-MA’ = Checkpoint(T-MA);
if (T-MA is not restarted within the interval) then
    if (T-MA ∈ A’ || T-MA ∈ C’ || T-MA ∈ B’) then
        Send InterferenceFailure message to S-MA;
    fi;
    if (T-MA receives the candidate volunteer Vm) then
        if (checkpointing is used) then
            SendT-MA’(Vm);
        else
            ReplicateT-MA(Vm);
        fi;
    fi;
fi;
// At the S-MA side
if (S-MA receives InterferenceFailure message) then
    Vm = SelectNewVolunteer();
    SendCandidateVolunteer(Vm);
fi;

Figure 13. Fault tolerant algorithm in presence of failures of T-MA (3)

6. Implementation & Evaluation

6.1. Implementation

We implemented our adaptive scheduling mechanism on the basis of “Korea@Home” [36-39] and the “ODDUGI” mobile agent system [42-43]. The Korea@Home project attempts to harness massive computing power using the great number of PCs distributed over the Internet.⁴ In addition, ODDUGI, developed by our laboratory, is a mobile agent system that supports reliable, secure, and fault tolerant execution of mobile agents.

Fig. 14 shows a screen shot of execution in Korea@Home. Currently, Korea@Home has 6,744 volunteers, and 524 of them are active on average. We conducted performance measurements over one month (i.e., July 2005). Fig. 15 (a) and (b) show daily performance (412.43 Gflops at maximum and 352.46 Gflops on average) and hourly performance (356.53 Gflops at maximum and 265.09 Gflops on average), respectively. In Korea@Home, volunteers can take part in one of three kinds of applications: global risk management, new drug candidate discovery, and climate prediction. The CPU types of volunteers vary somewhat, but the majority have similar CPU performance. For example, Intel Pentium 4 accounts for about 55%, Pentium III for about 12%, Celeron for about 6%, and so on (see Fig. 16).

Figure 14. Screen shot of Korea@Home

[Figure 15: two plots, (a) and (b), each showing Volunteers and Gflops]

Figure 15. Performance Trace

Figure 16. CPU type

6.2. Evaluation

We evaluate our MAASM against an existing scheduling mechanism. Our evaluation focuses on how much performance improvement is gained depending on whether the volunteer groups are considered in a scheduling procedure or not. To this end, we intentionally set up volunteer groups which have different volunteering service time Θ and volunteer availability αv. We compare our adaptive scheduling mechanism with eager scheduling. In eager scheduling⁵ [7-9, 12-14], a volunteer asks its volunteer server for a new task as soon as it finishes its current task. As a result, the more eagerly a volunteer works, the more tasks it executes. There are many scheduling heuristics in grid computing environments, e.g., MCT, MET, SA, KPB, min-min, max-min, and sufferage heuristics [23, 24]. In this paper, we adopt eager scheduling among the existing scheduling heuristics because it is more straightforward and simpler than the other heuristics in grid computing. In particular, eager scheduling has been mainly used in dynamic peer to peer grid computing environments [7-9, 11-14] because it is more adaptive to the dynamic environment than the heuristics in grid computing.

⁴ Korea has a very advanced high-speed Internet infrastructure. It has been known that Korea has the best Internet utilization among the countries of the world. In Korea, PC penetration is about 77%, and high-speed Internet users are about 57%. These features distinguish Korea from other countries, for instance, the US, Singapore, France, or England. This advanced infrastructure benefits building a peer to peer grid computing (i.e., Internet-based distributed computing) environment with low costs and high efficiency.

⁵ Eager scheduling is very similar to FCFS (First Come First Served) or FIFO (First In First Out).

We make use of a simulation to evaluate our scheduling mechanism. Our simulation was conducted with real volunteers in Korea@Home. The application used is new drug candidate discovery. A task in the application takes 16 minutes of execution time on a dedicated Pentium 1.4 GHz. Table 3⁶ shows the simulation environment with different volunteer groups, volunteering service time, and volunteer availability. For each case of Table 3, 200 volunteers participate in our simulation during one hour. In Case 1, the A’ volunteer group has more volunteers than the other groups. In Case 2, more volunteers belong to the A’ and C’ volunteer groups than to the other groups. In Case 3, the A’ and B’ volunteer groups have more volunteers than the other groups. In Case 4, the D’ volunteer group has more volunteers than the other groups. On analyzing Table 3, we can observe that Case 1 has larger volunteer availability and volunteering service time than the other cases. Case 4 has smaller volunteer availability and volunteering service time than the other cases.

⁶ The volunteer groups vary over time. At first, the distribution of volunteer groups is similar to Case 1. As time goes on, the distribution changes into Cases 3 and 4. On average, the distribution is similar to Case 2.

Table 3. Simulation Environment

Case | Property | A’ | B’ | C’ | D’ | Total
Case 1 | # of vol. | 127 (63%) | 30 (15%) | 35 (17%) | 9 (5%) | 200
Case 1 | αv | 0.95 | 0.95 | 0.74 | 0.77 | 0.91
Case 1 | Θ (min.) | 43 | 15 | 31 | 11 | 35
Case 2 | # of vol. | 95 (47%) | 26 (13%) | 63 (32%) | 16 (8%) | 200
Case 2 | αv | 0.9 | 0.9 | 0.65 | 0.65 | 0.80
Case 2 | Θ (min.) | 40 | 14 | 28 | 9 | 30
Case 3 | # of vol. | 78 (39%) | 75 (37%) | 16 (8%) | 31 (16%) | 200
Case 3 | αv | 0.95 | 0.95 | 0.70 | 0.61 | 0.88
Case 3 | Θ (min.) | 31 | 11 | 25 | 8 | 20
Case 4 | # of vol. | 52 (26%) | 48 (24%) | 23 (12%) | 77 (38%) | 200
Case 4 | αv | 0.85 | 0.85 | 0.56 | 0.54 | 0.70
Case 4 | Θ (min.) | 28 | 9 | 22 | 7 | 15
# of vol.: the number of volunteers

Based on this simulation environment, our simulation is conducted 10 times for each case of Table 3. As shown in Table 3, the 200 volunteers have various volunteer autonomy failures, volunteer availability, and volunteering service time. We assume that the range of MTTVAF is 1/0.2 ∼ 1/0.02 minutes and that of MTTR is 3 ∼ 10 minutes. As the performance metrics for comparing the MAASM with eager scheduling, our simulations use the number of completed tasks and the number of redundancy. We also measure the number of completed tasks depending on whether replication is applied or not. We measure the two performance metrics on the basis of scheduling groups (i.e., A’D’ and C’B’).

Fig. 17 shows the average number of completed tasks. In Fig. 17, ES and AS represent existing eager scheduling and our MAASM, respectively. Also, AS(A’D’) and AS(C’B’) represent each scheduling group in the MAASM (note that the sum of AS(A’D’) and AS(C’B’) equals AS). As shown in Fig. 17, the MAASM completes more tasks than existing eager scheduling. The obtained results indicate the following factors. First, the A’ volunteer group plays an important role in gaining better performance. When the number of members in the A’ volunteer group decreases gradually (i.e., from Case 1 to Case 4), the number of completed tasks also decreases. Second, the number of members of the A’ and C’ volunteer groups is more important than that of the B’ and D’ volunteer groups. For example, Cases 1 and 2 have more completed tasks than Cases 3 and 4. Third, volunteer availability is tightly related to performance improvement. For instance, Case 1, with the highest volunteer availability, has completed more tasks than the other cases. On the other hand, the completed tasks of Case 4, with the lowest volunteer availability, are fewer than those of the other cases. Finally, as the number of members in the A’ volunteer group decreases gradually and the number of members of the B’ and D’ volunteer groups increases, the difference between the MAASM and the eager scheduling increases. This result is anticipated in the sense that, in the eager scheduling, the failed or suspended tasks in the A’, B’, C’ or D’ volunteer groups are redistributed to low quality volunteers interchangeably. On the other hand, since the MAASM performs scheduling on a per group basis, this undesired situation does not happen. For example, the failed or suspended tasks in the C’ volunteer group are not distributed to the B’ and D’ volunteer groups. The difference in Case 1 is smaller than in the other cases because there are more members of the A’ volunteer group than of the other groups. In other words, the undesired situations rarely happen in Case 1.
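For reference, the eager scheduling baseline used in this comparison can be sketched as a FIFO task queue from which any volunteer pulls work, with no regard to volunteer groups. The Python below is only an illustration: the class `EagerScheduler`, its method names, and the choice to put a failed task at the back of the queue are assumptions, not details taken from the paper or from the cited systems.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Optional


class EagerScheduler:
    """FIFO baseline: a volunteer asks for a task whenever it is free."""

    def __init__(self, tasks: Iterable[str]) -> None:
        self.queue: Deque[str] = deque(tasks)
        self.running: Dict[str, str] = {}   # volunteer -> task

    def request_task(self, volunteer: str) -> Optional[str]:
        # Any volunteer may take the next task, regardless of its group.
        if not self.queue:
            return None
        task = self.queue.popleft()
        self.running[volunteer] = task
        return task

    def report_done(self, volunteer: str) -> Optional[str]:
        return self.running.pop(volunteer, None)

    def report_failure(self, volunteer: str) -> Optional[str]:
        # The failed task returns to the queue and may land on any volunteer.
        task = self.running.pop(volunteer, None)
        if task is not None:
            self.queue.append(task)
        return task
```

Because `request_task` ignores the quality of the requester, a task that failed on a C’ volunteer can be handed to a B’ or D’ volunteer next, which is the undesired situation described above.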

Figure 17. The average number of completed tasks


Fig. 18 shows the average number of completed tasks when replication is used to tolerate volunteer autonomy failures for Case 2. In Fig. 18, the tick value 1.0 on the x axis actually represents 0.99 (refer to Eq. 1). From this figure, we can see that as the reliability threshold increases, the number of completed tasks becomes smaller. The obtained results indicate that more tasks must be replicated to support higher reliability.

Figure 18. The average number of completed tasks in case of replication in Case 2

Fig. 19 shows the average number of redundancy for Case 2. Our MAASM has a lower number of replicas than the eager scheduling because the scheduling mobile agent applies the replication algorithm to each volunteer group. That is, it adaptively adjusts the number of redundancy according to the rate of volunteer autonomy failures of each volunteer group. In addition, the A’D’ scheduling group has a smaller number of redundancy than the C’B’ scheduling group because the A’ volunteer group has higher volunteer availability and volunteering service time than the C’ volunteer group. Since the C’ volunteer group suffers from volunteer autonomy failures more frequently than the A’ volunteer group, the former has a greater number of redundancy than the latter. Therefore, in the case of the A’ volunteer group, a small number of redundancy satisfies the reliability threshold. In the case of the C’ volunteer group, more replication is required to meet the reliability threshold. As a result, the A’ volunteer group can execute more tasks because it can reduce the overhead of replication. Finally, as the required reliability increases, the number of redundancy increases.

Figure 19. The average number of redundancy in Case 2

Fig. 20 shows the average number of completed tasks in case of replication. In Fig. 20, the value of 0.8 is used as the reliability threshold. Compared with Fig. 17, the difference between the MAASM and the eager scheduling is larger. In the MAASM, the A’ volunteer group can complete more tasks because it has a relatively small number of replicas. On the other hand, the eager scheduling does not consider homogeneous groups, so the following undesirable situation happens repeatedly. Suppose that a volunteer in the C’ volunteer group suffers from volunteer autonomy failures. In this case, its failed task should be distributed to a new volunteer. In the eager scheduling, the new volunteer is selected without considering volunteer groups. If the new volunteer belongs to the B’ or D’ volunteer group, it would also fail because of the high rate of volunteer autonomy failures. If volunteers with low quality are selected continuously, the task is continuously redistributed to other volunteers until a volunteer with high quality is chosen. Such an undesirable situation happens frequently and repeatedly if many volunteers belong to the B’, C’, or D’ volunteer group. Thus, the difference between the MAASM and the eager scheduling in Cases 3 and 4 is larger than in Cases 1 and 2.

Figure 20. The average number of completed tasks in case of replication (reliability threshold = 0.8)
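
The repeated redistribution described above for eager scheduling can be illustrated with a small calculation. The following Python sketch rests on assumptions that are ours and not part of the evaluation: each newly selected volunteer is drawn at random according to the group shares, completes the task with a fixed per-group probability, and the task is redistributed until some volunteer completes it. The function name `expected_assignments`, the group shares, and the completion probabilities are hypothetical and chosen only for illustration.

```python
from typing import Mapping


def expected_assignments(group_share: Mapping[str, float],
                         success_prob: Mapping[str, float]) -> float:
    """Expected number of volunteers a task is handed to until it completes.

    Assumes (hypothetically) that each selection is independent, so the number
    of assignments is geometrically distributed with success probability q.
    """
    if abs(sum(group_share.values()) - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    # q: probability that one randomly selected volunteer completes the task
    q = sum(share * success_prob[group] for group, share in group_share.items())
    if q <= 0.0:
        raise ValueError("at least one group must be able to complete tasks")
    return 1.0 / q


# Illustrative values only; they are not taken from the measurements.
shares = {"A'": 0.25, "B'": 0.25, "C'": 0.25, "D'": 0.25}
completion = {"A'": 0.8, "B'": 0.2, "C'": 0.4, "D'": 0.2}

group_blind = expected_assignments(shares, completion)        # any group
group_aware = expected_assignments({"A'": 1.0}, completion)   # A' only
print(f"group-blind: {group_blind:.2f}, restricted to A': {group_aware:.2f}")
```

Under these assumed values, selecting volunteers without regard to groups requires about twice as many assignments per task as selecting only from the A’ volunteer group, and the gap widens as the share of low-quality groups grows.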

Fig. 21 shows the number of redundancy for all cases. As the number of members in the A’ volunteer group decreases, the difference between the MAASM and eager scheduling increases. For example, Case 1 has the largest A’ volunteer group, so the number of redundancy of the MAASM is close to that of eager scheduling. Since Case 2 has many members in the A’ and C’ volunteer groups, the gap between the MAASM and eager scheduling is larger than in Case 1. Similar results are observed in Cases 3 and 4. Compared with eager scheduling, the MAASM has a smaller number of redundancy because, unlike eager scheduling, it calculates the number of redundancy separately for each volunteer group. In the MAASM, volunteer groups with a high rate of volunteer autonomy failures require a high number of redundancy, and vice versa. Consequently, the MAASM completes more tasks than eager scheduling. In particular, the A’ volunteer group can complete more tasks because it has a smaller number of redundancy than under eager scheduling, as shown in Fig. 21.

Figure 21. The average number of redundancy
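
To illustrate why a volunteer group with a lower failure rate needs fewer replicas for the same reliability threshold, the following Python sketch computes a replica count under a simplifying assumption that is ours and not part of the MAASM: each replica fails independently with a fixed per-group probability, and a task succeeds if at least one replica completes. The function name `replicas_needed` and the failure probabilities are hypothetical; the replication algorithm actually used by the MAASM (see Eq. 1) may differ.

```python
def replicas_needed(failure_prob: float, reliability_threshold: float) -> int:
    """Smallest r such that 1 - failure_prob**r >= reliability_threshold.

    Assumes independent replica failures (an illustrative model only).
    """
    if not 0.0 <= failure_prob < 1.0:
        raise ValueError("failure_prob must be in [0, 1)")
    if not 0.0 <= reliability_threshold < 1.0:
        # A threshold of exactly 1.0 cannot be met with finitely many replicas.
        raise ValueError("reliability_threshold must be in [0, 1)")
    replicas = 1
    all_fail = failure_prob  # probability that every replica so far fails
    while 1.0 - all_fail < reliability_threshold:
        replicas += 1
        all_fail *= failure_prob
    return replicas


# Hypothetical per-group failure probabilities, for illustration only.
group_failure_prob = {"A'": 0.1, "C'": 0.5}
for group, p in group_failure_prob.items():
    print(group, replicas_needed(p, 0.8))
# With these assumed values: A' needs 1 replica, C' needs 3.
```

In this model, raising the threshold from 0.8 to 0.99 increases the replica count for the less reliable group from 3 to 7, which matches the qualitative trend that the number of redundancy grows with the required reliability.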


7. Related Work

7.1. Scheduling and Fault tolerance

AgentTeamwork [18] proposed a mobile agent based PC Grid middleware. AgentTeamwork makes use of mobile agents for resource search and job migration in the presence of crash failures. It is similar to our system in the sense that it uses mobile agents. However, our mechanism uses mobile agents for scheduling: the scheduling mobile agent implements different scheduling algorithms suitable for each volunteer group according to volunteering service time and volunteer availability.

Kondo et al. [19, 20] classified clients (i.e., volunteers in this paper) into two classes, conservative and extreme, mainly according to location (i.e., home or workplace) and network. Conservative clients model home PCs. They have relatively slow network bandwidth and are sparsely used during after-work hours. Extreme clients, in contrast, model PCs at the workplace. They have relatively fast broadband networks and are used all day. In addition, conservative clients are 90% idle, whereas extreme clients are 80% idle. Kondo et al. [19, 20] measured scheduling mechanisms (i.e., timeout and duplication) for five client groups with the following conservative/extreme percentages: 100%/0%, 75%/25%, 50%/50%, 25%/75%, and 0%/100%. Although they classified clients, they used the classification only to set up simulation environments. In contrast, our MAASM directly exploits the classification to apply different scheduling, fault tolerance, and replication strategies to each volunteer group by means of mobile agents. Besides, our classification and scheduling mechanisms consider volunteer autonomy failures, volunteering service time, and volunteer availability at the same time in a scheduling procedure, whereas those of Kondo et al. are mainly based on CPU capacity, location, and network state.

In the SETI@Home [1] project, the central server is responsible for scheduling and management of volunteers, so it has high overhead.
A volunteer takes a checkpoint periodically (every ten minutes). Each time a volunteer is interrupted by the user or a system failure, the computation resumes from the last saved checkpoint. The SETI@Home project is based on BOINC [15], which is a middleware for public resource computing. BOINC provides redundant computing to deal with erroneous computational results. However, SETI@Home and BOINC do not provide a scheduling mechanism on a per group basis.

XtremWeb [9-11] proposed a FIFO scheduling scheme. It also proposed a spot-checking mechanism on the basis of property testing for result certification; property testing reduces the sample size needed for spot-checking. Bayanihan [7, 8] proposed a sabotage-tolerance mechanism to tolerate malicious volunteers. It tolerates erroneous results from malicious volunteers by using majority voting and spot-checking mechanisms, which are based on the eager scheduling algorithm. Javelin [12, 13] proposed an advanced tree-based eager scheduling algorithm. However, XtremWeb, Bayanihan, and Javelin do not propose scheduling on the basis of volunteer groups. They also do not exploit volunteer autonomy failures, volunteering service time, and volunteer availability in the scheduling phase. Besides, they reduce the overhead of the central server by using multiple servers, whereas our MAASM makes use of mobile agents.

CCOF [21, 22] proposed the Wave scheduler, in which host nodes (i.e., volunteers) are classified into night zones and day zones. The Wave scheduler is based on a timezone-aware overlay network using CAN. In the Wave scheduler, when morning comes to a host node, it randomly selects a host node (i.e., volunteer) in a new target night zone for migration. However, CCOF does not consider volunteer groups constructed according to volunteer autonomy failures and volunteer availability during classification. In addition, it does not apply different scheduling, fault tolerance, and replication strategies to each zone.

Maheswaran et al. [23] proposed dynamic matching and scheduling heuristics for independent tasks in heterogeneous computing systems (i.e., computational grids). They evaluated on-line mode heuristics (MCT, MET, SA, and KPB) and batch mode mapping heuristics (Min-min, Max-min, and Sufferage). Casanova et al. [24] modified existing scheduling heuristics (Min-min, Max-min, and Sufferage) for a grid computing environment. They consider input and output data transfer time when computing MCT (Minimum Completion Time), and take into account locality of files. They also proposed XSufferage, in which the sufferage value is computed with cluster-level MCT. Berman et al. [25] proposed application level scheduling (AppLeS) for adaptive application scheduling in a grid computing environment. AppLeS uses an AppLeS agent which is customized for its particular application on potential target resources by means of a performance model. However, these existing mechanisms are based on a computational grid computing environment. They do not provide scheduling, fault tolerance, and replication algorithms on the basis of volunteer groups.
They mainly focus on CPU load and network bandwidth in a scheduling procedure, whereas our MAASM concentrates on volunteer availability, volunteering service time, and volunteer autonomy failures, which occur more frequently in a peer to peer grid computing environment.

In Condor [17], checkpointing is used to migrate a task from a non-idle workstation to an idle workstation (in this case, the two workstations should be homogeneous). Condor proposed ClassAds for matchmaking and a priority based scheduling algorithm. However, Condor is better suited for intra-organizational use, whereas a peer to peer grid computing system is based on the Internet. Besides, Condor does not apply different scheduling, fault tolerance, and replication strategies to each volunteer group.

Kondo et al. [26] proposed two kinds of availability: host availability and CPU availability. Host availability is the probability that a host is reachable and up. CPU availability is a probability that quantifies the fraction of the CPU that can be exploited by an application. In addition, they estimated CPU availability experimentally. Bhagwan et al. [27] empirically characterized host availability in a P2P computing environment. Our volunteer availability is similar to CPU availability. However, we conceptually classify and define volunteer availability, which is derived from volunteer autonomy failures. In addition, we provide a volunteer group based scheduling mechanism to directly exploit the availability.


7.2. Replication

Replication is a well-known technique to improve reliability and performance in distributed systems [33, 34]. Some studies have been made on replication in Grid and P2P computing environments [28-32]. Li and Mascagni [28] proposed computational replication to improve performance in a large-scale computational grid. They showed how to determine the number of task replicas needed to meet performance goals on the basis of node and network failure rates. Ranganathan and Foster [29] proposed dynamic replication strategies for a data grid environment. They provided six different strategies to replicate large amounts of data for the purpose of reducing bandwidth consumption and access latency. Ranganathan et al. [30] proposed dynamic model-driven replication to improve data availability in large peer to peer communities. They provide methods not only to compute the number of replicas per file, but also to determine the location for a new replica on the basis of storage and transfer costs. Cohen and Shenker [31] proposed replication strategies in unstructured peer to peer networks. They proposed uniform, proportional, and square-root replication to minimize the expected search size. Cuenca-Acuna et al. [32] proposed replication to increase the availability of shared data in unstructured peer to peer systems. They replicated files using an erasure code. Bayanihan [8] proposed a sabotage-tolerance mechanism for volunteer computing systems. The proposed mechanism tolerates erroneous results from malicious volunteers by using majority voting and spot-checking mechanisms. In majority voting, the same task is replicated at different volunteers as many times as the number of redundancy required to meet the desired error rate.

Most replication approaches focus on data replication (i.e., replicating data or files), whereas only Li and Mascagni [28] and Bayanihan [8] deal with computational replication (i.e., replicating the execution of a task). Data replication is mainly used to improve data availability and access time in a peer to peer network or a data grid computing environment. On the other hand, computational replication is mainly used for fault tolerance or result certification in a computational grid computing environment. Li and Mascagni [28] used replication for fault tolerance (i.e., tolerating node and network failures), whereas Sarmenta [8] used it for result certification (i.e., tolerating malicious volunteers). However, existing computational replication mechanisms do not provide a replication algorithm on the basis of volunteer groups constructed according to volunteer autonomy failures, volunteer availability, and volunteering service time.
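
As a rough illustration of replication used for result certification, the following Python sketch accepts a task result only if a strict majority of the replicated executions agree. It is a minimal sketch under our own assumptions, not the exact voting scheme of Bayanihan [8]; the function name `majority_vote` and the sample results are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the result reported by a strict majority of replicas, else None."""
    if not results:
        return None
    # most_common(1) yields the single most frequent (value, count) pair
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


# Three replicas of the same task; one volunteer returns an erroneous result.
accepted = majority_vote([42, 42, 7])
print(accepted)  # 42
```

When no value reaches a strict majority, the function returns `None`, in which case a scheduler would have to replicate the task further before accepting a result.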

8. Conclusion & Future Work

We proposed a mobile agent based adaptive scheduling mechanism (MAASM) for a peer to peer grid computing environment. In peer to peer grid computing, the volatility and heterogeneous properties of volunteers should be considered in scheduling procedures because they are closely related to the completion of computation as well as to performance. The MAASM reflects the volatility and heterogeneous properties of volunteers in a scheduling procedure by means of volunteer groups and mobile agents. The volunteer groups are constructed according to the properties of volunteers such as volunteer autonomy failures, volunteer availability, and volunteering service time. The MAASM dynamically applies different scheduling, fault tolerance, and replication algorithms to each volunteer group without direct control of the volunteer server. To this end, the MAASM exploits mobile agent technology, which is adaptable to a dynamic peer to peer grid computing environment.

The evaluation results showed that the MAASM achieves better performance and reduces the overhead of computation. In particular, the MAASM completes more tasks than eager scheduling. In case of replication, the MAASM completes many more tasks than eager scheduling by reducing the number of redundancy. With regard to volunteer groups, the evaluation results showed that the larger the number of volunteers in the A’ and C’ volunteer groups is, the larger the number of completed tasks. Also, as the number of volunteers in the B’ and D’ volunteer groups increases, the difference between the MAASM and eager scheduling increases. In addition, the MAASM can reduce the overhead of the volunteer server by using scheduling mobile agents for each volunteer group in a distributed way.

We are currently constructing volunteer groups based on volunteer credibility in addition to volunteering service time and volunteer availability. Volunteer credibility is related to result certification, which detects and tolerates erroneous results. We are modifying and extending the scheduling, fault tolerance, and replication algorithms to meet the requirements on credibility.

Acknowledgment

This work was supported by the Korea Institute of Science and Technology Information (KISTI).

References

[1] SETI@home, http://setiathome.ssl.berkeley.edu
[2] Distributed.net, http://distributed.net
[3] D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu, “Peer-to-Peer Computing”, HP Laboratories Palo Alto HPL-2002-57, March 2002.
[4] D. Barkai, “Peer-to-Peer computing: Technologies for Sharing and Collaborating on the Net”, 2002.
[5] I. Foster and A. Iamnitchi, “On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing”, 2nd International Workshop on Peer-to-Peer Systems (IPTPS’03), February 2003.
[6] F. Berman, G. C. Fox, and A. J. G. Hey, “Grid Computing: Making the Global Infrastructure a Reality”, Wiley, 2003.


[7] L. F. G. Sarmenta, S. Hirano, “Bayanihan: Building and Studying Volunteer Computing Systems Using Java”, Future Generation Computer Systems Special Issue on Metacomputing, Vol. 15, No. 5/6, 1999.
[8] L. F. G. Sarmenta, “Sabotage-Tolerance Mechanisms for Volunteer Computing Systems”, FGCS, 18(4), 2002.
[9] G. Fedak, C. Germain, V. Neri, and F. Cappello, “XtremWeb: A Generic Global Computing System”, CCGrid’01 workshop on Global Computing on Personal Devices, pp. 582-587, May 2001.
[10] O. Lodygensky, G. Fedak, F. Cappello, V. Neri, M. Livny, D. Thain, “XtremWeb & Condor: sharing resources between Internet connected Condor pool”, CCGrid’03 workshop on Global and Peer-to-Peer Computing on Large Scale Distributed Systems, pp. 382-389, May 2003.
[11] C. G. Renaud, N. Playez, “Result Checking in Global Computing Systems”, ICS’03, pp. 226-233, June 2003.
[12] M. O. Neary, S. P. Brydon, P. Kmiec, S. Rollins, and P. Cappello, “Javelin++: Scalability Issues in Global Computing”, Concurrency: Practice and Experience, pp. 727-735, December 2000.
[13] M. O. Neary, P. Cappello, “Advanced eager scheduling for Java-based adaptive parallel computing”, Concurrency and Computation: Practice and Experience, Volume 17, Issue 7-8, pp. 797-819, 2005.
[14] A. Baratloo, M. Karaul, Z. Kedem, and P. Wyckoff, “Charlotte: Metacomputing on the Web”, The 9th ICPDCS, 1996.
[15] D. P. Anderson, “BOINC: A System for Public-Resource Computing and Storage”, GRID’04, pp. 4-10, November 2004.
[16] A. Chien, B. Calder, S. Elbert, and K. Bhatia, “Entropia: architecture and performance of an enterprise desktop grid system”, Journal of Parallel and Distributed Computing, Volume 63, Issue 5, pp. 597-610, 2003.
[17] D. Thain, T. Tannenbaum, and M. Livny, “Distributed Computing in Practice: The Condor Experience”, Concurrency and Computation: Practice and Experience, Volume 17, Issue 2-4, pp. 323-356, 2005.
[18] M. Fukuda, Y. Tanaka, N. Suzuki and L. F. Bic, “A Mobile-Agent-Based PC Grid”, Autonomic Computing Workshop AMS’03, pp. 142-150, June 2003.
[19] D. Kondo, H. Casanova, E. Wing, F. Berman, “Models and scheduling mechanisms for global computing applications”, IPDPS’02, pp. 79-86, April 2002.
[20] D. Kondo, A. A. Chien, H. Casanova, “Resource Management for Rapid Application Turnaround on Enterprise Desktop Grids”, SC’04, pp. 19-30, April 2004.


[21] V. Lo, D. Zhou, D. Zappala, Y. Liu, and S. Zhao, “Cluster Computing on the Fly: P2P Scheduling of Idle Cycles in the Internet”, IPTPS 2004, 2004.
[22] D. Zhou and V. Lo, “Wave Scheduler: Scheduling for Faster Turnaround Time in Peer-based Desktop Grid Systems”, JSSPP’05, 2005.
[23] M. Maheswaran, S. Ali, H. J. Siegel, D. Hensgen, and R. F. Freund, “Dynamic Matching and Scheduling of a Class of Independent Tasks onto Heterogeneous Computing Systems”, HCW’99, pp. 30-44, April 1999.
[24] H. Casanova, A. Legrand, D. Zagorodnov, F. Berman, “Heuristics for scheduling parameter sweep applications in grid environments”, HCW 2000, pp. 349-363, 2000.
[25] F. Berman, R. Wolski, H. Casanova, W. Cirne, H. Dail, M. Faerman, S. Figueira, J. Hayes, G. Obertelli, J. Schopf, G. Shao, S. Smallen, N. Spring, A. Su, and D. Zagorodnov, “Adaptive Computing on the Grid Using AppLeS”, IEEE Transactions on Parallel and Distributed Systems, Vol. 14, No. 4, pp. 369-382, 2003.
[26] D. Kondo, M. Taufer, J. Karanicolas, C. L. Brooks, H. Casanova and A. Chien, “Characterizing and Evaluating Desktop Grids: An Empirical Study”, IPDPS’04, pp. 26-35, April 2004.
[27] R. Bhagwan, S. Savage, G. M. Voelker, “Understanding Availability”, The 2nd International Workshop on Peer-to-Peer Systems, February 2003.
[28] Y. Li and M. Mascagni, “Improving Performance via Computational Replication on a Large-Scale Computational Grid”, CCGRID 2003, pp. 442-448, May 2003.
[29] K. Ranganathan and I. Foster, “Identifying Dynamic Replication Strategies for a High-Performance Data Grid”, GRID 2001, pp. 75-86, November 2001.
[30] K. Ranganathan, A. Iamnitchi and I. Foster, “Improving Data Availability through Dynamic Model-Driven Replication in Large Peer-to-Peer Communities”, CCGRID 2002, pp. 346-351, May 2002.
[31] E. Cohen and S. Shenker, “Replication Strategies in Unstructured Peer-to-Peer Networks”, SIGCOMM’02, pp. 177-190, October 2002.
[32] F. M. Cuenca-Acuna, R. P. Martin and T. D. Nguyen, “Autonomous Replication for High Availability in Unstructured P2P systems”, SRDS 2003, pp. 99-108, October 2003.
[33] P. Jalote, “Fault Tolerance in Distributed Systems”, Prentice-Hall, 1994.
[34] A. S. Tanenbaum and M. V. Steen, “Distributed Systems: Principles and Paradigms”, Prentice Hall, 2002.


[35] K. S. Trivedi, “Probability and Statistics with Reliability, Queuing and Computer Science Applications”, Second Edition, Wiley, 2002.
[36] Korea@Home, http://www.koreaathome.org/eng/
[37] M. Baik, S. Choi, C. Hwang, J. Gil, H. Yu, “Adaptive Group Computation Approach in the Peer-to-Peer Grid Computing Systems”, AGridM 2004, September 2004.
[38] S. Choi, M. Baik, C. Hwang, J. Gil, and H. Yu, “Volunteer Availability based Fault Tolerant Scheduling Mechanism in Desktop Grid Computing Environment”, NCA-AGC2004, pp. 476-483, August 2004.
[39] S. Choi, M. Baik, C. Hwang, J. Gil and H. Yu, “Mobile Agent based Adaptive Scheduling Mechanism in Peer-to-peer Grid Computing”, ICCSA 2005, LNCS 3483, pp. 936-947, May 2005.
[40] P. Maes, R. H. Guttman, and A. G. Moukas, “Agents That Buy and Sell”, Communications of the ACM, Vol. 42, No. 3, pp. 81-91, March 1999.
[41] D. Wong, N. Paciorek, and D. Moore, “Java-based Mobile Agents”, Communications of the ACM, Vol. 42, No. 3, pp. 92-102, March 1999.
[42] ODDUGI mobile agent system, http://oddugi.korea.ac.kr/
[43] S. Choi, M. Baik, and C. Hwang, “Location Management & Message Delivery Protocol in Multi-region Mobile Agent Computing Environment”, ICDCS 2004, pp. 476-483, March 2004.
