SECURITY AND COMMUNICATION NETWORKS Security Comm. Networks 2011; 4:216–238 Published online 05 January 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/sec.287

RESEARCH ARTICLE

Internet epidemiology: healthy, susceptible, infected, quarantined, and recovered

Suleyman Kondakci* and Cemali Dincer

Faculty of Eng. & Computer Sciences, Izmir University of Economics, Sakarya Caddesi No.156, 35330 Balcova-Izmir, Turkey

ABSTRACT This paper presents a recurrent epidemic model (REM) to explore the dynamics of Internet epidemiology through the phases from susceptibility to recovery. From both theoretical and practical standpoints, it differs from bare worm propagation modeling in two main ways. First, it defines a unique stochastic model of a general infection spread. Second, it models the recovery process as a stochastic queueing system that partitions the diagnosis, quarantine, disinfection, and recovery processes and complements them with a recurrent failure-repair management model, which is unique. How to model the propagation patterns of infections, together with the accompanying recovery models needed to manage infected individuals effectively, remains an open question. REM is a unique concept for determining the parameters used to estimate the recovery efficiency of disrupted systems and for developing long-term recovery strategies under different epidemic situations. Existing infection and worm propagation models can also be used together with REM to analyse the necessary quarantine and recovery processes. REM can also be applied to classify the phases of epidemic dynamics and the states of affected systems in general, and can serve as a guideline for developing stochastic simulations of various types of systems with recurrent state dynamics in order to facilitate their reliability analysis. Copyright © 2011 John Wiley & Sons, Ltd.

KEYWORDS worm propagation modeling; recovery modeling; queueing; simulation; Internet epidemiology; stochastic modeling

*Correspondence Suleyman Kondakci, Faculty of Eng. & Computer Sciences, Izmir University of Economics, Sakarya Caddesi No.156, 35330 Balcova-Izmir, Turkey.

1. INTRODUCTION We present here a new epidemic model for managing the lifecycle of infection phases, carrier propagation, and recovery in large, scale-free networks such as the Internet. By its nature, epidemiology touches on several areas in which a vast amount of research on epidemic modeling and simulation exists. However, apart from the model presented here, models of infection handling, recovery, and resource management are difficult to find. Thus, in contrast to recent propagation models, we also present detailed approaches to modeling the recovery of failed systems. The recurrent epidemic model (REM) offers an extended state-transition structure for analysing the necessary system recovery functionality. To achieve the desired accuracy, we define a complex stochastic model consisting of a sequence of stochastically dependent states. In REM, a transmitter node cannot remain a transmitter forever, nor can a recovered node remain immune indefinitely. Recovered nodes will always return to the healthy

population (state), where some may become susceptible and, consequently, reinfected. The uniqueness of REM lies in the fact that it concretely defines the recovery phases of infected nodes and analyses them in depth alongside the infection and propagation phases. Although several stochastic worm propagation models exist [1-6], REM is a purely stochastic system with five major states: healthy, susceptible, infected, quarantined, and recovered. The models in [1-6], except [2], focus mainly on worm scanning and propagation, while REM focuses more on the quarantine and recovery processes. Infected computers do not remain in the infected state indefinitely; sooner or later they return to the healthy state, although they may become susceptible again. Unlike REM, many worm models do not consider recurrence or the modeling of recovery operations; they mostly focus on worm scanning strategies [7] and propagation modeling. Classical epidemic models are limited in that most ignore the details of recovery modeling and, in general, omit the discussion


of stochastic behavior of Internet epidemiology. Indeed, as substantiated in Ref. [1], despite the wide acceptance of classical deterministic models, the underlying propagation behavior of random scan worms has been shown to be stochastic; see [1] for further discussion of the classical Kermack-McKendrick model and the stochastic model considering random constant scan worms. Although REM deals with viral infections inside a computer network, it can also be applied to sparsely interconnected networks of various types of populations. Indeed, an analogy can be drawn between biological epidemiology, e.g., Refs. [8-10], and computer epidemiology. Accordingly, a variety of epidemic models have been adapted to computer epidemics. Three of the most generally discussed are briefly introduced in Ref. [11]. Malware carrier mechanisms (agents) transmit viruses to many individuals, often in a stochastic manner. The most widely encountered carrier mechanisms are e-mails, instant messengers, network fileshares, scan worms, and computer programs shared among individuals over the Internet. We use two sets of terminology here: (1) the terms user-agents, mailboxes, instant messages, and malware are used interchangeably to denote suspicious carrier mechanisms; (2) the terms node, host, vertex, machine, computer, and system are also used interchangeably. We define an infected node as a repairable system, since the node can be disinfected and restored to operation after a failure. Although there exist several types of infection strategies (peer-to-peer (P2P) fileshares, e-mail attachments, steganographic contents, executable e-mail contents, and messengers), our discussions consider random scan worms.
Though various aspects of Internet worms can be studied, such as worm intelligence, discovery, deployment, and defense, we focus on the recovery characteristics of infections and the resource management needed to control an eventual outbreak. This is necessary because, during an epidemic outbreak, it is important to understand which capabilities and strategies are required to manage the outbreak effectively. We need to explore the time-dependent states of systems in order to determine functions representing the spread and recovery rates and the queueing characteristics of infection discovery, quarantine, and recovery of the infected systems. The determination of these functions plays a central role in obtaining efficient quarantine and recovery strategies that increase the operational reliability and overall system efficiency. REM defines a set of flow parameters that represent the growth and extinction rates of infections. The flow parameters focus on critical points of intervention in the infection and recovery pathway. Ideal protection solutions can be achieved by appropriately matching these flow parameters to victim systems, so that realistic worm propagation figures and effective management strategies can be obtained by configuring these parameters. Anti-virus applications can customize and apply the flow parameters to strengthen the worm discovery, disinfection, and protection mechanisms.

Recovery facilities can also benefit from this model by establishing effective resource allocations based on long-run model emulations. Earlier, it was assumed that infected individuals, at any given time, have fixed and equal probabilities for the infectious, quarantined, and recovered states. This assumption holds in many existing models [9]; in reality, however, there is a random latent period between the time a node is visited by a worm and the time it becomes a transmitter. Likewise, the quarantine of an infected node might be subject to a random delay before it is cured. The delivery of the recovered systems to the healthy population is also subject to random delays. As also stated in Ref. [2], many models assume constant transition rates from the susceptible to the transmitter state, and from the transmitter to the recovered state, i.e., these models consider homogeneous objects and infectors so that state transitions occur with a constant probability. In reality, attacked systems are non-homogeneous and it is impossible to accurately predict when an infected node will become a transmitter. The passage time from being infected to becoming a transmitter (infectious) is called the latent time period, the length of which depends on the way worms operate and on the protection characteristics of the computers we use. For example, if a virus is deployed via e-mail attachments, and users on a given node do not access incoming e-mails at all, then the latent time period is infinite. On the other hand, some worms such as CodeRedII [12] and SQL-Slammer [13] may infect vulnerable systems as soon as they are scanned. As discussed in Ref. [14], recently developed analytical models have been used to generate propagation trends that match these historic worm outbreaks but ignore network traffic characteristics. Other models ignore the randomness of the latent time periods.
However, in reality, we should not expect the flow rates to be constant, because of different network topologies, worm deployment techniques, operations, infection patterns, quarantine, and recovery conditions. In some cases, users do not read e-mails, or do not execute vulnerable applications for a considerable length of time. In these cases, the classical epidemic models, e.g., Susceptible-Infected-Recovered (SIR) models [9], will fail to present an appropriate scheme. Another shortcoming of some existing models is that when a node recovers, it stays immune to the virus forever. However, depending on the level of user vigilance, it is quite common for the same Internet virus to infect the same computers repeatedly. The REM model does not directly consider the defense strategies needed to slow down the spreading rate of worms; rather, it models the central functions and parameters that can be used to specify the characteristics of propagation and recovery processes. These functions can be applied to determine appropriate recovery strategies by using them at different points of the worm propagation and recovery chain in order to optimize the long-term immunization process. As presented here, the methodology of REM and its mathematical properties are analyzed and verified thoroughly, and simulated and tested with relatively small network structures (ca. 100 nodes); however, its relevance to


real-world observations of extreme sizes needs further experimentation. Realistic distributions of the infectious periods of some existing worms, with regard to mobility within a network structure and among a variety of network structures, should also be tested in REM. The distributions of infection growth, infectious duration, and mean infection rates of some known worms in real-world settings also need to be examined with regard to metapopulation theory. Regardless of the theoretical tractability of a model, the effect of connectivity between spatially separated populations (metapopulations) on infection and recovery dynamics is an important factor that may even force the modification of any existing propagation model, including REM, if detailed observations in real-world settings become possible. In order to assess the recovery efficiency of REM, some existing propagation models can also be incorporated into REM and tested together.

1.1. Outline In the following, Section 2 presents a brief overview of related work and existing models, Section 3 introduces the structure of the REM model, and Section 4 deals with the recurrent algorithm and state analysis of the model. Section 5 presents the theoretical details of the phases and the state transitions of REM. Experiments, simulations, and some results regarding the recovery and system efficiency are given in Section 6. Section 7 concludes the paper.

2. RELATED WORK Enormous effort has been invested in developing worm propagation models, e.g., Refs. [5,15-18], most of which have been derived from the classical SIS and SIR epidemic models, while others are more general and descriptive. Stochastic worm modeling, which can lead to more realistic worm management structures, has been studied more recently [2]. In a comprehensive work, Ref. [1] presents a density-dependent Markov jump process model for random constant scanning (RCS) worms. Ref. [1] also discusses the modeling and detection properties of RCS worms in a mathematically rigorous fashion, and presents a hybrid deterministic/stochastic model for the observations of a worm's scanning behavior. Another realistic stochastic epidemic model, based on a finite Markovian process, is presented in Ref. [2], which mainly determines the state transition dynamics of systems for estimating the infection, quarantine, and recovery rates of susceptible systems. A survey of Internet worms and related trends is presented in Ref. [19], which discusses the concepts and research situations of Internet worms, their execution mechanisms, scanning strategies, propagation models, and the critical techniques used for prevention. An anatomy of worms and an analysis of their potency within a specific network is presented in Ref. [20]. A study of mass-mailing worms is given in Ref. [21], a taxonomy of computer worms


is presented in Ref. [22], and a global view of the spread of the Witty worm and its powerful features is given in Ref. [23]. A survey and comparison of Internet worm detection and containment schemes is presented in Ref. [24], where the authors analyze and compare different detection algorithms based on worm characteristics by identifying the types of worms that can and cannot be detected by these schemes. A Galton-Watson branching process model for characterizing the propagation of RCS worms is considered in Ref. [25], which deals with both uniform scanning worms and preference scanning worms. It also presents the development of an automatic worm containment strategy to prevent the spread of a worm before an initial outbreak occurs. Models are often accompanied by a simulation in order to study various parameters, attack scenarios, and impacts. However, as stated in Ref. [26], a realistic model simulation covering all aspects of Internet worm propagation requires extensive resources and is severely limited when run on a single-CPU system. In general, two main aspects have been studied so far, worm propagation modeling and defense strategies dealing with various worm types (i.e., scan worms and user-activated viruses), but recovery modeling has not necessarily been considered. Scan worms are active codes that automatically scan and infect systems in a branching manner. User-activated viruses infect victims through user intervention; they often arrive as e-mail attachments or are injected into P2P fileshares and instant messages. These aspects are dealt with in a number of studies; for example, Ref. [27] considers an e-mail worm simulation model that accounts for the behaviors of e-mail users, including the frequency of e-mail reading and the probability of opening an e-mail attachment.
To study topological impacts, [27] compares e-mail worm propagation on a power-law topology with worm propagation on two other topologies: a small-world topology and a random-graph topology. It concludes that the impact of the power-law topology on the spread of e-mail worms is mixed, i.e., e-mail worms spread more quickly on a power-law topology than on a small-world topology or a random-graph topology, but immunization defense is more effective on a power-law topology. Instant messengers and P2P fileshares are becoming increasingly popular throughout the Internet community. Worm propagation through these systems is also intensifying. A statistical modeling and analysis of instant messaging worms is presented in Ref. [28], and a P2P worm detection and containment approach is given in Ref. [29]. The topology of P2P networks has an important effect on P2P active worm spreading; however, it is very difficult to model this process [30]. For this reason, only a very limited number of propagation models considering active worms in P2P networks, based on discrete-time methods, have been proposed so far [30]. We also need effective proactive worm detection and response approaches in order to cope with the increasing number of worm types and attack scenarios; e.g., Ref. [31] considers an integrated proactive framework for defense against worms spreading through the Internet. As it states,


the framework is intended for worm detection by recognizing the scanning actions of hosts in a network, where the containment of worm spreading is performed by limiting and blocking the network packets transmitted by infected hosts. Early detection of unknown worms must be NP-hard; in relation to this, [32] proposes a new method for detecting unknown worms by using the hop number distribution of network packets received by a host. It can be intuitively verified that preventing Internet epidemics can be very difficult. An end-to-end architecture called Vigilante is proposed in Ref. [33] to decelerate Internet epidemics. Vigilante relies on collaborative worm detection at end hosts, but does not require hosts to trust each other. There are several practical defense techniques used to slow down the propagation of worms, e.g., [34], where a distributed anti-worm architecture is proposed to automatically slow down or suspend worm propagation. An automated e-mail virus detection and control scheme using an attachment chain-tracing technique, based on classical epidemiology, is presented in Ref. [35]. An analytical model supported with simulations is presented in Ref. [36], which analyzes different deployment strategies of rate control mechanisms in order to limit the contact rate of worm traffic. An interesting work applying scale-down techniques for approximating global Internet worm dynamics by shrinking the effective size of the network under consideration is given in Ref. [37], where the Slammer worm was simulated to experiment with the worm dynamics. A hybrid quarantine technique is used in Ref. [38] to study the strengths and weaknesses of two complementary worm quarantine strategies under various worm attack profiles.

3. OVERALL STRUCTURE OF REM We first define an atomic model in which only a single node is considered. A Bayesian network representation describing the causal behavior of a single-node infection, which was proposed in Ref. [2], is given in Figure 1. A susceptible node (S), which has connections to some infectious nodes (I), will be infected, depending on the strength of the attacks and the protection mechanisms of

Figure 1. The lifecycle of a single node infection, where depicts a clean node.


the node. If discovered, the infected node will be quarantined (Q); otherwise it can become a transmitter (T), that is, an active worm carrier that propagates the infection through all vulnerable contacts. Consequently, this branching process will continue spreading throughout the Internet until an extinction state is reached. A susceptible node, if protected, will stay healthy (H). A quarantined node will either (most probably) become healthy, stay quarantined, or become a transmitter if it could not be recovered. In the following, we present the main model, which can be applied to a wide range of epidemic structures with queueing characteristics. A stochastic model is an appropriate approach for this purpose. We assume that all nodes in a given subnetwork (or network) are protected to some extent and, further, that there is at least one infected node in a given network. The infected node is potentially able to spread the virus to neighboring nodes if they are not vaccinated. This means that, at any time, some nodes can be susceptible or contain viruses that are able to infect others. As shown in Figure 2, REM consists of four major phases describing the main states of the network nodes, which are defined to model: (1) susceptibility, (2) infection, (3) quarantine, and (4) recovery characteristics. The protection level against infections affects the susceptibility of the system, while the strength of the worm affects the infection phase. The recovery characteristics depend mainly on the infection detection skills, extinction methods, inherent anti-virus mechanisms, and the quarantine techniques. There are two important parameters: susceptibility and the vigilance of the quarantine techniques (or service). The number of contacts with infected nodes and the protection level (or the immunity) of the nodes determine the susceptibility parameter.
The susceptibility level of any node depends on the contacting node's status (whether it is infectious or has links to infectious nodes) and the following factors:
• The infectious node is accessed/contacted via its local file shares; viruses can be hidden in network folders and files that are shared among computers.
• The infectious node is accessed via P2P file shares; viruses can be hidden in external network folders and in files that are shared over P2P connections.
• The e-mail agent of the infectious node sends infected mail messages to recipients who accept the messages and inadvertently activate the attached viruses.
• Scan worm penetrations via the infectious node exploit known vulnerabilities in the applications of victim computers, e.g., the Slammer worm [13].

The vigilance parameter is mostly related to failures caused by humans, administrative skills, and the effectiveness of the automated protection systems, if any. As will be detailed later in Section 5, the recurrent model also considers the necessary latent time intervals between subsequent states. For instance, there is a random latent time period during which a system may be infected but not yet a virus transmitter. There is yet another delay time


which occurs before a system is delivered to the healthy population after it has been recovered. For the analysis of the states, we have two classes of stochastic processes: one dealing with the susceptible and infectious populations and the other dealing with the quarantined and recovered populations. Both processes are considered in detail throughout Section 5. Here, as an example, we briefly present the definition of the stochastic behavior of a susceptible population and assume the same procedure for the evolution of the remaining populations. Let $H$ be the total number of healthy hosts, including susceptible ones, $h_0$ be the number of randomly selected hosts at time $t = 0$, and the random variable $\xi(t)$ be the number of successes in a series of $h_0$ Bernoulli trials with probability of success $p(t)$. The values of $h_0$ and $p(t)$ are such that the distribution of $\xi(t)$ is very accurately a Poisson distribution, i.e., the probability of finding exactly $S$ susceptible hosts is given by

$$P\{\xi(t) = S\} = \frac{\beta^{S}}{S!}\, e^{-\beta}, \quad S = 0, 1, 2, \ldots, \tag{1}$$

where $\beta = E\xi(t) = h_0\, p(t)$ is the average number of hosts found susceptible in $t$ time units. Or, for smaller population sizes with higher probability compared to the above (Poisson) process and for $S = 0, 1, \ldots, H$, we have a binomially distributed probability for the susceptibility growth

$$P\{\xi(t) = S\} = \binom{H}{S}\, p(t)^{S}\, [1 - p(t)]^{H-S}. \tag{2}$$

Equations (1) and (2) yield, as $t \to \infty$,

$$H \to \infty, \quad \beta \to 0, \quad \beta H \to \delta_s.$$
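As an illustration of how closely Equations (1) and (2) can agree, the following Python sketch evaluates both probability mass functions for the same number of trials. The function names and the values of `h0` and `p` are hypothetical, chosen only for illustration; they are not taken from the experiments in this paper.

```python
import math


def poisson_pmf(s: int, beta: float) -> float:
    """Equation (1): probability of finding exactly s susceptible hosts."""
    return beta ** s / math.factorial(s) * math.exp(-beta)


def binomial_pmf(s: int, n: int, p: float) -> float:
    """Equation (2), evaluated for n Bernoulli trials with success probability p."""
    return math.comb(n, s) * p ** s * (1.0 - p) ** (n - s)


# Hypothetical values for illustration only (not from the paper).
h0 = 1000      # number of randomly selected hosts
p = 0.005      # probability that a selected host is found susceptible
beta = h0 * p  # mean number of susceptible hosts, beta = h0 * p(t)

# Largest pointwise difference between the two distributions for S = 0..50.
max_gap = max(
    abs(poisson_pmf(s, beta) - binomial_pmf(s, h0, p)) for s in range(51)
)
print(f"beta = {beta}, largest gap for S <= 50: {max_gap:.2e}")
```

With many trials and a small success probability the two curves are close, which is the regime in which the Poisson form (1) is a reasonable stand-in for the binomial form (2).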

4. THE RECURRENT ALGORITHM As can be seen from the following state equations and Figure 2, the model is recurrent: its major property is that recovered systems return to the healthy population, where they can again become susceptible and be infected by other virus types. The recurrent model is described by the following state equations, which modify the related populations over the respective time units $\tau_H = t + \Delta_H$, $\tau_S = t + \Delta_S$, $\tau_I = t + \Delta_I$, $\tau_Q = t + \Delta_Q$, and $\tau_R = t + \Delta_R$:

$$\begin{aligned}
h(\tau_H) &= h(t) + \int_t^{\tau_H} \big[\overleftarrow{h}(u)\, s(u) + \mu(u)\, r(u) - \beta(u)\, h(u)\big]\, du \\
s(\tau_S) &= s(t) + \int_t^{\tau_S} \big[\beta(u)\, h(u) - [\overleftarrow{h}(u) + \gamma(u)]\, s(u)\big]\, du \\
i(\tau_I) &= i(t) + \int_t^{\tau_I} \big[\gamma(u)\, s(u) - \kappa(u)\, i(u)\big]\, du \\
q(\tau_Q) &= q(t) + \int_t^{\tau_Q} \big[\kappa(u)\, i(u) - \alpha(u)\, q(u)\big]\, du \\
r(\tau_R) &= r(t) + \int_t^{\tau_R} \big[\alpha(u)\, q(u) - \mu(u)\, r(u)\big]\, du
\end{aligned} \tag{3}$$

Figure 2. Diagram of the REM model representing each state of spread and extinction processes.

The stochastic rate parameters defining the state transitions are as follows:
• healthy to susceptible $\beta(u)$,
• susceptible to healthy $\overleftarrow{h}(u)$,
• susceptible to infected $\gamma(u)$,
• infected to quarantined $\kappa(u)$,
• quarantined to recovered $\alpha(u)$, and
• recovered to healthy $\mu(u)$.

State parameters $h(t)$, $s(t)$, $i(t)$, $q(t)$, and $r(t)$ contain the instantaneous number of nodes (entities) in the transition epoch $u$ for the related state. The state equation set given in (3) can be represented in a simplified discrete form. That is, the number of entities found in each state during an interval $(t, t + \Delta t]$ is expressed using the discrete-time proportions $\mathit{hs}(t)$, $\mathit{sh}(t)$, $\mathit{si}(t)$, $\mathit{iq}(t)$, $\mathit{qr}(t)$, and $\mathit{rh}(t)$. Setting $\Delta t$ equal for all states as the time epoch, we can write a discrete model expressing the states of the REM algorithm defined by Equation (3). Thus, by referring to Figure 2 and ignoring all constants ($\delta_{\{s,i,q,r,h\}}$) and the flow-back parameters $\overleftarrow{h}$, $\overleftarrow{s}$, $\overleftarrow{i}$, $\overleftarrow{q}$, $\overleftarrow{r}$, and $\overleftarrow{H}$, the number of entities found in each state at $t + \Delta t$ can be expressed in the form state$(t + \Delta t)$, giving the equation set

$$\begin{aligned}
h(t + \Delta t) &= h(t) + \mathit{sh}(t) + \mathit{rh}(t) - \mathit{hs}(t) \\
s(t + \Delta t) &= s(t) + \mathit{hs}(t) - \mathit{sh}(t) - \mathit{si}(t) \\
i(t + \Delta t) &= i(t) + \mathit{si}(t) - \mathit{iq}(t) \\
q(t + \Delta t) &= q(t) + \mathit{iq}(t) - \mathit{qr}(t) \\
r(t + \Delta t) &= r(t) + \mathit{qr}(t) - \mathit{rh}(t)
\end{aligned} \tag{4}$$

These equations can be easily verified by considering the changes in the input and output values of the states. To do so, we first define the leakage of a state, written $\nabla$, as the number of entities emanating from it during $\Delta t$, and the increment, written $\Delta$, as the number of entities entering the state. Hence, the leakages of the five states are as follows:

$$\begin{aligned}
\nabla h(t) &= h(t)\,\beta(t)\,\delta_s(t) \\
\nabla s(t) &= s(t)\,\gamma(t)\,\delta_i(t) + s(t)\,\overleftarrow{h}(t) = s(t)\,[\gamma(t)\,\delta_i(t) + \overleftarrow{h}(t)] \\
\nabla i(t) &= i(t)\,\kappa(t)\,\delta_q(t) \\
\nabla q(t) &= q(t)\,\alpha(t)\,\delta_r(t) \\
\nabla r(t) &= r(t)\,\mu(t)\,\delta_h(t)
\end{aligned} \tag{5}$$
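The discrete update of Equation (4) can be sketched in a few lines of Python. This is a minimal illustration, not the simulator used for the experiments: it assumes constant rates, sets all delta parameters to 1, and takes each flow as the rate multiplied by the source population (for example, $\mathit{hs}(t) = h(t)\beta(t)$). The names `Rates` and `rem_step` and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    beta: float    # healthy -> susceptible
    h_back: float  # susceptible -> healthy (flow-back parameter)
    gamma: float   # susceptible -> infected
    kappa: float   # infected -> quarantined
    alpha: float   # quarantined -> recovered
    mu: float      # recovered -> healthy


def rem_step(state, rates):
    """Apply one time step of Equation (4) to (h, s, i, q, r)."""
    h, s, i, q, r = state
    hs = h * rates.beta    # healthy to susceptible
    sh = s * rates.h_back  # susceptible back to healthy
    si = s * rates.gamma   # susceptible to infected
    iq = i * rates.kappa   # infected to quarantined
    qr = q * rates.alpha   # quarantined to recovered
    rh = r * rates.mu      # recovered to healthy
    return (
        h + sh + rh - hs,
        s + hs - sh - si,
        i + si - iq,
        q + iq - qr,
        r + qr - rh,
    )


# Hypothetical starting populations and rates, for illustration only.
state = (90.0, 5.0, 3.0, 1.0, 1.0)
rates = Rates(beta=0.2, h_back=0.1, gamma=0.3, kappa=0.5, alpha=0.4, mu=0.25)
new_state = rem_step(state, rates)
print(new_state, sum(new_state))
```

Every flow leaves one state and enters another, so the total population is unchanged by a step; this is a quick sanity check on any implementation of Equation (4).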

Let us now consider the instantaneous increment at time $t$ in the susceptible population illustrated in Figure 3. At time $t$, the proportion of the healthy population moving to the susceptible population is given as

$$\mathit{hs}(t) = h(t)\,\beta(t).$$

The increment $\Delta s(t)$ in the susceptible population at time $t$ is (Figure 3)

$$\Delta s(t) = h(t)\,\beta(t)\,\delta_s(t). \tag{6}$$

Thus, the number of susceptible nodes at time $t + \Delta t$ is the current value $s(t)$ plus the increment $\Delta s(t)$ minus the decrement (leakage) $\nabla s(t)$:

$$s(t + \Delta t) = s(t) + \Delta s(t) - \nabla s(t). \tag{7}$$

By setting the delta parameters $\delta_{\{s,i,q,r,h\}} = 1$ and plugging $\nabla s(t)$ from Equation (5) into Equation (7), we get

$$s(t + \Delta t) = s(t) + \Delta s(t) - s(t)\,\gamma(t) - s(t)\,\overleftarrow{h}(t).$$

Figure 3. Change in rates between the healthy and susceptible population states of REM.


Substituting $\Delta s(t)$ from Equation (6) and, again, setting the delta parameter $\delta_s(t) = 1$, we obtain

$$s(t + \Delta t) = s(t) + h(t)\,\beta(t) - s(t)\,\gamma(t) - s(t)\,\overleftarrow{h}(t). \tag{8}$$

Since

$$\mathit{hs}(t) = h(t)\,\beta(t), \quad \mathit{si}(t) = s(t)\,\gamma(t), \quad \mathit{sh}(t) = s(t)\,\overleftarrow{h}(t),$$

Equation (8) becomes

$$s(t + \Delta t) = s(t) + \mathit{hs}(t) - \mathit{sh}(t) - \mathit{si}(t) = s(t) + \Delta s(t) - \nabla s(t),$$

which verifies the solution for the susceptibility part of the equation set (4). Hence, applying the same procedure as above, the numbers in each of the remaining populations can be found to be

$$\begin{aligned}
i(t + \Delta t) &= i(t) + \Delta i(t) - \nabla i(t) \\
q(t + \Delta t) &= q(t) + \Delta q(t) - \nabla q(t) \\
r(t + \Delta t) &= r(t) + \Delta r(t) - \nabla r(t) \\
h(t + \Delta t) &= h(t) + \Delta h(t) - \nabla h(t)
\end{aligned} \tag{9}$$

Figure 4. Effect of high initial susceptibility and rapid quarantine process.

We can simplify this model to a classical deterministic model as

$$\begin{aligned}
\frac{dH}{dt} &= \mu R - \beta H \\
\frac{dS}{dt} &= \beta H - \gamma S \\
\frac{dI}{dt} &= \gamma S - \kappa I \\
\frac{dQ}{dt} &= \kappa I - \alpha Q \\
\frac{dR}{dt} &= \alpha Q - \mu R
\end{aligned} \tag{10}$$

A corresponding Kermack-McKendrick model describing disease propagation with finite-time immunity is defined as

$$\begin{aligned}
\frac{dS}{dt} &= -\beta S I + \delta R \\
\frac{dI}{dt} &= \beta S I - \gamma I \\
\frac{dR}{dt} &= \gamma I - \delta R
\end{aligned} \tag{11}$$

As shown by Equations (3), (4), and (10), compared to existing models (e.g., Equation (11)), REM contains two additional states explicitly dealing with quarantine and recovery operations. A simulation result is shown in Figure 4, where the initial susceptibility $\beta$ and infection $\gamma$ parameters are chosen for a homogeneous network with a constant number of contacts. The recovery process involved a single repairman and random discovery of failed machines. The initial parameters were set as $\beta = 0.7999$, $\gamma = 0.000126$, $\kappa = 1.0$, $\alpha = 0.1613$, $\mu = 0.09677$.

4.1. Changes in populations

During a small time interval, a proportion of the healthy population flows to the susceptible group with flow rate $\beta$; see Figure 2. Similarly, a portion of the susceptible population will become infected with a flow rate of $\gamma$. A proportion of the infected population will be delivered to quarantine with a flow rate of $\kappa$, and a fraction of the quarantined group will flow to the recovered group with a rate of $\alpha$. Finally, the recovered nodes will be tested and delivered back to the healthy population at the rate of $\mu$. These recursive operations produce the current number of entities, which are kept in five variables expressing the states of the model: $h(t)$ denotes the current number of healthy nodes, $s(t)$ the number of susceptible nodes, $i(t)$ the number of infected nodes, $q(t)$ the number of quarantined nodes, and $r(t)$ the number of recovered nodes. Hence, a small proportion of each population changes in a small time interval $\Delta t$ according to

$$\begin{aligned}
\mathit{hs}(t) &= h(t)\,\beta(t) \\
\mathit{sh}(t) &= s(t)\,\overleftarrow{h}(t) \\
\mathit{si}(t) &= s(t)\,\gamma(t) \\
\mathit{iq}(t) &= i(t)\,\kappa(t) \\
\mathit{qr}(t) &= q(t)\,\alpha(t) \\
\mathit{rh}(t) &= r(t)\,\mu(t)
\end{aligned} \tag{12}$$
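To see how the flows of Equation (12) balance, consider the constant-rate setting of the deterministic model (10), in which there is no flow-back from susceptible to healthy. At a fixed point every forward flow is equal ($\beta H = \gamma S = \kappa I = \alpha Q = \mu R$), so each population is inversely proportional to its outflow rate. The Python sketch below computes this fixed point. The function name `rem_equilibrium` and the population size of 100 are assumptions made for illustration; the rate values are the initial parameters quoted for Figure 4, and treating them as constant is a further assumption, so the output is not a reproduction of that figure.

```python
def rem_equilibrium(total, beta, gamma, kappa, alpha, mu):
    """Fixed point of Equation (10) for constant rates and a fixed total population.

    Returns the populations (H, S, I, Q, R), each proportional to 1 / outflow rate.
    """
    weights = [1.0 / rate for rate in (beta, gamma, kappa, alpha, mu)]
    scale = total / sum(weights)  # common value of every forward flow
    return tuple(w * scale for w in weights)


# Initial parameters quoted for Figure 4, treated here as constants (assumption).
params = dict(beta=0.7999, gamma=0.000126, kappa=1.0, alpha=0.1613, mu=0.09677)
h, s, i, q, r = rem_equilibrium(100.0, **params)

# Forward flows of Equation (12) evaluated at the fixed point.
flows = (
    params["beta"] * h,
    params["gamma"] * s,
    params["kappa"] * i,
    params["alpha"] * q,
    params["mu"] * r,
)
print(f"H={h:.3f} S={s:.3f} I={i:.3f} Q={q:.3f} R={r:.3f}")
print("flows:", flows)
```

Under these assumptions the state with the smallest outflow rate (here the susceptible state, because $\gamma$ is by far the smallest rate) holds most of the population in the long run.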


The equilibrium conditions of the rates with regard to ﬂow-back parameters are normalized as follows:

$$
\begin{aligned}
\overleftarrow{H}(t) + \beta(t) &= 1 \\
\overleftarrow{s}(t) + \overleftarrow{h}(t) + \gamma(t) &= 1 \\
\overleftarrow{i}(t) + \kappa(t) &= 1 \\
\overleftarrow{q}(t) + \alpha(t) &= 1 \\
\overleftarrow{r}(t) + \mu(t) &= 1
\end{aligned}
\tag{13}
$$

Table I. State transitions and the associated flow parameters of the recurrent states. State space = {H, S, I, Q, R}.

From → To     Jump rate                Description                   Flow ctrl.
H → S         β                        Healthy to susceptible        δs
              $\overleftarrow{h}$      Susceptible to healthy        --
              $\overleftarrow{H}$      Healthy to healthy            --
S → I         γ                        Susceptible to infected       δi
              $\overleftarrow{i}$      Infected to infected          --
              $\overleftarrow{s}$      Susceptible to susceptible    --
I → {Q, R}    κ                        Infected to quarantined       δq
              α                        Quarantined to recovery       δr
              $\overleftarrow{q}$      Quarantined to quarantined    --
R → H         µ                        Recovered to healthy          δh
              $\overleftarrow{r}$      Recovered to recovered        --

Recall that, at any time epoch, a network contains both susceptible and infected nodes. Some of the infected nodes have contacts with susceptible nodes, and a suitable contact with any infectious node can infect a susceptible node that is not protected. Assume also that each infective node has a constant number of contacts with the susceptible nodes per epoch (e.g., per hour), that a random number of the infected hosts are quarantined, and that a fraction of the quarantined group will recover during a finite number of epochs. These events are modeled as stochastic functions (or states) in the model; see Section 5 for details. The model comprises a chain of interdependent state functions, where each function performs its task in the propagation chain of a state machine. Each function receives its input from its predecessor and sends its output to the next function in the chain; see Figure 2 and Equation (4). The model is thus a finite state machine whose operations are recursive and whose state changes are stochastic. The output of each state fluctuates during the iterative processes performed for the infection, quarantine, and recovery operations.

4.2. State control parameters

In order to construct a realistic epidemic system model, we define a set of flow control parameters to study the dynamic characteristics of the propagation, i.e., the infection, annihilation, quarantine, extinction, failure, and repair characteristics of the system. To express such a stochastic system, the state rates are described by the following normalized equations, which modify the rates at each time slot t + Δt:

$$
\begin{aligned}
\beta(t + \Delta t) &= [hs(t)\,\delta_s(t)] \le 1 \\
\gamma(t + \Delta t) &= [si(t)\,\delta_i(t)] \le 1 \\
\kappa(t + \Delta t) &= [iq(t)\,\delta_q(t)] \le 1 \\
\alpha(t + \Delta t) &= [qr(t)\,\delta_r(t)] \le 1 \\
\mu(t + \Delta t) &= [rh(t)\,\delta_h(t)] \le 1
\end{aligned}
\tag{14}
$$

Each of these equations determines the individual flow density between consecutive states, so that changes in the states occur in a normalized form from time t to t + Δt. The non-negative iid random parameters (rates) β, γ, κ, α, and µ express the stochastic behavior of the rates at which the state

transitions occur. The delta parameters δ{s,i,q,r,h} control the flow density between the states in the chain. Table I summarizes the state transitions and related parameters of the entire model shown in Figure 2. The numbers in the states are modified by the system (state transition) probabilities described in Section 5, which further depend on the individual population size, the state transition probabilities, and environmental factors such as user vigilance and repair service capabilities. The delta parameters are described below.

Initial susceptibility parameter δs: The openness of a system, characterized by the level of border protection (e.g., e-mail server protection), the number of connections to vulnerable hosts, and the average connectivity of a node with infectious nodes are the major factors that affect the initial susceptibility of a network. This parameter regulates the rate of susceptibility and epidemic growth. Generally, it is useful in determining the epidemic threshold and the potential of the incoming threats. In other words, δs controls the flow rate from the healthy to the susceptible state by controlling worm scans. For example, the SQL-Slammer worm has a high value for δs.

Infection parameter δi: The individual strength parameter δi defines the strength of a node, which in turn regulates the rate of infection flow from that node. In addition to the e-mail server protection mentioned above, individual nodes may also be strengthened, or they may already have the required immunity against some known threats. Latent time factors for infections are preferably configured with δi. A node visited by a worm may have a latent period before becoming infectious; in this case δi, also called the susceptible-to-infectious parameter, is the most appropriate parameter to use.

User vigilance parameter δq: Virus detection abilities and user vigilance are modeled with the user vigilance parameter δq. User awareness, infection discovery techniques, and quarantine queue handling use this parameter. For example, increasing the infectious-to-quarantine rate can lead to shorter recovery times through the effective use of quarantine time. δq is also called the infectious-to-quarantine parameter.

Quarantine parameter δr: The quarantine efficiency parameter δr controls the rate at which the infected nodes are cured (diagnosis and removal operations) and delivered to the acceptance test by the recovery function. There are various quarantine techniques; therefore, in many cases we need to analyse the consequences of quarantine delays. In short, the efficiency of the required quarantine operations is controlled with δr, which is also called the quarantine-to-recovery parameter.

Recovery parameter δh: The recovery parameter δh controls the efficiency of harnessing the recovered systems. More precisely, δh controls the rate at which a recovered system is patched, configured, and returned to the healthy population. For example, a delay in service availability or reduced throughput due to a shortage of repairmen can be modeled with δh, also called the operational (or recovery-to-healthy) parameter.
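As a minimal sketch of how the normalized updates in Equation (14) might be coded, the following Python fragment multiplies a transition probability by its flow control parameter and clips the product to 1. The function names `next_rate` and `self_rate`, the clipping rule, and the sample numbers are illustrative assumptions, not part of the model's specification.

```python
def next_rate(transition_prob: float, delta: float) -> float:
    """Rate for the next time slot, e.g. beta(t + dt) = hs(t) * delta_s(t).

    Assumption: the bound "<= 1" in Equation (14) is enforced by clipping.
    """
    if transition_prob < 0 or delta < 0:
        raise ValueError("rates and flow control parameters must be non-negative")
    return min(1.0, transition_prob * delta)


def self_rate(jump_rate: float) -> float:
    """Complementary 'stay' rate, following the form of Equation (13)."""
    return 1.0 - jump_rate


# Hypothetical values: transition probability 0.4, flow control parameter 0.5
beta = next_rate(0.4, 0.5)
stay_healthy = self_rate(beta)
```

Clipping is only one way to respect the bound; rescaling the delta parameters so that the product never exceeds 1 would serve equally well.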

5. EVOLUTION OF THE STATES

To simplify the reliability analysis of the affected nodes, we classify the machine reliability states into two groups of processes, failure analysis and repair analysis. The failure analysis makes use of the "patient" (or infection) parameter tuples (β, δs) and (γ, δi), and the repair analysis uses the "care service" parameter tuples (κ, δq), (α, δr), and (µ, δh). Appropriate modeling of these parameters is important because they are used to determine the overall state of an optimum system and to describe the protection strength of individual nodes as well as the quarantine and recovery efficiencies of the failed (infected) systems. The patient parameters δs and δi mainly depend on the protection strength of the environment, which can be determined either analytically or empirically, as appropriate. As presented in the following, these parameters are determined from composite probabilistic models based on the average connectivity ⟨k⟩ with infectious systems and on the susceptibility, infection, quarantine, and recovery rates.

We use here the notation and conventions of queueing theory. Service requests (arrivals) for repairs and repair completions (departures) are modeled as a first-come-first-served (FCFS) Markovian queue with exponentially distributed interarrival and repair times. That is, the REM model can be thought of as a typical queueing system with some specific statistical properties that are described in the following sections.

From a coarse perspective, the infection and extinction processes take place in four consecutive phases: (1) initial triggering and susceptibility growth, (2) epidemic outbreak (infection spread), (3) discovery and delivery of infected hosts to the quarantine service, and (4) the recovery phase. The evolution of these phases assumes that at least one host is initially infected (initial triggering), so that a phase of susceptibility growth takes place as a binomially distributed process. Following this phase, the spread of worms through the susceptible population takes place as an exponentially distributed process. The infected hosts are then inspected, discovered, quarantined, and delivered to the queue of the repair service. Following the repair process, the repaired nodes are patched, equipped with anti-virus software (vaccinated), and merged into the healthy population.

5.1. Initial susceptibility and epidemic outbreak

A host is susceptible if it maintains contact with one or more infectious hosts. A susceptible host that is not sufficiently protected will become infectious (a transmitter), and thus a cascade of worm spread will start slowly, intensify, and annihilate over the course of time. The cascade process is modeled as a random branching process with a binomial generating function; see Refs. [39,40] for a review of branching processes. A branching process is a Markov process that models a population in which each individual in generation n produces a random number of individuals in generation n + 1, according to a fixed probability distribution that does not vary from individual to individual. This holds for the same type of worm attacking the same type of machines; otherwise, the probability distribution differs across worm types and machines. Hence, the initial worm activities take place in two consecutive phases: an initial epidemic threshold and an outbreak, which follows the initial threshold time. Prior to the initial epidemic threshold there is a large number of susceptible hosts, while the probability of an initial outbreak is relatively small. The spread rate of worms is mainly affected by the worm type and the number of susceptible contacts with infectious hosts. For example, in an insufficiently protected network, the Nimda worm [41] has an initial infection rate of P(n) = (1−n/78.6)−0.463 expressed by the number of infectious hosts, n, that have contacts with the newly infected host.

More formally, assume that a number of susceptible individuals are selected at random and inoculated with the virus.
There is a probability ξk(t) per time unit at which the infectious node k will transmit the virus to a neighboring susceptible node. Therefore, if a susceptible node ϑ has n independent infectious neighbors during the time t, the total probability P(ϑ) that the node becomes infected during that time unit is

$$
P(\vartheta) = 1 - \prod_{i=1}^{n} \left[1 - \xi_i(t)\right]
$$
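The product formula above can be sketched in a few lines of Python. The function name `infection_probability` and the three sample transmission probabilities are hypothetical; the only modeling assumption is the one stated in the text, namely that the infectious neighbors act independently.

```python
from math import prod
from typing import Sequence


def infection_probability(xi: Sequence[float]) -> float:
    """P = 1 - prod_i (1 - xi_i) for independent infectious neighbors."""
    if any(not 0.0 <= p <= 1.0 for p in xi):
        raise ValueError("each transmission probability must lie in [0, 1]")
    # The node stays clean only if every neighbor fails to transmit.
    return 1.0 - prod(1.0 - p for p in xi)


# Hypothetical example: three infectious neighbors
p_infected = infection_probability([0.1, 0.2, 0.05])
```

With no infectious neighbors the empty product equals 1, so the function returns 0, which matches the intuition that an isolated node cannot be infected.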

5.1.1. Degree of connections and critical threshold. Due to its growth behavior and vertex structure, the Internet bears the characteristics of a scale-free network, which is defined by the power-law degree distribution $P(k) \sim k^{-\Upsilon}$, $2 \le \Upsilon \le 3$. The degree k of a node is defined as the number of links adjacent to it, i.e., the number of connections to other vertices. That is, the probability that a node has connections to k other vertices is given by the power-law probability distribution P(k). Scale-free networks are characterized by two major properties, growth and preferential attachment. The growth property implies that the number of nodes in a network increases in time, while the latter refers to the fact that new nodes tend to connect to existing nodes that already have a high degree. Thus, the probability P(ki) that a new node will be connected to node i depends on the degree ki of node i and is given by [42]:

$$
P(k_i) \sim \frac{k_i}{\sum_j k_j}
$$

The probability that a link points to a node with c connections is given by $cP(c)/\langle k \rangle$, where ⟨k⟩ denotes the average degree of connectivity. Thus, the average probability Θ(λ) of a link pointing to an infected node in a homogeneous network having the density of infected nodes ρk is computed as

$$
\Theta(\lambda) = \frac{1}{\langle k \rangle} \sum_k k\,P(k)\,\rho_k \tag{15}
$$

hence, the critical epidemic threshold λc is defined as

$$
\frac{1}{\langle k \rangle} \sum_k k\,P(k)\,\lambda_c\,k = 1 \;\Rightarrow\; \lambda_c = \frac{\langle k \rangle}{\langle k^2 \rangle} \tag{16}
$$

For homogeneous networks, in which the connectivity fluctuations are negligible, the total prevalence ρ(t) is defined in Ref. [43] as the density of infected nodes present at time t:

$$
\frac{d\rho(t)}{dt} = -\rho(t) + \lambda \langle k \rangle \rho(t)\,[1 - \rho(t)] \tag{17}
$$

5.1.2. Reproductive number. A fundamental parameter in epidemiology is the basic reproductive number R0 [44], which in most epidemic models has a specific value, the epidemic threshold, above which epidemics are possible, but below which epidemics cannot occur [45]. The reproductive ratio R0 is defined as the expected number of secondary infections caused by an initial infectious individual in a completely susceptible host population, and is related to the likelihood and extent of an epidemic. Under the assumption of a homogeneously mixing population, if an infected individual is in contact with ⟨k⟩ other individuals, the basic reproductive number is defined as

$$
R_0 = \frac{\lambda \langle k \rangle}{\mu}
$$

where λ is the spreading rate, defined as the probability that a susceptible individual in contact with an infectious individual will contract the disease, and µ is the removal rate of infected individuals, either to the susceptible or the recovered states. It is easy to see that an epidemic will spread across a susceptible fraction of the population only for R0 > 1. In this case the epidemic is able to generate a number of infected individuals larger than the number removed, leading to an increase of the infected individuals i(t) at time t following the exponential form

$$
i(t) \simeq i_0\, e^{t/T_d}
$$

Here i0 is the initial density of infected individuals and Td is the typical outbreak time expressed as [10]

$$
T_d = \frac{1}{\mu (R_0 - 1)}
$$

5.2. Epidemic spreading

The epidemic threshold is crossed when at least one contact of an infectious node becomes infected and makes contact with other susceptible nodes. A branching process will take place after the initial epidemic outbreak. The probability that each of the fixed number of contacts or viruses causes k infections is given by pk(t). Prior to the epidemic outbreak, the branching model ξ is a random variable having a binomial distribution with parameters p and n, so that

$$
P_\xi\{k \mid n, p\} = \binom{n}{k} p^k q^{\,n-k}, \quad q = 1 - p, \quad k = 0, \dots, n \tag{18}
$$

Let ξ(t) be the total number of infectious hosts at time t, representing a Markov process (continuous-time Markov chain). Suppose we have exactly k infectious hosts initially at time t = 0, and let ξi(t) be the number of infections generated by the ith infectious host after a time t. Then ξ(t) = ξ1(t) + ... + ξk(t), where the random variables ξ1(t), ..., ξk(t) are independent and have the same probability distribution

$$
P\{\xi_i(t) = n\} = p_n(t), \quad n = 0, 1, 2, \dots
$$

Let pkn(t) be the probability of k infectious hosts causing n new infections after time t, so that the numbers pkn(t) are the transition probabilities of the Markov process ξ(t). For the analysis of this Markov process, we assume a finite (or countable) state space I = {0, 1, 2, ...}. If the process is in a certain state j ∈ I at time t, then the probability of being in a different state k ∈ I at time t + Δt is expressed as

$$
p_k(\Delta t) = P\{\xi(t + \Delta t) = k \mid \xi(t) = j\}
$$

Since we are interested in the evolution of the worm spread for t > 0 distributed over small time steps, we need to determine the value of pk at the origin. From this we can find the rate of transition γ to the infected state. Assuming that pk is differentiable at 0, its derivative p'k(0) can be obtained as

$$
p'_k(0) = \lim_{\Delta t \to 0} \frac{P\{\xi(t + \Delta t) = k \mid \xi(t) = j\}}{\Delta t}
= \lim_{\Delta t \to 0} \frac{p_k(\Delta t) - p_k(0)}{\Delta t - 0} = \gamma_{jk}
$$

We can write this as

$$
P\{\xi(t + \Delta t) = k \mid \xi(t) = j\} = \gamma_{jk}\,\Delta t + o(\Delta t) \tag{19}
$$

where o(Δt) denotes an infinitesimal of higher order than Δt, having the property $\lim_{\Delta t \to 0} o(\Delta t)/\Delta t = 0$. The Markov property states that if we know the state ξ(t), then all additional information about ξ at times prior to t is irrelevant for the determination of future states. That is, for all k ≠ j, t0 < t1 < ... < tn < t, and x0, ..., xn ∈ I we have

$$
P\{\xi(t + \Delta t) = k \mid \xi(t) = j,\ \xi(t_i) = x_i\} = P\{\xi(t + \Delta t) = k \mid \xi(t) = j\} = \gamma_{jk}\,\Delta t + o(\Delta t), \quad i = 1, 2, \dots \tag{20}
$$
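The small-interval statement in Equation (19) can be illustrated numerically. The sketch below assumes, purely for illustration, that jumps follow a Poisson process with a hypothetical rate `gamma`, so that the probability of exactly one jump in Δt is γΔt·e^(−γΔt); the function names are not taken from the paper. Dividing by Δt should approach γ, while the probability of two or more jumps divided by Δt should vanish.

```python
import math


def prob_one_jump(gamma: float, dt: float) -> float:
    """P{exactly one jump in dt} for a Poisson process with rate gamma."""
    return gamma * dt * math.exp(-gamma * dt)


def prob_two_or_more_jumps(gamma: float, dt: float) -> float:
    """P{two or more jumps in dt}; this plays the role of the o(dt) remainder."""
    return 1.0 - math.exp(-gamma * dt) - prob_one_jump(gamma, dt)


gamma = 0.086  # hypothetical rate per time unit
for dt in (1.0, 0.1, 0.01, 0.001):
    one = prob_one_jump(gamma, dt) / dt             # tends to gamma
    many = prob_two_or_more_jumps(gamma, dt) / dt   # tends to 0
    print(f"dt={dt:<6} p1/dt={one:.6f} p(>=2)/dt={many:.2e}")
```

Shrinking `dt` by a factor of ten shrinks the second ratio by roughly the same factor, which is what "higher order than Δt" means in practice.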

Let us define γij as the rate at which the infection process enters state j from state i, which is also termed the transition intensity from i to j. We can define a Q-matrix, Q = (γij : i, j ∈ I), containing all information about the transitions within the state space of infection I. The Q-matrix of a Markov process has the following properties:
• all diagonal entries γii are non-positive, stating no change in the current state,
• all non-diagonal entries γij, i ≠ j, are non-negative, stating a transition from the current state,
• the sum over the entries in each row is zero.

The probability of the infection process undergoing a change of states in a small time interval Δt given by Equation (22) can be verified as follows. During random worm scans, transitions occur randomly in such a way that the probability of a single infection in the time interval (t, t + Δt] is p1(Δt) = γΔt + o(Δt), and the probability of producing more than one infection is o(Δt). That is, the probability of a single infectious host causing one new susceptible (or infected but not yet discovered) host in a small time interval Δt is

$$
p_1(\Delta t) = \gamma\,\Delta t + o(\Delta t)
$$

and the probability of not causing any infection by that host is

$$
p_0(\Delta t) = 1 - \gamma\,\Delta t + o(\Delta t)
$$

Occurrences in a Markov process during the time interval (t, t + Δt] are independent of the occurrences before time t. This and the above properties lead us to a general propagation process having Poisson characteristics with rate γ, i.e., assuming ξ(0) = 0 and n = 0, 1, 2, ..., we have

$$
p_n(t) = P\{\xi(t) = n\} = e^{-\gamma t}\,\frac{(\gamma t)^n}{n!} \tag{21}
$$

where ξ(t) is Poisson distributed with parameter γt.

Following this brief introduction, we can now build the Q-matrix containing the required transition rates for the infection process. Assume that the probability pk = P{i(t + Δt) = k | i(t) = j} of k infections in an infection event occurring as a Markov process is defined as

$$
p_k =
\begin{cases}
\gamma\,\Delta t + o(\Delta t) & \text{if } k = j + 1 \\
0 + o(\Delta t) & \text{if } k > j + 1 \\
0 & \text{if } k < j
\end{cases}
$$

Thus, we have the Q-matrix

$$
Q =
\begin{pmatrix}
-\gamma & \gamma & 0 & 0 & \cdots \\
0 & -\gamma & \gamma & 0 & \cdots \\
0 & 0 & -\gamma & \gamma & \cdots \\
\vdots & \vdots & \vdots & \ddots & \ddots
\end{pmatrix}
$$

Setting i(0) = 0, an analytical solution leads us to the general propagation formula (21), which is a Poisson process with rate γ:

$$
p_n(t) = P\{i(t) = n\} = e^{-\gamma t}\,\frac{(\gamma t)^n}{n!}, \quad \text{for } n = 0, 1, 2, \dots \tag{22}
$$

where γt = E[i(t)] gives the average number of nodes found infected in t time units.

5.3. Infection growth as a branching process

Following the initial epidemic outbreak, a worm reproduction process (branching) will start, which can be described as a pure birth process having the following properties:
• a single infection with probability p = γΔt + o(Δt),
• more than one infection with probability o(Δt),
• zero infections with probability q = 1 − γΔt + o(Δt).

Let i(t) = n be the number of infections at time t, each of which produces a new infection in Δt with the single-infection probability p above, and let q = 1 − p. Then the probability of no new infections at time t + Δt is

$$
P\{i(t + \Delta t) = n \mid i(t) = n\} = q^n = [1 - \gamma\,\Delta t + o(\Delta t)]^n = 1 - n\gamma\,\Delta t + o(\Delta t) \tag{23}
$$

The probability of one infection at time t + Δt, given i(t) = n, is

$$
P\{i(t + \Delta t) = n + 1 \mid i(t) = n\} = n\,p\,q^{n-1} = n[\gamma\,\Delta t + o(\Delta t)]\,q^{n-1} = n\gamma\,\Delta t\,[1 - \gamma\,\Delta t + o(\Delta t)]^{n-1} = n\gamma\,\Delta t + o(\Delta t) \tag{24}
$$

Finally, the probability of more than one (k > 1) infections at time t + Δt, given i(t) = n, is

$$
\begin{aligned}
P\{i(t + \Delta t) = n + k \mid i(t) = n\}
&= \binom{n}{k} p^k q^{n-k} + o(\Delta t) \\
&= \binom{n}{k} [\gamma\,\Delta t + o(\Delta t)]^k q^{n-k} + o(\Delta t) \\
&= \binom{n}{k} \gamma^k (\Delta t)^k [1 - \gamma\,\Delta t + o(\Delta t)]^{n-k} + o(\Delta t) \\
&= o(\Delta t)
\end{aligned}
\tag{25}
$$

Hence, i(t) is a Markov process having the following Q-matrix

$$
Q =
\begin{pmatrix}
-\gamma & \gamma & 0 & 0 & \cdots \\
0 & -2\gamma & 2\gamma & 0 & \cdots \\
0 & 0 & -3\gamma & 3\gamma & \cdots \\
\vdots & \vdots & \vdots & \ddots & \ddots
\end{pmatrix}
$$

Assuming γ1 = −γ and γ2 = γ, and i(0) = 0, it can be readily verified that the infection will evolve as a branching process described by

$$
p_n(t) = P\{i(t) = n\} = e^{-\gamma t}\,(1 - e^{-\gamma t})^{n-1}, \quad \forall (n \ge 1) \tag{26}
$$

As can be seen, i(t) is a geometrically distributed random process with parameter $e^{-\gamma t}$ as t → ∞. We can derive Equation (26) by use of the resolvent method as follows. Let s stand for the Laplace parameter needed for the approximation of exponential integrals. To proceed, we need to invert

$$
sI - Q =
\begin{pmatrix}
s + \gamma & -\gamma & 0 & 0 & \cdots \\
0 & s + 2\gamma & -2\gamma & 0 & \cdots \\
0 & 0 & s + 3\gamma & -3\gamma & \cdots \\
\vdots & \vdots & \vdots & \ddots & \ddots
\end{pmatrix}
$$

The coefficients r1n, n = 1, 2, ..., of the inverse matrix satisfy

$$
r_{11}(s + \gamma) = 1 \quad \text{and} \quad r_{1,n-1}\,[-(n-1)\gamma] + r_{1,n}\,(s + n\gamma) = 0, \quad \text{for } n \ge 2 \tag{27}
$$

By solving this we get

$$
r_{1n} = \frac{(n-1)\gamma}{s + n\gamma}\, r_{1,n-1} = \gamma^{\,n-1} (n-1)! \prod_{k=1}^{n} \frac{1}{s + k\gamma}
$$

We can use partial fractions to find ak, k = 1, ..., n, such that

$$
r_{1n} = \sum_{k=1}^{n} \frac{a_k}{s + k\gamma}
$$

Thus, we obtain

$$
\gamma^{\,n-1} (n-1)! = \sum_{k=1}^{n} a_k \prod_{j \ne k} (s + j\gamma)
$$

Substituting s = −γk gives

$$
\gamma^{\,n-1} (n-1)! = a_k \prod_{j \ne k} [\gamma (j - k)]
$$

Hence,

$$
a_k = \frac{(n-1)!}{\prod_{j \ne k} (j - k)} = (-1)^{k-1} \binom{n-1}{k-1}
$$

from which we have

$$
r_{1n} = \sum_{k=1}^{n} \binom{n-1}{k-1} (-1)^{k-1} \frac{1}{s + k\gamma}
$$

Inversion of the Laplace transform gives the probability of one infection giving rise to n offspring, p1n(t):

$$
p_{1n}(t) = \sum_{k=1}^{n} \binom{n-1}{k-1} (-1)^{k-1} e^{-k\gamma t}
= e^{-\gamma t} \sum_{k=0}^{n-1} \binom{n-1}{k} (-1)^{k} e^{-k\gamma t}
$$

$$
p_n(t) = e^{-\gamma t}\,(1 - e^{-\gamma t})^{n-1} \tag{28}
$$

A similar result is obtained in Ref. [25], which determines the total progeny of the branching process as a Borel-Tanner distribution, i.e.,

$$
P\{I = k\} = \frac{I_0\,(k\gamma)^{(k - I_0)}\, e^{-k\gamma}}{k\,(k - I_0)!} \tag{29}
$$

where I0 denotes the number of initially infected hosts, P{I = k} denotes the probability that the total number of hosts infected is k (varying as k = I0, I0 + 1, ...), and γ denotes the mean Poisson rate. The results of a branching process expressed by Equation (28) are shown in Figure 5. The branching rate γ is always initially high, and as the branching evolves over time the probability pn(t) of producing n new offspring diminishes exponentially, conforming to the behavior of the deterministic simple epidemic model. In Figure 5a, we have a constant branching rate and varying target offspring counts of 40, 60, and 100. Figure 5b shows the effect of the branching rate γ, which depends on the worm scan strategy. For convenience, we based our tests on uniform scanning worms.
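As a numerical cross-check of the Laplace inversion leading to Equation (28), the sketch below evaluates both the alternating partial-fraction sum and the closed geometric form and compares them. The function names and the sample values (γ = 0.086, t = 20) are illustrative assumptions chosen only to exercise the formulas.

```python
import math


def pn_closed_form(n: int, gamma: float, t: float) -> float:
    """Equation (28): p_n(t) = e^{-gamma t} (1 - e^{-gamma t})^{n-1}, n >= 1."""
    x = math.exp(-gamma * t)
    return x * (1.0 - x) ** (n - 1)


def pn_partial_fractions(n: int, gamma: float, t: float) -> float:
    """Inverse Laplace transform written as the alternating sum over k = 1..n."""
    return sum(
        math.comb(n - 1, k - 1) * (-1) ** (k - 1) * math.exp(-k * gamma * t)
        for k in range(1, n + 1)
    )


gamma, t = 0.086, 20.0  # hypothetical branching rate and observation time
for n in (1, 2, 5, 10):
    print(n, pn_closed_form(n, gamma, t), pn_partial_fractions(n, gamma, t))

# The geometric probabilities should sum to (almost) one over n = 1, 2, ...
total = sum(pn_closed_form(n, gamma, t) for n in range(1, 2000))
```

The alternating sum is kept to small n here because it suffers from cancellation for large n; the closed form is the numerically safer expression to use in simulations.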

[Figure 5. The initial worm outbreak modeled as a branching process. Panel (a): branching with constant γ = 0.086 for n = 40, 60, and 100. Panel (b): branching with constant n = 40 for γ = 0.05, 0.07, and 0.09. Both panels plot pn(t) against time.]

5.3.1. Annihilation of the infection growth. A single branching process can be modeled as an M/M/1 queue, which is stochastic, having exponential branching rate γ and exponential annihilation rate µ. That is, infections at a node occur with rate γ and removal of the infections occurs with rate µ. Let i(t) be the number of infected nodes in the queue, each undergoing an annihilation process with rate µ. Annihilation is a process where only a pair of nodes (the attacker and the victim) are involved at a time. A worm attacks a susceptible node, the two annihilate each other, and if the susceptible node gets infected it launches one or more uniform scans against a set of other nodes. The scan of other nodes will be broken during an annihilation process if the target node does not get infected. A related Markovian theorem states that if A and B are independent exponentially distributed random variables with rates φ and ϕ, respectively, then their minimum m(A, B) is also exponentially distributed with rate φ + ϕ, and it is independent of the event A = m(A, B) and vice versa. Hence, it is stated that

$$
P\{m(A, B) = A\} = \frac{\phi}{\phi + \varphi}
\quad \text{and} \quad
P\{m(A, B) = B\} = \frac{\varphi}{\phi + \varphi}
$$

where $\phi$ and $\varphi$ stand for the rates φ and ϕ above. Then, the Q-matrix of this Markov process becomes

$$
Q =
\begin{pmatrix}
-\gamma & \gamma & 0 & 0 & \cdots \\
\mu & -(\gamma + \mu) & \gamma & 0 & \cdots \\
0 & \mu & -(\gamma + \mu) & \gamma & \cdots \\
\vdots & \vdots & \vdots & \ddots & \ddots
\end{pmatrix}
$$

This indicates that, when there is at least one infected host in the queue, the distribution of the transition (jump) is exponential with rate γ + µ, and the probability that the queue is becoming shorter is µ/(γ + µ).

Now, assuming the same single infection that is being annihilated with probability µΔt + o(Δt) in a small time interval Δt, the probability of n infections being annihilated can be found by first setting γ0 = µ, γ1 = −(γ + µ), γ2 = γ, ∀(n ≥ 0):

$$
p_n(t) = (1 - \gamma)(1 - \mu)\,\gamma^{\,n-1} \tag{30}
$$

where

$$
p_0(t) = \mu \times
\begin{cases}
\dfrac{1 - e^{(\gamma - \mu)t}}{\mu - \gamma\, e^{(\gamma - \mu)t}} & \text{if } \gamma \ne \mu \\[2ex]
\dfrac{t}{1 + \gamma t} & \text{if } \gamma = \mu
\end{cases}
\tag{31}
$$

The results of the annihilation of a branching process are illustrated in Figure 6. As can be clearly seen from Equation (31), the annihilation is dominated by the value of the annihilation rate µ and the branching rate γ combined with the number of offspring. As shown in Figure 6, the probability of the proliferation of infected nodes intensifies initially but damps slowly over the course of time. The total extinction probability p0 of the branching process having growth rate γ and recovery rate µ, obtained as the limit of Equation (31) for t → ∞, can be easily verified to be

$$
p_0 =
\begin{cases}
\dfrac{\mu}{\gamma} & \text{if } \mu < \gamma \\[1.5ex]
1 & \text{if } \mu > \gamma
\end{cases}
$$
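The time-dependent extinction probability of Equation (31) and its long-run limit can be sketched as follows. This is a minimal illustration that assumes a single initial infection; the function names and the rate values used in the example are hypothetical.

```python
import math


def extinction_by_time(gamma: float, mu: float, t: float) -> float:
    """p_0(t) of Equation (31) for one initial infection."""
    if math.isclose(gamma, mu):
        return mu * t / (1.0 + gamma * t)
    e = math.exp((gamma - mu) * t)
    return mu * (1.0 - e) / (mu - gamma * e)


def ultimate_extinction(gamma: float, mu: float) -> float:
    """Limit of p_0(t) for large t: mu/gamma if mu < gamma, otherwise 1."""
    return mu / gamma if mu < gamma else 1.0


# Hypothetical rates: branching three times faster than annihilation
p_early = extinction_by_time(0.09, 0.03, 10.0)
p_late = extinction_by_time(0.09, 0.03, 500.0)
p_limit = ultimate_extinction(0.09, 0.03)
```

For very large (γ − µ)t the exponential overflows in floating point, so a production version would switch to the limiting value beyond some cutoff; the sketch leaves that out for brevity.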

[Figure 6. The annihilation of a worm outbreak modelled as a branching process. Panel (a): annihilation with constant µ = 0.001 and γ = 0.099 for n = 40, 60, and 100. Panel (b): annihilation with constant n = 40 for (γ = 0.99, µ = 0.001), (γ = 0.05, µ = 0.005), and (γ = 0.9, µ = 0.01). Both panels plot pn(t) against time.]

5.4. Infected-to-quarantine state

Following the discovery of the infections, infected hosts are delivered to the quarantine service for diagnosis and disinfection operations. We assume here that we initially have a very large number of nodes with a considerably low probability of infection. Thus, the initial process uses the random variable κ for Poisson arrivals of infected hosts, assumed both for the queueing and quarantine processes:

$$
P\{\xi(t) = k\} = \frac{\kappa^k}{k!}\, e^{-\kappa}, \quad k = 0, 1, 2, \dots \tag{32}
$$

As before, κ = Eξ(t) denotes the average number of quarantined nodes. Arrivals at the quarantine service form a Poisson process, so the times between arrivals are exponentially distributed with parameter κ. Suppose that the quarantine requests for infected nodes occur randomly in the course of time, and let ξ(Δ) be the number of requests occurring during the time interval Δ. Then, by first determining the distribution of the random variable ξ(Δ), we can derive a probability distribution function for the quarantine process. Since the request arrivals for quarantine are independent of one another, we can split the given time slot into the non-overlapping intervals Δ1, Δ2, ..., Δn with arrival counts ξ(Δ1), ξ(Δ2), ..., ξ(Δn). The probability that at least one arrival occurs in a small time interval Δt is κΔt + o(Δt), while the probability that more than one arrival occurs in Δt is o(Δt). Here, the parameter κ denotes the rate of the arrivals at the quarantine service. Let ξ(t) be the total number of arrivals occurring in the interval (0, t]. Dividing (0, t] into n equal parts Δ1, ..., Δn, we get

$$
\xi(t) = \sum_{k=1}^{n} \xi(\Delta_k)
$$

where ξ(Δ1), ..., ξ(Δn) are independent random variables, of which ξ(Δk) is the number of arrivals occurring in the interval Δk. The probability generating function of each random variable ξ(Δk) is

$$
G_n(z) = \left(1 - \frac{\kappa t}{n}\right) + \frac{\kappa t}{n}\, z + o\!\left(\frac{t}{n}\right)
$$

where, again, o(t/n) is a term of order higher than t/n. Since ξ(Δ1), ..., ξ(Δn) are independent random variables taking the values 0, 1, 2, ..., the random variables $z^{\xi(\Delta_1)}, \dots, z^{\xi(\Delta_n)}$, where z is a fixed number, are also independent. It follows from the product formula for the mathematical expectation of independent variables, E[ξ(Δ1)ξ(Δ2)] = E[ξ(Δ1)]E[ξ(Δ2)], and setting ξi = ξ(Δi) as a shorthand notation, that

$$
E z^{(\xi_1 + \xi_2 + \dots + \xi_n)} = E z^{\xi_1}\, E z^{\xi_2} \cdots E z^{\xi_n}
$$

Thus we have the formula

$$
G_\xi(z) = G_{\xi_1}(z)\, G_{\xi_2}(z) \cdots G_{\xi_n}(z) \tag{33}
$$

expressing the generating function $G_\xi(z) = E z^{\xi}$ of the sum ξ = ξ1 + ... + ξn of the n random variables ξ1, ..., ξn in terms of the generating functions $G_{\xi_k}(z) = E z^{\xi_k}$ of the n separate summands. Hence, by combining Equation (33) with the above explanation, the generating function of ξ(t) is

$$
G(z) = [G_n(z)]^n = \left[1 - \frac{\kappa t}{n} + \frac{\kappa t}{n}\, z + o\!\left(\frac{t}{n}\right)\right]^n
= \left[1 + \frac{\kappa t (z - 1)}{n} + o\!\left(\frac{t}{n}\right)\right]^n
$$

Since G(z) is independent of the subintervals Δ1, ..., Δn, we can take the limit as n → ∞, achieving

$$
G(z) = \lim_{n \to \infty} \left[1 + \frac{\kappa t (z - 1)}{n} + o\!\left(\frac{t}{n}\right)\right]^n = e^{\kappa t (z - 1)} \tag{34}
$$

This is the generating function of a Poisson distribution with parameter κt, so that

$$
P\{q(t) = k\} = \frac{(\kappa t)^k}{k!}\, e^{-\kappa t}, \quad k = 0, 1, 2, \dots \tag{35}
$$

Since κt = i0 Eξ(t), the parameter κ is the average number of arrivals of the infected nodes at the quarantine queue occurring per unit time. Here, i0 denotes the number of infections initially present. The quarantine process described by Equation (35) converges to a binomial process for steady state and, in general, for smaller network sizes with relatively higher probability of infections,

$$
P\{q(t) = k\} = \binom{Q}{k}\, p(t)^k\, [1 - p(t)]^{Q-k}, \quad k = 0, 1, \dots, Q \tag{36}
$$

Here, p denotes the probability of finding an infected node in a series of identical experiments, and Q denotes the number of consecutive Bernoulli trials in which precisely k infectious nodes are found.

5.5. Transition probabilities of quarantine-to-repair state

The quarantine process deals with diagnosis and disinfection of the infected computers, leaving the repair to an overall recovery process (next section) by the repair service. Requests for recovery of the quarantined nodes are modeled as a random flow of service calls arriving at the repair center, where the incoming traffic is of the Poisson type described above, with average density κ. It generally takes a random time to repair an infected node. Also considering the incoming service requests of the Poisson type, the repair center has exponential arrival rate α and departure rate µ. The random flow of service requests arriving at the quarantine queue for repair presents an exponentially distributed arrival with density α. Thus, αΔt + o(Δt) is the probability that at least one call (infected host) arrives in the small time interval Δt. We assume that the random repair time T for each incoming repair request at time t has an exponential distribution with parameter µ, i.e.,

$$
P\{T > t\} = e^{-\mu t} \tag{37}
$$

The repairman (or system) has two states, free s0 and busy s1. Suppose that the system is in the free state at time t0; then its subsequent behavior does not depend on its previous history, since the jobs arrive independently. The probability p01(Δt) of the system going from state s0 to state s1 during a small interval of time Δt is αΔt + o(Δt). Hence, the rate of the transition from s0 to s1 equals α. Now, suppose the system is in the busy state s1; then the probability p10(Δt) of the system going from the busy state s1 to the free state s0 after a time Δt is just the probability that the repair service will fail to last another Δt time units. Thus, suppose that at time t the service has already been in progress for exactly τ time units; then

$$
p_{10}(\Delta t) = 1 - P\{T > \tau + \Delta t \mid T > \tau\} = 1 - \frac{P\{T > \tau + \Delta t\}}{P\{T > \tau\}} \tag{38}
$$

Hence, regarding Equation (37), we get the transition rate out of the repair state as

$$
p_{10}(\Delta t) = 1 - \frac{e^{-\mu(\tau + \Delta t)}}{e^{-\mu \tau}} = 1 - e^{-\mu\,\Delta t} \tag{39}
$$

As already noticed, the system can be described by a Markov process satisfying the conditions

$$
p_{01}(t) = 1 - p_{00}(t), \quad p_{10}(t) = 1 - p_{11}(t) \tag{40}
$$

where p00(t) denotes the probability of staying in the free state and p11(t) denotes the probability of staying in the busy state. Moreover, let

$$
\alpha_{00} = -\alpha, \quad \alpha_{01} = \alpha, \quad \alpha_{10} = \mu, \quad \alpha_{11} = -\mu \tag{41}
$$

and

$$
p_{ij}(0) =
\begin{cases}
1 & \text{if } j = i \\
0 & \text{if } j \ne i
\end{cases}
\tag{42}
$$

Assume that the rate of the transition out of state si is denoted by αi and that from state si to state sj is denoted by αij. That is, during such a transition, the number of entities i changes to j. Since completions of the service times are exponentially distributed with parameter µ, from Equation (39) we have

$$
p_{10}(\Delta t) = 1 - e^{-\mu\,\Delta t} = \mu\,\Delta t + o(\Delta t) \tag{43}
$$
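The memoryless step from Equation (38) to Equations (39) and (43) can be checked with a short Python sketch. The function names and the values of `mu`, `dt`, and the elapsed times are hypothetical; the point is only that the result does not depend on how long the repair has already been running, and that it is close to µΔt for small Δt.

```python
import math


def survival(mu: float, t: float) -> float:
    """P{T > t} = e^{-mu t} for an exponential repair time, Equation (37)."""
    return math.exp(-mu * t)


def p10(mu: float, tau: float, dt: float) -> float:
    """Equation (38): a repair in progress for tau time units ends within dt."""
    return 1.0 - survival(mu, tau + dt) / survival(mu, tau)


mu = 0.5   # hypothetical repair rate
dt = 0.01  # small time step
values = [p10(mu, tau, dt) for tau in (0.0, 1.0, 10.0)]  # same for every tau
```

The gap between `p10` and `mu * dt` is of order (µΔt)², which is the o(Δt) term in Equation (43).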

Hence, the transition probabilities pij are such that

$$
\begin{aligned}
1 - p_{ii}(\Delta t) &= \alpha_i\,\Delta t + o(\Delta t), & i &= 1, 2, \dots \\
p_{ij}(\Delta t) &= \alpha_{ij}\,\Delta t + o(\Delta t), & j &\ne i,\ \forall (i, j \ge 1)
\end{aligned}
\tag{44}
$$

and αii = −αi. Then, considering the conditions given by (41) and Equation (42), the transition probabilities satisfy two sets of linear differential equations, i.e., the forward (FKE) and backward (BKE) Kolmogorov equations, respectively:

$$
\begin{aligned}
\text{FKE}: \quad p'_{ij}(t) &= \sum_k p_{ik}(t)\,\alpha_{kj} \\
\text{BKE}: \quad p'_{ij}(t) &= \sum_k \alpha_{ik}\,p_{kj}(t), \quad \forall (i, j \ge 1)
\end{aligned}
\tag{45}
$$

Hence, using Equation (40) and definitions (41), Equation (45) will evolve as

$$
\begin{aligned}
p'_{00}(t) &= \alpha_{00}\,p_{00}(t) + \alpha_{10}\,p_{01}(t) = -\alpha\,p_{00}(t) + \mu\,[1 - p_{00}(t)] \\
p'_{11}(t) &= \alpha_{01}\,p_{10}(t) + \alpha_{11}\,p_{11}(t) = \alpha\,[1 - p_{11}(t)] - \mu\,p_{11}(t)
\end{aligned}
\tag{46}
$$

i.e., again from Equations (40), (41), and (46) we have

$$
\begin{aligned}
\mu &= p'_{00}(t) + (\alpha + \mu)\,p_{00}(t) \\
\alpha &= p'_{11}(t) + (\alpha + \mu)\,p_{11}(t)
\end{aligned}
\tag{47}
$$

Assuming the initial conditions p00(0) = p11(0) = 1, we can solve Equation (47) to give the probabilities, which converge to their steady state values as t grows:

$$
\begin{aligned}
p_{00}(t) &= \left(1 - \frac{\mu}{\alpha + \mu}\right) e^{-(\alpha + \mu)t} + \frac{\mu}{\alpha + \mu} \\
p_{11}(t) &= \left(1 - \frac{\alpha}{\alpha + \mu}\right) e^{-(\alpha + \mu)t} + \frac{\alpha}{\alpha + \mu}
\end{aligned}
\tag{48}
$$

Defining

$$
\sigma = \alpha + \mu, \quad \zeta = \frac{\mu}{\alpha + \mu}, \quad \rho = \frac{\alpha}{\alpha + \mu} \tag{49}
$$

we write the state transition probabilities as

$$
\begin{aligned}
p_{00}(t) &= (1 - \zeta)\,e^{-\sigma t} + \zeta \\
p_{11}(t) &= (1 - \rho)\,e^{-\sigma t} + \rho \\
p_{01}(t) &= 1 - [(1 - \zeta)\,e^{-\sigma t} + \zeta] = 1 - p_{00}(t) \\
p_{10}(t) &= 1 - [(1 - \rho)\,e^{-\sigma t} + \rho] = 1 - p_{11}(t)
\end{aligned}
\tag{50}
$$

5.5.1. Steady state of the quarantine queue. We assume that, due to having a single repairman, the quarantine process works as a birth-death (Markov) process, where birth and death represent the arrival at and departure from the quarantine service, respectively. Conditions and expressions for the case of multiple repairmen can be easily derived based on the Markovian approach. The process is specified by birth rates α (entering quarantine) and death rates µr (leaving quarantine while entering recovery). Let pn denote the equilibrium probability of state n. Since the state (number in the system) transitions follow a Markovian process, their flow rates must balance in a steady state, and similarly for each pair of states:

$$
\alpha\,p_{n-1} = \mu_r\,p_n, \quad \alpha\,p_n = \mu_r\,p_{n+1}
$$

so that

$$
p_n = \frac{\alpha}{\mu_r}\,p_{n-1} \tag{51}
$$

The probability of having n machines under disinfection is denoted by pn. The recurrence formula of moving from pn−1 to pn can be used to determine the steady state probability space:

$$
\begin{aligned}
p_1 &= \frac{\alpha}{\mu_r}\,p_0 \\
p_2 &= \frac{\alpha}{\mu_r}\,p_1 = \left(\frac{\alpha}{\mu_r}\right)^2 p_0 \\
&\ \ \vdots \\
p_n &= \left(\frac{\alpha}{\mu_r}\right)^n p_0
\end{aligned}
\tag{52}
$$

Since the steady state probabilities must sum to 1, and the geometric series converges for α < µr, we have

$$
1 = \sum_{n=0}^{\infty} p_n = \sum_{n=0}^{\infty} \left(\frac{\alpha}{\mu_r}\right)^n p_0 = p_0 \sum_{n=0}^{\infty} \left(\frac{\alpha}{\mu_r}\right)^n = \frac{p_0}{1 - \frac{\alpha}{\mu_r}} \tag{53}
$$

and thus,

$$
p_0 = 1 - \frac{\alpha}{\mu_r}
$$

which is the proportion of the time the system is idle. Thus, for n > 0 we obtain the probability pn that there are n machines under quarantine (queue + disinfection) in the system as

$$
p_n = \left(\frac{\alpha}{\mu_r}\right)^n p_0 = \left(\frac{\alpha}{\mu_r}\right)^n \left(1 - \frac{\alpha}{\mu_r}\right)
$$

5.6. Recovered-to-healthy state

Assume that many hosts are now in the recovery queue awaiting repair (i.e., OS patch and update) operations. Following disinfection, the necessary patches and updates of the operating system and of other applications running on the system must be applied before the host is returned to service. Total repair times are exponentially distributed random time intervals. Assuming that each host waits for $T$ random time units and requires another random time $v$ to be repaired, $P\{T \geq v\} = e^{-\mu_h v}$ is used to determine the probability of the total time required for the recovery (repair) operation. Repairs form a Poisson process, so the times between repairs (sojourn times), $S$, are exponentially distributed with parameter $\mu_h$, which is the average flow rate from the recovered to the healthy state. Thus, the continuous random variable $S$ has the probability density function

$$f_S(t) = \mu_h\,e^{-\mu_h t} \tag{54}$$

and the mean time between repairs is

$$E(S) = \frac{1}{\mu_h}$$

Utilization of the repair system, defined as the proportion of time the repair system is busy, is given by $\rho = 1 - p_0$, where $p_0$ is the proportion of time the system is idle. In the steady state, the rate at which machines arrive at the repair service, $\mu_r$, equals the rate at which repaired machines leave it. Since the service completes repairs at rate $\mu_h$ only while it is busy, queueing theory gives the following balance equations (rate-in = rate-out):

$$
\begin{aligned}
\mu_r &= 0 \cdot p_0 + \mu_h\,(1 - p_0) \\
p_0 &= 1 - \frac{\mu_r}{\mu_h} \\
\rho &= \frac{\mu_r}{\mu_h}
\end{aligned}
$$

This type of random process can be modeled as a birth-death process, a special case of Markov chains in which the states represent the current size of the population and the transitions are limited to births and deaths. In the REM, the arrival of a disinfected machine at the repair service is a birth, and the departure of a repaired and updated machine is a death. The transition probabilities of the general birth-death process are given by:

$$
p_{ij} =
\begin{cases}
\dfrac{\lambda_i}{\mu_i + \lambda_i}, & j = i + 1 \ \text{(birth = arrival)} \\[2ex]
\dfrac{\mu_i}{\mu_i + \lambda_i}, & j = i - 1 \ \text{(death = departure)} \\[2ex]
0, & j = i \neq 0 \ \text{(no change = under repair)} \\[1ex]
1, & i = j = 0 \ \text{(state 0 is absorbing)}
\end{cases}
$$

where $\lambda_i$ and $\mu_i$ denote the $i$th birth and death rates, respectively. For the system to reach $i + 1$ from $i$ for the first time, it must either do so on the first transition out of state $i$, or else drop back to $i - 1$, return to $i$, and try again. The arrival rate from quarantine, $\mu_r$, and the departure rate to healthy, $\mu_h$, determine the final repair rate of the repair facilities. The nodes are repairable systems with multiple states: up, down, and repair. Consider a network with $m$ machines maintained by $s$ repairmen. The time to failure (up time) of machine $i$ and its repair time are exponentially distributed random variables with rates $\lambda$ and $\mu$, respectively. This is the classical machine repairmen problem [46-48]. Assume that each repair occurs as a Markovian process at rate $\mu \min[N(t), R(t)]$, where $N(t)$ is the number of machines failed and awaiting repair at time $t$, and $R(t) = s$ is the number of repairmen on duty. This is a simple birth-death Markov process in which jumps in the state $N(t)$ occur at exponentially distributed time intervals:

$$
\begin{aligned}
P\{N(t + \Delta t) = i + 1 \mid N(t) = i\} &= (m - i)\,\lambda\,\Delta t + o(\Delta t) = \lambda_i\,\Delta t + o(\Delta t) \\
P\{N(t + \Delta t) = i - 1 \mid N(t) = i\} &= \min(i, s)\,\mu\,\Delta t + o(\Delta t) = \mu_i\,\Delta t + o(\Delta t)
\end{aligned} \tag{55}
$$

where $\lambda_i = \mu_r$ (rate to recovery) and $\mu_i = \mu_h$ (rate to health) are the general arrival and departure rates in state $i$. Further reading can be found in Refs. [49,50]: [49] contains a comprehensive treatment of stochastic models related to system performance and reliability, and [50] presents a Markov model for multistate repairable systems. Consequently, the transition rate to the healthy state given by Equation (55) is in accordance with Equations (37), (38), (39), and (40):

$$p_{10}(t) = 1 - p_{11}(t) = 1 - e^{-\mu_h t} = \mu_h t + o(t) \tag{56}$$
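The state-dependent birth and death rates of the machine repairmen model determine a stationary distribution through the usual detailed-balance recursion $p_{i+1}\,\mu_{i+1} = p_i\,\lambda_i$. The following is a minimal Python sketch of that recursion, assuming a failure rate proportional to the number of working machines and a repair rate of $\mu \min(i, s)$; the function names and the example values (100 machines, 4 repairmen, `lam = 0.01`, `mu = 1.0`) are illustrative assumptions, not values taken from the paper.

```python
def machine_repair_stationary(m, s, lam, mu):
    """Stationary distribution of the number of failed machines.

    State i = number of failed machines, 0 <= i <= m.
    Birth rate  lambda_i = (m - i) * lam    (a working machine fails)
    Death rate  mu_i     = min(i, s) * mu   (a busy repairman finishes)
    """
    weights = [1.0]  # unnormalised weight of state 0
    for i in range(m):
        birth = (m - i) * lam
        death = min(i + 1, s) * mu
        # detailed balance: p_{i+1} * mu_{i+1} = p_i * lambda_i
        weights.append(weights[-1] * birth / death)
    total = sum(weights)
    return [w / total for w in weights]


def mean_failed(dist):
    """Expected number of failed machines under the distribution `dist`."""
    return sum(i * p for i, p in enumerate(dist))


if __name__ == "__main__":
    # Hypothetical example values, chosen only for illustration.
    dist = machine_repair_stationary(m=100, s=4, lam=0.01, mu=1.0)
    print(f"P(no machine down) = {dist[0]:.4f}")
    print(f"mean number down   = {mean_failed(dist):.3f}")
```

Because the state space is finite, the recursion needs no stability condition; the normalisation step plays the role of the sum-to-one constraint used for the quarantine queue.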

5.6.1. Throughput of repair facilities. The capacity of the recovery system depends mainly on the number of repairmen; whether a single repairman or several are assigned dominates the overall efficiency of the recovery service. For the case of a single repairman and $n$ repair requests, the overall throughput ratio is

$$\rho^{n} = \left(\frac{\mu_r}{\mu_h}\right)^{n} \tag{57}$$

and therefore, from queueing theory, the probability $p_n$ of having $n$ machines under repair is

$$p_n = \rho^{n} p_0, \qquad \text{for } n = 1, 2, \ldots \tag{58}$$

where, implicitly,

$$
p_0 = \left[1 + \sum_{n=1}^{\infty} \rho^{n}\right]^{-1}
    = \left[\sum_{n=0}^{\infty} \rho^{n}\right]^{-1}
    = \left[\frac{1}{1 - \rho}\right]^{-1}
    = (1 - \rho) \tag{59}
$$

Thus, the steady-state probability for state $n$ is

$$p_n = \rho^{n} p_0 = \rho^{n} (1 - \rho), \qquad \text{for } n = 1, 2, \ldots \tag{60}$$

Regarding the case of multiple repairmen, the overall throughput ratio factor is

$$\rho_n = \frac{(\mu_r/\mu_h)^{n}}{n!}, \qquad \text{for } n = 0, 1, \ldots, s \tag{61}$$

where $s$ is the number of repairmen available, and

$$
\rho_n = \frac{(\mu_r/\mu_h)^{s}}{s!} \left(\frac{\mu_r}{s\mu_h}\right)^{n-s}
       = \frac{(\mu_r/\mu_h)^{n}}{s!\,s^{\,n-s}}, \qquad n = s, s + 1, \ldots \tag{62}
$$

Assuming $\mu_r < s\mu_h$, which is often the desired case, then

$$
p_0 = \left[\sum_{n=0}^{s-1} \frac{(\mu_r/\mu_h)^{n}}{n!} + \frac{(\mu_r/\mu_h)^{s}}{s!\,[1 - (\mu_r/s\mu_h)]}\right]^{-1} \tag{63}
$$

Thus, the probability $p_n$ is

$$
p_n =
\begin{cases}
\dfrac{(\mu_r/\mu_h)^{n}}{n!}\,p_0, & \text{if } 0 \leq n \leq s \\[2ex]
\dfrac{(\mu_r/\mu_h)^{n}}{s!\,s^{\,n-s}}\,p_0, & \text{if } n \geq s
\end{cases}
$$

Based on the Erlang distribution and setting $\rho_s = \mu_r/s\mu_h$, we get the expected number of machines in the repair system (queue + repair) for a multiple-repair facility as

$$L = \frac{p_0\,(\mu_r/\mu_h)^{s}\,\rho_s}{s!\,(1 - \rho_s)^{2}} + \frac{\mu_r}{\mu_h} \tag{64}$$

Obviously, the first term,

$$L_q = \frac{p_0\,(\mu_r/\mu_h)^{s}\,\rho_s}{s!\,(1 - \rho_s)^{2}} \tag{65}$$

gives the expected queue length waiting for repair. Figure 7 shows the simulation results of three experiments using random arrival and recovery times with a repair capacity of four repairmen. Initially we have M = 100 machines, N = 98 of which are up and running, so that two repairmen are idle and two immediately begin recovering the failed machines at time t = 0. The mean arrival rate at the recovery queue is represented by λ, and the mean recovery time by µ time units.

[Figure 7: plot titled "M = 100, N = 98, Repairmen = 4"; curves for λ = 0.5, µ = 8; λ = 1, µ = 10; and λ = 1, µ = 4; horizontal axis: Time.]

Figure 7. A service facility with four repairmen, where the infection and repair rates are random.
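The closed-form expressions of Section 5.6.1 for the idle probability, the state probabilities, and the expected numbers of machines in the queue and in the system can be evaluated directly. The sketch below is one way to do so in Python; the function names `mms_metrics` and `prob_n` and the example rates are assumptions made for illustration only.

```python
from math import factorial


def mms_metrics(mu_r, mu_h, s):
    """Return (p0, Lq, L) for arrival rate mu_r, repair rate mu_h, s repairmen."""
    a = mu_r / mu_h   # offered load, mu_r / mu_h
    rho_s = a / s     # per-repairman utilisation, mu_r / (s * mu_h)
    if rho_s >= 1.0:
        raise ValueError("unstable queue: need mu_r < s * mu_h")
    head = sum(a**n / factorial(n) for n in range(s))
    tail = a**s / (factorial(s) * (1.0 - rho_s))
    p0 = 1.0 / (head + tail)                                       # Equation (63)
    lq = p0 * a**s * rho_s / (factorial(s) * (1.0 - rho_s) ** 2)   # Equation (65)
    return p0, lq, lq + a                                          # L, Equation (64)


def prob_n(n, mu_r, mu_h, s):
    """Steady-state probability of n machines in the repair system."""
    p0, _, _ = mms_metrics(mu_r, mu_h, s)
    a = mu_r / mu_h
    if n <= s:
        return a**n / factorial(n) * p0
    return a**n / (factorial(s) * s ** (n - s)) * p0


if __name__ == "__main__":
    # Hypothetical rates: 3 arrivals and 1 repair per repairman per time unit.
    p0, lq, l = mms_metrics(mu_r=3.0, mu_h=1.0, s=4)
    print(f"p0 = {p0:.4f}, Lq = {lq:.4f}, L = {l:.4f}")
```

With `s = 1` the same functions reduce to the single-repairman results, $p_0 = 1 - \rho$ and $p_n = \rho^n (1 - \rho)$.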

6. EXPERIMENTS

We have set up and run two sets of experiments, configured to represent systems with constant and random flow rates, respectively. In this combined failure and repair model, failed machines arrive at the quarantine service in a random fashion; arrivals enter only at discrete times 0, 1, . . . , k, and service completions occur only at such times. Following the earlier discussion, changes in the flow densities can briefly be expressed as

$$
\begin{aligned}
H \to S &: \quad \beta(t + 1) = \beta(t) \times \delta_s(t) \\
S \to I &: \quad \gamma(t + 1) = \gamma(t) \times \delta_i(t) \\
I \to Q &: \quad \kappa(t + 1) = \kappa(t) \times \delta_q(t) \\
Q \to R &: \quad \alpha(t + 1) = \alpha(t) \times \delta_r(t) \\
R \to H &: \quad \mu(t + 1) = \mu(t) \times \delta_h(t)
\end{aligned}
$$

The evolution of the propagation and extinction processes of random scan worms is mathematically justified in Section 5. The results of the experiments and simulations agree with the theoretical analysis given there. Figures 8 and 9 illustrate the results of the experiments based on constant arrival and departure rates within each state of the recurrent model. The rates of change for a network of 100 nodes with the initial values $\beta = 0.21$, $\overleftarrow{s} = 0.03$, $\gamma = 0.76$, $\kappa = 0.62$, $\alpha = 0.83$, and $\mu = 0.88$ are shown in Figure 8a. Figure 8b illustrates the effect of increased susceptibility and constant latent times in the same experiment, with the infection, quarantine, and recovery processes of each state changed to behave in a slightly random manner. The parameters for this experiment are $\beta = 0.81$, $\overleftarrow{s} = 0.03$, $\gamma = 0.76$, $\kappa = 0.62$, $\alpha = 0.83$, and $\mu = 0.88$. Changing the infection, quarantine, and recovery parameters causes significant changes in the states. In particular, the inclusion of delay times in passages (latent times) leads to a cascade of latencies between the five states.

[Figure 8: two panels, (a) "Lower Susceptibility & Zero Latent Times" and (b) "Higher Susceptibility & Random Process Times"; curves: Susceptible, Infected, Quarantined, Recovered; vertical axis: Number of Machines; horizontal axis: TIME.]

Figure 8. The effect of constant arrival and departure rates; (a) lower susceptibility, (b) higher susceptibility, random infection, quarantine and recovery processes for a network of 100 nodes.

The result of the experiment with initial parameters $\beta = 0.86$, $\overleftarrow{s} = 0.028$, $\gamma = 0.76$, $\kappa = 0.63$, $\alpha = 0.83$, and $\mu = 0.88$ is depicted in Figure 9. Here, we examine the effect of increased latent time before a machine becomes infectious, i.e., we have increased the infectious-to-quarantine and quarantine-to-recovery delays to monitor the overall effect.

[Figure 9: plot titled "Random Passage (Latent) Times"; curves: Susceptible, Infected, Quarantined, Recovered; vertical axis: Number of Machines; horizontal axis: TIME.]

Figure 9. The result of inserted latent time for infection, and effect of higher delays for quarantine and recovery services; $\beta = 0.86$, $\overleftarrow{s} = 0.028$, $\gamma = 0.76$, $\kappa = 0.63$, $\alpha = 0.83$ and $\mu = 0.88$.
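The flow-density update rules given at the start of Section 6 multiply each transition rate by a coefficient δ at every discrete time step. A minimal Python sketch of that update is shown below. The initial rates are the values reported for Figure 8a; the δ coefficients are not listed in this section, so the constant choice δ = 1 (which keeps every rate unchanged) is a hypothetical placeholder, as are all function names.

```python
# Initial flow rates reported for Figure 8a (beta, gamma, kappa, alpha, mu).
INITIAL_RATES = {
    "H->S": 0.21,  # beta
    "S->I": 0.76,  # gamma
    "I->Q": 0.62,  # kappa
    "Q->R": 0.83,  # alpha
    "R->H": 0.88,  # mu
}


def step(rates, deltas):
    """Apply rate(t + 1) = rate(t) * delta(t) to every transition."""
    return {name: rates[name] * deltas[name] for name in rates}


def simulate(initial_rates, delta_at, n_steps):
    """Return the list of rate dictionaries for t = 0 .. n_steps."""
    history = [dict(initial_rates)]
    for t in range(n_steps):
        history.append(step(history[-1], delta_at(t)))
    return history


def constant_deltas(t):
    # Hypothetical choice: delta = 1 leaves every rate unchanged (constant flow).
    return {name: 1.0 for name in INITIAL_RATES}


if __name__ == "__main__":
    for t, rates in enumerate(simulate(INITIAL_RATES, constant_deltas, 3)):
        print(t, rates)
```

Passing a different `delta_at` function, for example one that returns values drawn at random, would turn the same loop into a variable-rate experiment.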

6.1. Modified recurrent model

All the experiments thus far have assumed constant mean arrival and departure rates, regardless of the latent time delays in the infection process and fluctuations in quarantine and repair times. First of all, a node does not become infectious (a transmitter) immediately after an attack has occurred; it takes a random time, depending on the usage of the computer and the infection type, for a node to become a transmitter. Likewise, the repair facilities have a random throughput, depending on the repairman's abilities and the extent of the damage to the system under repair. It is more realistic to have random arrival and departure rates; thus, we modify the system so that the related parameters are uniformly random, incorporating the random latent time periods and random service rates into the modified model. Figure 10 illustrates the output of the modified model with the random parameters incorporated.

[Figure 10: plot titled "Random Latency and Passage Times"; curves: Infected (I), Quarantined (Q), Recovered (R); vertical axis: Number of Machines; horizontal axis: TIME.]

Figure 10. Random rates for a network of 100 nodes of REM using random latent time parameters and random process behaviors.

It should be noted that, for the above parameters, the basic REM shown in Figure 8 behaves in a significantly different way from the advanced (randomized) model shown in Figure 10. However, after a long run of experiments with a huge number of hosts, a steady state with similar shapes, showing a significant effect of the latencies between the states, will be achieved. As shown in Figure 10, delays caused by the latent time during the infection period, user vigilance, and reduced recovery capacity lead to increased delays spreading over a longer time interval. Unfortunately, the lack of real data makes it difficult to compare the simulation results to real-life scenarios.
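The uniformly random latent and service times described in this subsection can be sketched as follows. The example draws, for each node, a latent time, a quarantine time, and a repair time from uniform distributions and averages the total delay from attack to healthy state. The ranges, the seed, and the function names are hypothetical; the paper does not state the bounds it used.

```python
import random


def sample_passage(rng, latent=(1.0, 5.0), quarantine=(2.0, 6.0), repair=(1.0, 3.0)):
    """Draw one node's delays; every (low, high) range is a hypothetical value."""
    return (
        rng.uniform(*latent),      # time from attack to becoming a transmitter
        rng.uniform(*quarantine),  # time spent in quarantine
        rng.uniform(*repair),      # time spent under repair
    )


def mean_total_delay(n_nodes, seed=0):
    """Average attack-to-healthy delay over n_nodes independent nodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_nodes):
        total += sum(sample_passage(rng))
    return total / n_nodes


if __name__ == "__main__":
    print(f"mean total delay = {mean_total_delay(10_000):.3f} time units")
```

With these assumed ranges the expected total delay is 3 + 4 + 2 = 9 time units, so the printed average should lie close to 9; widening any range lengthens the tail of the delay distribution, which is the qualitative effect the randomized model is meant to capture.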

6.2. Optimization parameters

It can be readily observed that the infection and susceptibility rates in this propagation and extinction model play the central role in the growth of repair requests, while the quarantine and recovery rates affect the final recovery process. It is often the case that, once a machine in a network is infected, the infection spreads immediately if the susceptibility rate is high enough, and hence requests for machine repair increase. If the repair service (quarantine + recovery) is short of resources, the system may lock at some point. The recurrent system is typically controlled by two joint sets of parameters:

1. Susceptibility and infection parameters, δs and δi.
2. Discovery, quarantine, and recovery parameters, δq, δr, and δh.

Susceptibility depends mainly on the network topology, the degree of connection, and how frequently the machines are used. Since the usage of machines cannot be controlled for individuals, susceptibility can only be reduced by using virtually isolated networks and by training the users. Thus, the associated parameter δs can be controlled by two additional coefficients: topology (e.g., the number of hosts affected by each other) and user vigilance. The susceptible-to-infected parameter δi depends on δs, the protection strength of the hosts, and the strength of the attacker. The infected-to-quarantine parameter δq, the quarantine-to-recovered parameter δr, and the recovered-to-healthy parameter δh depend on a combination of user vigilance and the capabilities of the repair facilities and network administrators. These dependencies are reflected as a total delay in the recovery process, which is also illustrated in Figure 10.

7. CONCLUSIONS

There exist numerous worm propagation models based on classical epidemic models, which are similar in context and mostly deterministic. Few of the recent models go deeper into the details of stochastic approaches. We presented here a purely stochastic model covering a broad range of Internet epidemiology, in which the propagation and recovery models are discussed in detail. Each propagation model should have a corresponding recovery model that aligns with the characteristics of the propagation in order to achieve better results. In contrast to bare worm propagation modeling, we need to focus more on the accompanying recovery modeling. Here, we have presented two stochastic models, a worm propagation model and a recovery model associated with it. Both the propagation and recovery processes were shown to be recurrent.

It is important to determine the worm propagation rate in a dynamic context. In many cases, all nodes in a network can be considered susceptible if one is a transmitter. Most of the classical epidemic models rely on this assumption; however, in a realistic case, this might be critically misleading. Our model makes a clear distinction based on the protection level of the nodes combined with the strength of attackers, the recurrence parameters, and the infection, spread, and recovery rates. We have therefore presented a more detailed, realistic model consisting of the most fundamental parameters. Among them, the rates of susceptibility, quarantine, and infection, and the capabilities of the recovery facilities, played fundamental roles in configuring the overall propagation, spread, and extinction rates. By modifying these parameters, one can capture the dynamics of a realistic epidemic picture of almost any population. In addition to the analysis presented here, further studies on the presented model are needed.

It was observed that short infection periods may be achieved by fast quarantine and recovery techniques or by inherent system immunity. Extremely short infection periods are not considered realistic infections. Therefore, it is important to determine whether the infection periods are abnormally short before fitting the data to an experimental distribution. Heterogeneity of victim systems is an important issue, which can be studied by modeling the behaviors of the infection and recovery states with modified random parameters. This analysis is important for building a proper model with realistic latent time periods and worm characteristics. Furthermore, the model presented must be expanded to incorporate new births (recently added nodes), nonrecovered (dead) nodes, and agents exported (e.g., infected messages) to external networks. Due to the limited scope and size of the paper, we were unable to present the entire work here. Ideally, a holistic recurrent model for the entire Internet deserves a separate treatment.

REFERENCES

1. Rohloff KR, Başar T. Deterministic and stochastic models for the detection of random constant scanning worms. ACM Transactions on Modeling and Computer Simulation 2008; 18(2): 1–24, DOI: http://doi.acm.org/10.1145/1346325.1346329.
2. Kondakci S. Epidemic state analysis of computers under malware attacks. Simulation Modelling Practice and Theory 2008; 16(5): 571–584, DOI: 10.1016/j.simpat.2008.02.011.
3. Rohloff K, Basar T. Stochastic behavior of random constant scanning worms. In Proceedings of the 14th International Conference on Computer Communications and Networks (ICCCN 2005), 2005; 339–344, DOI: 10.1109/ICCCN.2005.1523881.
4. Nicol DM. The impact of stochastic variance on worm propagation and detection. In WORM '06: Proceedings of the 4th ACM workshop on Recurring malcode. ACM: New York, NY, USA, 2006; 57–64, DOI: http://doi.acm.org/10.1145/1179542.1179555.
5. Anderson H, Britton T. Stochastic Epidemic Models and Their Statistical Analysis, Lecture Notes in Statistics. Springer: New York, 2000.
6. Avlonitis M, Magkos E, Stefanidakis M, Chrissikopoulos V. A novel stochastic approach for modeling random scanning worms. In PCI '09: 13th Panhellenic Conference on Informatics, 2009; 176–179, DOI: 10.1109/PCI.2009.20.
7. Zou CC, Towsley D, Gong W. On the performance of internet worm scanning strategies. Performance Evaluation 2006; 63(7): 700–723, DOI: 10.1016/j.peva.2005.07.032.
8. Mukhopadhyay B, Bhattacharyya R. Existence of epidemic waves in a disease transmission model with two-habitat population. Int. J. Syst. Sci. 2007; 38(9): 699–707, DOI: http://dx.doi.org/10.1080/00207720701596417.
9. d'Onofrio A. Biomathematical analysis and extension of the new class of epidemic models proposed by Satsuma et al. (2004). Applied Mathematics and Computation 2005; 170(1): 125–134, DOI: 10.1016/j.amc.2004.10.083.
10. Barthélemy M, Barrat A, Pastor-Satorras R, Vespignani A. Dynamical patterns of epidemic outbreaks in complex heterogeneous networks. Journal of Theoretical Biology 2005; 235(2): 275–288, DOI: 10.1016/j.jtbi.2005.01.011.
11. Liljenstam M, Nicol DM, Berk VH, Gray RS. Simulating realistic network worm traffic for worm warning system design and testing. In WORM '03: Proceedings of the 2003 ACM workshop on Rapid Malcode. ACM Press: New York, NY, USA, 2003; 24–33, DOI: http://doi.acm.org/10.1145/948187.948193.
12. Moore D, Shannon C, Claffy K. Code-red: a case study on the spread and victims of an internet worm. In IMW '02: Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurment. ACM Press: New York, NY, USA, 2002; 273–284, DOI: http://doi.acm.org/10.1145/637201.637244.
13. Moore D, Paxson V, Savage S, Shannon C, Staniford S, Weaver N. Inside the slammer worm. IEEE Security & Privacy 2003; 1(4): 33–39, DOI: http://dx.doi.org/10.1109/MSECP.2003.1219056.
14. Sharif MI, Riley GF, Lee W. Comparative study between analytical models and packet-level worm simulations. In PADS '05: Proceedings of the 19th Workshop on Principles of Advanced and Distributed Simulation. IEEE Computer Society: Washington, DC, USA, 2005; 88–98, DOI: http://dx.doi.org/10.1109/PADS.2005.5.
15. Daley D, Gani J. Epidemic modeling, an introduction. Cambridge University Press: Cambridge, U.K., 1999.
16. Brauer F, Castillo-Chavez C. Mathematical Models in Population Biology and Epidemiology. Springer: Berlin, 2001.
17. Diekmann O, Heesterbeek J. Mathematical Epidemiology of Infectious disease, Wiley Series in Mathematical and Computational Biology. Wiley: Chichester, 2000.
18. Hethcote H. The mathematics of infectious disease. SIAM Review 2000; 42(4): 599–653.
19. Qing S, Wen W. A survey and trends on internet worms. Computers & Security 2005; 24(4): 334–346, DOI: 10.1016/j.cose.2004.10.001.
20. Ellis D. Worm anatomy and model. In WORM '03: Proceedings of the 2003 ACM workshop on Rapid Malcode. ACM Press: New York, NY, USA, 2003; 42–50, DOI: http://doi.acm.org/10.1145/948187.948196.
21. Wong C, Bielski S, McCune JM, Wang C. A study of mass-mailing worms. In WORM '04: Proceedings of the 2004 ACM workshop on Rapid malcode. ACM Press: New York, NY, USA, 2004; 1–10, DOI: http://doi.acm.org/10.1145/1029618.1029620.
22. Weaver N, Paxson V, Staniford S, Cunningham R. A taxonomy of computer worms. In WORM '03: Proceedings of the 2003 ACM workshop on Rapid Malcode. ACM Press: New York, NY, USA, 2003; 11–18, DOI: http://doi.acm.org/10.1145/948187.948190.
23. Shannon C, Moore D. The spread of the witty worm. IEEE Security & Privacy 2004; 2(4): 46–50, DOI: http://dx.doi.org/10.1109/MSP.2004.59.
24. Li P, Salour M, Su X. A survey of internet worm detection and containment. IEEE Communications Surveys & Tutorials 2008; 10(1): 20–35, DOI: 10.1109/COMST.2008.4483668.
25. Sellke SH, Shroff NB, Bagchi S. Modeling and automated containment of worms. IEEE Transactions on Dependable and Secure Computing April–June 2008; 5(2): 71–86, DOI: 10.1109/TDSC.2007.70230.
26. Wei S, Mirkovic J, Swany M. Distributed worm simulation with a realistic internet model. In PADS 2005: Workshop on Principles of Advanced and Distributed Simulation, 2005; 71–79, DOI: 10.1109/PADS.2005.7.
27. Zou CC, Towsley D, Gong W. Modeling and simulation study of the propagation and defense of internet e-mail worms. IEEE Transactions on Dependable and Secure Computing 2007; 4(2): 105–118, DOI: http://dx.doi.org/10.1109/TDSC.2007.1001.
28. Liu Z, Lee D. Coping with instant messaging worms - statistical modeling and analysis. In LANMAN 2007: 15th IEEE Workshop on Local & Metropolitan Area Networks, 2007; 194–199, DOI: 10.1109/LANMAN.2007.4295998.
29. Hatahet S, Challal Y, Bouabdallah A. Bittorrent worm sensor network: P2p worms detection and containment. In 17th Euromicro International Conference on Parallel, distributed and network-based processing, 2009; 293–300, DOI: 10.1109/PDP.2009.61.
30. Feng C, Qin Z, Cuthbet L, Tokarchuk L. Propagation model of active worms in p2p networks. In ICYCS 2008: The 9th International Conference for Young Computer Scientists, 2008; 1908–1912, DOI: 10.1109/ICYCS.2008.237.
31. Kotenko I. Framework for integrated proactive network worm detection and response. In 17th Euromicro International Conference on Parallel, distributed and network-based processing, 2009; 379–386, DOI: 10.1109/PDP.2009.52.
32. Yamada Y, Katoh T, Bista B, Takata T. A new approach to early detection of an unknown worm. In AINAW '07: 21st International Conference on Advanced Information Networking and Applications Workshops, vol. 1, 2007; 194–198, DOI: 10.1109/AINAW.2007.33.
33. Costa M, Crowcroft J, Castro M, et al. Stopping internet epidemics. In International Zurich Seminar on Communications, 2006; 86–89, DOI: 10.1109/IZS.2006.1649086.
34. Chen S, Tang Y. Slowing down internet worms. In ICDCS '04: Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS'04). IEEE Computer Society: Washington, DC, USA, 2004; 312–319.
35. Xiong J. Act: attachment chain tracing scheme for email virus detection and control. In WORM '04: Proceedings of the 2004 ACM workshop on Rapid malcode. ACM Press: New York, NY, USA, 2004; 11–22, DOI: http://doi.acm.org/10.1145/1029618.1029621.
36. Wong C, Wang C, Song D, Bielski S, Ganger GR. Dynamic quarantine of internet worms. In DSN '04: Proceedings of the 2004 International Conference on Dependable Systems and Networks (DSN'04). IEEE Computer Society: Washington, DC, USA, 2004; 73–82.
37. Weaver N, Hamadeh I, Kesidis G, Paxson V. Preliminary results using scale-down to explore worm dynamics. In WORM '04: Proceedings of the 2004 ACM workshop on Rapid malcode. ACM Press: New York, NY, USA, 2004; 65–72, DOI: http://doi.acm.org/10.1145/1029618.1029628.
38. Porras P, Briesemeister L, Skinner K, Levitt K, Rowe J, Ting YCA. A hybrid quarantine defense. In WORM '04: Proceedings of the 2004 ACM workshop on Rapid malcode. ACM Press: New York, NY, USA, 2004; 73–82, DOI: http://doi.acm.org/10.1145/1029618.1029630.
39. Athreya K, Ney P. Branching processes 1972. URL, citeseer.ist.psu.edu/athreya99branching.html.
40. Kalinkin AV. Final probabilities for a branching process with interaction of particles and an epidemic process. Theory of Probability and its Applications 1999; 43(4): 633–640, DOI: 10.1137/S0040585X97977203.
41. Gmb HA. Description of w32/nimda (w32/nimda.eml) - malware March 2008. URL, www.avira.com/en/threats/ section.
42. Albert RZ, Barabási AL. Statistical mechanics of complex networks. Reviews of Modern Physics 2002; 74(1): 47–97, DOI: 10.1103/RevModPhys.74.47.
43. Pastor-Satorras R, Vespignani A. Epidemics and Immunization in Scale-free Networks. Wiley-VCH: Berlin, 2004; DOI: 10.1007/978-3-642-10625-5 26.
44. Aparicio JP, Pascual M. Building epidemiological models from R0: an implicit treatment of transmission in networks. Proceedings of the Royal Society B: Biological Sciences, vol. 274 (1609), PubMed Central, 2007; 505–512, DOI: http://10.1098/rspb.2006.0057.
45. Volz E, Meyers LA. Epidemic thresholds in dynamic contact networks. Journal of the Royal Society Interface 2009; 6: 233–241, DOI: 10.1098/rsif.2008.0218.
46. Archer A, Williamson DP. Faster approximation algorithms for the minimum latency problem. In SODA '03: Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2003; 88–96.
47. Kogan Y, Choudhury G. Two problems in internet reliability: new questions for old models. SIGMETRICS Performance Evaluation Review 2004; 32(2): 9–11, DOI: http://doi.acm.org/10.1145/1035334.1035339.
48. Blum A, Chalasani P, Coppersmith D, Pulleyblank B, Raghavan P, Sudan M. The minimum latency problem. In STOC '94: Proceedings of the twenty-sixth annual ACM symposium on Theory of computing. ACM: New York, NY, USA, 1994; 163–171, DOI: http://doi.acm.org/10.1145/195058.195125.
49. Buzacott JA, Shanthikumar JG. Stochastic Models of Manufacturing Systems. Prentice Hall PTR: Englewood Cliffs, NJ, 07632, 1993.
50. Cui L, Li H, Li J. Markov repairable systems with history-dependent up and down states. Stochastic Models 2007; 23(4): 665–681, DOI: http://10.1080/15326340701645983.


1. INTRODUCTION

We present here a new epidemic model for managing the lifecycle of infection phases, carrier propagation, and recovery in large, scale-free networks such as the Internet. By its nature, epidemiology touches on several areas in which a vast amount of research on epidemic modeling and simulation exists. However, apart from the model presented here, infection handling, recovery, and resource management models are difficult to find. Thus, compared to recent propagation models, we also present detailed approaches to the recovery modeling of failed systems. The recurrent epidemic model (REM) offers an extended state transition structure in order to analyse the necessary system recovery functionality. To achieve the desired accuracy, we define a complex stochastic model consisting of a sequence of stochastically dependent states. In REM, a transmitter node cannot remain a transmitter forever, nor can a recovered node remain immune indefinitely. Recovered nodes always return to the healthy population (state), where some may become susceptible and consequently reinfected.

The uniqueness of REM lies in the fact that it concretely defines the recovery phases of infected nodes and goes deeper into the analysis of the recovery phases in accordance with the infection and propagation phases. Although there exist several stochastic worm propagation models [1-6], the REM model is a pure stochastic system with five major states: healthy, susceptible, infected, quarantined, and recovered. These models, except [2], focus mainly on worm scanning and propagation, while REM focuses more on the quarantine and recovery processes. Infected computers do not remain in the infected state indefinitely; sooner or later they return to the healthy state, although they may become susceptible again. Unlike REM, many worm models do not consider recurrence or the modeling of recovery operations; most focus on worm scanning strategies [7] and propagation modeling. Classical epidemic models tend to have limitations, as most ignore the details of recovery modeling and, in general, omit the discussion of the stochastic behavior of Internet epidemiology. Indeed, as substantiated in Ref. [1], despite the wide acceptance of classical deterministic models, the underlying propagation behavior of random scan worms has been shown to be stochastic; see [1] for further discussion of the classical Kermack-McKendrick model and the stochastic model for random constant scan worms.

Although the model deals with viral infections inside a computer network, it can also be applied to sparsely interconnected networks of various types of populations. Indeed, an analogy can be drawn between biological epidemiology, e.g., Refs. [8-10], and computer epidemiology. Accordingly, a variety of epidemic models have been adapted to computer epidemics. Three of the most generally discussed are briefly introduced in Ref. [11]. Malware carrier mechanisms (agents) transmit viruses to many individuals, often in a stochastic manner. The most widely encountered carrier mechanisms are e-mails, instant messengers, network fileshares, scan worms, and computer programs shared among individuals over the Internet. We use two sets of terminology here: (1) the terms user-agents, mailboxes, instant messages, and malware are used interchangeably to denote suspicious carrier mechanisms; (2) the terms node, host, vertex, machine, computer, and system are also used interchangeably. We define an infected node as a repairable system, since the node can be disinfected and restored to operation after a failure. Although there exist several types of infection strategies (peer-to-peer (P2P), fileshares, e-mail attachments, steganographic contents, executable e-mail contents, and messengers), our discussion considers random scan worms.

Though various aspects of Internet worms can be studied, such as worm intelligence, discovery, deployment, and defense, we focus on the recovery characteristics of infections and the resource management needed to control an eventual outbreak. This is necessary because, during an epidemic outbreak, it is important to understand which capabilities and strategies are required to manage the outbreak effectively. We need to explore the time-dependent states of systems in order to determine functions representing the spread and recovery rates and the queueing characteristics of infection discovery, quarantine, and recovery of the infected systems. Determining these functions plays a central role in obtaining efficient quarantine and recovery strategies that increase operational reliability and overall system efficiency. REM defines a set of flow parameters that represent the growth and extinction rates of infections. The flow parameters focus on critical points of intervention in the infection and recovery pathway. Ideal protection solutions can be achieved by appropriately matching these flow parameters to victim systems, so that realistic worm propagation figures and effective management strategies can be obtained by configuring these parameters. Anti-virus applications can customize and apply the flow parameters to strengthen the worm discovery, disinfection, and protection mecha-
Although various aspects of Internet worms can be studied, such as worm intelligence, discovery, deployment, and defense, we focus on the recovery characteristics of infections and on the resource management needed to control an eventual outbreak. This is necessary because, during an epidemic outbreak, it is important to understand which capabilities and strategies are required to manage the outbreak effectively. We need to explore the time-dependent states of systems in order to determine functions representing the spread and recovery rates and the queueing characteristics of infection discovery, quarantine, and recovery of the infected systems. The determination of these functions plays a central role in obtaining efficient quarantine and recovery strategies that increase operational reliability and overall system efficiency. REM defines a set of flow parameters that represent the growth and extinction rates of infections. The flow parameters focus on critical points of intervention in the infection and recovery pathway. Ideal protection solutions can be achieved by appropriately matching these flow parameters to victim systems, so that realistic worm propagation figures and effective management strategies can be obtained by configuring these parameters. Anti-virus applications can customize and apply the flow parameters to strengthen the worm discovery, disinfection, and protection mechanisms. Recovery facilities can also benefit from this model by establishing effective resource allocations based on long-run model emulations.

Earlier, it was assumed that infected individuals, at any given time, have fixed and equal probabilities for the infectious, quarantined, and recovered states. This assumption holds in many existing models [9]; in reality, however, there is a random latent period between the time a node is visited by a worm and the time it becomes a transmitter. Likewise, the quarantine of an infected node might be subject to a random delay before the node is cured. The delivery of the recovered systems to the healthy population is also subject to random delays. As also stated in Ref. [2], many models assume constant transition rates from the susceptible to the transmitter state, and from the transmitter to the recovered state, i.e., these models consider homogeneous objects and infectors, so that state transitions occur with a constant probability. In reality, attacked systems are non-homogeneous, and it is impossible to predict accurately when an infected node will become a transmitter. The passage time from being infected to becoming a transmitter (infectious) is called the latent time period; its length depends on the way worms operate and on the protection characteristics of the computers in use. For example, if a virus is deployed via e-mail attachments, and the users on a given node do not access incoming e-mails at all, then the latent time period is infinite. On the other hand, some worms, such as CodeRedII [12] and SQL-Slammer [13], may infect vulnerable systems immediately while scanning them. As discussed in Ref. [14], recently developed analytical models have been used to generate propagation trends that match these historic worm outbreaks but ignore network traffic characteristics. Other models ignore the randomness of the latent time periods.
However, in reality, we should not expect the flow rates to be constant at all times, because network topologies, worm deployment techniques, operations, infection patterns, and quarantine and recovery conditions differ. In some cases, users do not read e-mails, or do not execute vulnerable applications, for a considerable length of time. In these cases, the classical epidemic models, e.g., Susceptible–Infected–Recovered (SIR) models [9], will fail to present an appropriate scheme. Another shortcoming of some existing models is that, once a node recovers, it stays immune to the virus forever. However, depending on the level of user vigilance, it is quite common for the same Internet virus to infect other computers repeatedly. The REM model does not directly consider the defense strategies needed to slow down the spreading rate of worms; rather, it models the central functions and parameters that can be used to specify the characteristics of propagation and recovery processes. These functions can be applied to determine appropriate recovery strategies by using them at different points of the worm propagation and recovery chain in order to optimize the long-term immunization process.

As presented here, the methodology of REM and its mathematical properties have been analyzed and verified thoroughly, and simulated and tested with relatively small network structures (ca. 100 nodes); however, its relevance to real-world observations at extreme scale needs further experimentation. Realistic distributions of the infectious periods of some existing worms, with regard to mobility within a network structure and among a variety of network structures, should also be tested in REM. The distribution of infection growth, the infectious duration, and the mean infection rates of some known worms in real-world settings also need to be examined with regard to metapopulation theory. Regardless of the theoretical tractability of a model, the effect of connectivity between spatially separated populations (metapopulations) on infection and recovery dynamics is an important factor that may even force the modification of any existing propagation model, including REM, if detailed observations in real-world settings become possible. In order to assess the recovery efficiency of REM, some existing propagation models can also be incorporated into REM and tested together.

1.1. Outline

In the following, Section 2 presents a brief overview of related work and existing models, Section 3 introduces the structure of the REM model, and Section 4 deals with the recurrent algorithm and state analysis of the model. Section 5 presents the theoretical details of the phases and the state transitions of REM. Experiments, simulations, and some results regarding the recovery and system efficiency are given in Section 6. Section 7 concludes the paper.

2. RELATED WORK

Enormous effort has been invested in developing worm propagation models, e.g., Refs. [5,15–18], most of which have been derived from the classical SIS and SIR epidemic models, while others are more general and descriptive. Stochastic worm modeling, which can lead to more realistic worm management structures, has been studied more recently [2]. A comprehensive work, Ref. [1], presents a density-dependent Markov jump process model for random constant scanning (RCS) worms. Ref. [1] also discusses the modeling and detection properties of RCS worms in a mathematically rigorous fashion, and presents a hybrid deterministic/stochastic model for the observation of a worm's scanning behavior. Another realistic stochastic epidemic model, based on a finite Markovian process, is presented in Ref. [2], which mainly determines the state transition dynamics of systems for estimating infection, quarantine, and recovery rates of susceptible systems. A survey of Internet worms and related trends is presented in Ref. [19], which discusses the concepts and research situation of Internet worms, their execution mechanisms, scanning strategies, propagation models, and the critical techniques used for prevention. An anatomy of worms and an analysis of their potency within a specific network are presented in Ref. [20]. A study of mass-mailing worms is given in Ref. [21], a taxonomy of computer worms is presented in Ref. [22], and a global view of the spread of the Witty worm and its powerful features is given in Ref. [23]. A survey and comparison of Internet worm detection and containment schemes is presented in Ref. [24], where the authors analyze and compare different detection algorithms based on worm characteristics by identifying the types of worms that can and cannot be detected by these schemes. A Galton–Watson branching process model for characterizing the propagation of RCS worms is considered in Ref. [25], which deals with both uniform scanning worms and preference scanning worms. It also presents the development of an automatic worm containment strategy to prevent the spread of a worm before an initial outbreak occurs.

Models are often accompanied by a simulation in order to study various parameters, attack scenarios, and impacts. However, as stated in Ref. [26], a realistic model simulation covering all aspects of Internet worm propagation requires extensive resources and is severely limited when run on a single-CPU system. In general, two main aspects have been studied so far, worm propagation modeling and defense strategies dealing with various worm types (i.e., scan worms and user-activated viruses), but recovery modeling has not necessarily been considered. Scan worms are active codes that automatically scan and infect systems in a branching manner. User-activated viruses infect victims through user intervention; they often arrive as e-mail attachments or are injected into P2P fileshares and instant messages. These aspects are dealt with in a number of studies; for example, Ref. [27] considers an e-mail worm simulation model that accounts for the behavior of e-mail users, including the frequency of e-mail reading and the probability of opening an e-mail attachment.
To study topological impacts, Ref. [27] compares e-mail worm propagation on a power-law topology with worm propagation on two other topologies: a small-world topology and a random-graph topology. It concludes that the impact of the power-law topology on the spread of e-mail worms is mixed, i.e., e-mail worms spread more quickly on a power-law topology than on a small-world or a random-graph topology, but immunization defense is more effective on a power-law topology. Instant messengers and P2P fileshares are becoming increasingly popular throughout the Internet community, and worm propagation through these systems is also intensifying. A statistical modeling and analysis of instant messaging worms is presented in Ref. [28], and a P2P worm detection and containment approach is given in Ref. [29]. The topology of P2P networks has an important effect on the spreading of active P2P worms; however, this process is very difficult to model [30]. For this reason, only a very limited number of propagation models for active worms in P2P networks, based on discrete-time methods, have been proposed so far [30]. We also need effective proactive worm detection and response approaches in order to cope with the increasing number of worm types and attack scenarios; e.g., Ref. [31] considers an integrated proactive framework for defense against worms spreading through the Internet. As it states, the framework is intended for worm detection by recognizing host-scanning actions in a network, where the containment of worm spreading is performed by limiting and blocking the network packets transmitted by infected hosts. Early detection of unknown worms must be NP-hard; in relation to this, Ref. [32] proposes a new method for detecting unknown worms by using the hop-number distribution of the network packets received by a host.

It can be intuitively verified that preventing Internet epidemics can be very difficult. An end-to-end architecture called Vigilante is proposed in Ref. [33] to decelerate Internet epidemics. Vigilante relies on collaborative worm detection at end hosts, but does not require hosts to trust each other. There are several practical defense techniques used to slow down the propagation of worms, e.g., Ref. [34], where a distributed anti-worm architecture is proposed to automatically slow down or suspend worm propagation. An automated e-mail virus detection and control scheme using an attachment chain-tracing technique, based on classical epidemiology, is presented in Ref. [35]. An analytical model supported by simulations is presented in Ref. [36], which analyzes different deployment strategies of rate control mechanisms in order to limit the contact rate of worm traffic. An interesting work applying scale-down techniques to approximate global Internet worm dynamics by shrinking the effective size of the network under consideration is given in Ref. [37], where the Slammer worm was simulated to experiment with the worm dynamics. A hybrid quarantine technique is used in Ref. [38] to study the strengths and weaknesses of two complementary worm quarantine strategies under various worm attack profiles.

3. OVERALL STRUCTURE OF REM

We first define an atomic model in which only a single node is considered. A Bayesian network representation describing the causal behavior of the single-node infection, which was proposed in Ref. [2], is given in Figure 1. A susceptible node (S), which has connections to some infectious nodes (I), will be infected, depending on the strength of the attacks and the protection mechanisms of the node. If discovered, the infected node will be quarantined (Q); otherwise it can become a transmitter (T), that is, an active worm carrier that propagates the infection through all vulnerable contacts. Consequently, this branching process will continue spreading throughout the Internet until an extinction state is reached. A susceptible node, if protected, will stay healthy (H). A quarantined node will either (most probably) become healthy, stay quarantined, or become a transmitter if it could not be recovered.

Figure 1. The lifecycle of a single node infection, where depicts a clean node.

In the following, we present the main model, which can be applied to a wide range of epidemic structures with queueing characteristics. A stochastic model is an appropriate approach for this purpose. We assume that all nodes in a given subnetwork (or network) are protected to some extent and, further, that there is at least one infected node in a given network. The infected node is potentially able to spread the virus to neighboring nodes if they are not vaccinated. This means that, at any time, some nodes can be susceptible or contain viruses that are able to infect others. As shown in Figure 2, REM consists of four major phases describing the main states of the network nodes, which are defined to model (1) susceptibility, (2) infection, (3) quarantine, and (4) recovery characteristics. The protection level against infections affects the susceptibility of the system, while the strength of the worm affects the infection phase. The recovery characteristics depend mainly on the infection detection skills, extinction methods, inherent anti-virus mechanisms, and the quarantine techniques. There are two important parameters: susceptibility and vigilance of the quarantine techniques (or service). The number of contacts with infected nodes and the protection level (or the immunity) of the nodes determine the susceptibility parameter.
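The single-node lifecycle described at the start of this section can be written down as a small transition map. The following Python sketch is illustrative only: the names `NodeState`, `ALLOWED`, and `can_move` are hypothetical, and the set of transitions is an assumption read off the prose description of Figure 1, not taken from the figure itself.

```python
from enum import Enum


class NodeState(Enum):
    """States named in the single-node lifecycle description."""
    HEALTHY = "H"
    SUSCEPTIBLE = "S"
    INFECTED = "I"
    QUARANTINED = "Q"
    TRANSMITTER = "T"


# Assumed transitions, taken from the prose: a susceptible node is infected
# or stays healthy; an infected node is quarantined or becomes a transmitter;
# a quarantined node becomes healthy, stays quarantined, or becomes a
# transmitter if it could not be recovered.
ALLOWED = {
    NodeState.SUSCEPTIBLE: {NodeState.INFECTED, NodeState.HEALTHY},
    NodeState.INFECTED: {NodeState.QUARANTINED, NodeState.TRANSMITTER},
    NodeState.QUARANTINED: {
        NodeState.HEALTHY,
        NodeState.QUARANTINED,
        NodeState.TRANSMITTER,
    },
}


def can_move(src: NodeState, dst: NodeState) -> bool:
    """Return True if the assumed map allows a direct move from src to dst."""
    return dst in ALLOWED.get(src, set())


if __name__ == "__main__":
    print(can_move(NodeState.SUSCEPTIBLE, NodeState.INFECTED))
    print(can_move(NodeState.HEALTHY, NodeState.QUARANTINED))
```

A map of this kind only records which moves are possible; the probabilities and delays attached to each move are the subject of the stochastic model developed in the remainder of the paper.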
The susceptibility level of any node depends on the contacting node's status (whether it is infectious or has links to infectious nodes) and on the following factors:

• The infectious node is accessed/contacted via its local file shares; viruses can be hidden in network folders and files that are shared among computers.
• The infectious node is accessed via P2P file shares; viruses can be hidden in external network folders and in files that are shared over P2P connections.
• The e-mail agent of the infectious node sends infected mail messages to recipients that accept the messages and inadvertently activate the attached viruses.
• Scan worm penetrations via the infectious node exploit known vulnerabilities in the applications of victim computers, e.g., the Slammer worm [13].

The vigilance parameter is mostly related to human-caused failures, administrative skills, and the effectiveness of the automated protection systems, if any. As will be detailed in Section 5, the recurrent model also considers the necessary latent time intervals between subsequent states. For instance, there is a random latent time period during which a system may be infected but not yet a virus transmitter. There is yet another delay, which occurs before a recovered system is delivered to the healthy population. For the analysis of the states, we have two classes of stochastic processes: one dealing with the susceptible and infectious populations, and the other dealing with the quarantined and recovered populations. Both processes are considered in detail throughout Section 5. Here, as an example, we briefly present the definition of the stochastic behavior of a susceptible population and assume the same procedure for the evolution of the remaining populations.

Let $H$ be the total number of healthy hosts, including the susceptible ones, $h_0$ be the number of hosts randomly selected at time $t = 0$, and the random variable $\xi(t)$ be the number of successes in a series of $h_0$ Bernoulli trials with probability of success $p(t)$. The values of $h_0$ and $p(t)$ are such that the distribution of $\xi(t)$ is very accurately a Poisson distribution, i.e., the probability of finding exactly $S$ susceptible hosts is given by

\[
P\{\xi(t) = S\} = \frac{\beta^{S}}{S!}\, e^{-\beta}, \qquad S = 0, 1, 2, \ldots, \tag{1}
\]

where $\beta = E\,\xi(t) = h_0\, p(t)$ is the average number of hosts found susceptible in $t$ time units. Alternatively, for smaller population sizes with a higher probability compared to the above (Poisson) process, and for $S = 0, 1, \ldots, H$, we have a binomially distributed probability for the susceptibility growth

\[
P\{\xi(t) = S\} = \binom{H}{S}\, p(t)^{S}\, [1 - p(t)]^{H-S} \tag{2}
\]

Equations (1) and (2) yield, in the limit $t \to \infty$,

\[
H \to \infty, \qquad \beta \to 0, \qquad \beta H \to \delta_s
\]
4. THE RECURRENT ALGORITHM

As can be seen from the following state equations and Figure 2, the model is recurrent: its major property is that recovered systems return to the healthy population, where they can again become susceptible and be infected by other virus types. The recurrent model is described by the following state equations, which modify the related populations during the respective time units $\tau_H = t + \Delta_H$, $\tau_S = t + \Delta_S$, $\tau_I = t + \Delta_I$, $\tau_Q = t + \Delta_Q$, and $\tau_R = t + \Delta_R$:

\[
\begin{aligned}
h(\tau_H) &= h(t) + \int_{t}^{\tau_H} \left[\overleftarrow{h}(u)\, s(u) + \mu(u)\, r(u) - \beta(u)\, h(u)\right] du \\
s(\tau_S) &= s(t) + \int_{t}^{\tau_S} \left[\beta(u)\, h(u) - \left[\overleftarrow{h}(u) + \gamma(u)\right] s(u)\right] du \\
i(\tau_I) &= i(t) + \int_{t}^{\tau_I} \left[\gamma(u)\, s(u) - \kappa(u)\, i(u)\right] du \\
q(\tau_Q) &= q(t) + \int_{t}^{\tau_Q} \left[\kappa(u)\, i(u) - \alpha(u)\, q(u)\right] du \\
r(\tau_R) &= r(t) + \int_{t}^{\tau_R} \left[\alpha(u)\, q(u) - \mu(u)\, r(u)\right] du
\end{aligned}
\tag{3}
\]

Figure 2. Diagram of the REM model representing each state of spread and extinction processes.

The stochastic rate parameters defining the state transitions are as follows:

• healthy to susceptible, $\beta(u)$,
• susceptible to healthy, $\overleftarrow{h}(u)$,
• susceptible to infected, $\gamma(u)$,
• infected to quarantined, $\kappa(u)$,
• quarantined to recovered, $\alpha(u)$, and
• recovered to healthy, $\mu(u)$.

The state parameters $h(t)$, $s(t)$, $i(t)$, $q(t)$, and $r(t)$ contain the instantaneous number of nodes (entities) in the transition epoch $u$ for the related state. The state equation set given in (3) can be represented in a simplified discrete form. That is, the number of entities found in each state during an interval $(t, t + \Delta t]$ will use the discrete-time proportions $\mathit{hs}(t)$, $\mathit{sh}(t)$, $\mathit{si}(t)$, $\mathit{iq}(t)$, $\mathit{qr}(t)$, and $\mathit{rh}(t)$. Setting $\Delta t$ equal for all states as the time epoch, we can write a discrete model expressing the states of the REM algorithm defined by Equation (3). Thus, by referring to Figure 2 and ignoring all constants ($\delta_{\{s,i,q,r,h\}}$) and flow-back parameters $\overleftarrow{h}$, $\overleftarrow{s}$, $\overleftarrow{i}$, $\overleftarrow{q}$, $\overleftarrow{r}$, and $\overleftarrow{H}$, the number of entities found in each state during $t + \Delta t$ can be expressed in the form $\mathrm{state}(t + \Delta t)$, giving the equation set

\[
\begin{aligned}
h(t + \Delta t) &= h(t) + \mathit{sh}(t) + \mathit{rh}(t) - \mathit{hs}(t) \\
s(t + \Delta t) &= s(t) + \mathit{hs}(t) - \mathit{sh}(t) - \mathit{si}(t) \\
i(t + \Delta t) &= i(t) + \mathit{si}(t) - \mathit{iq}(t) \\
q(t + \Delta t) &= q(t) + \mathit{iq}(t) - \mathit{qr}(t) \\
r(t + \Delta t) &= r(t) + \mathit{qr}(t) - \mathit{rh}(t)
\end{aligned}
\tag{4}
\]

These equations can be easily verified by considering the changes in the input and output values of the states. To do so, we first define the leakage $\nabla x(t)$ of a state $x$ as the number of entities emanating from it during $\Delta t$, and the increment $\Delta x(t)$ as the number of entities entering the state. Hence, the leakages of the five states are as follows:

\[
\begin{aligned}
\nabla h(t) &= h(t)\, \beta(t)\, \delta_s(t) \\
\nabla s(t) &= s(t)\, \gamma(t)\, \delta_i(t) + s(t)\, \overleftarrow{h}(t) = s(t)\left[\gamma(t)\, \delta_i(t) + \overleftarrow{h}(t)\right] \\
\nabla i(t) &= i(t)\, \kappa(t)\, \delta_q(t) \\
\nabla q(t) &= q(t)\, \alpha(t)\, \delta_r(t) \\
\nabla r(t) &= r(t)\, \mu(t)\, \delta_h(t)
\end{aligned}
\tag{5}
\]
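As a concrete illustration of the discrete update in Equation (4), the following Python sketch advances the five populations by one time epoch. It is a minimal sketch under stated assumptions: the names `Rates` and `rem_step` are hypothetical, the rates are held constant within the step, all delta parameters are taken as 1, and the numeric values in the usage lines are invented for the demonstration.

```python
from dataclasses import dataclass
from typing import Tuple

State = Tuple[float, float, float, float, float]  # (h, s, i, q, r)


@dataclass(frozen=True)
class Rates:
    beta: float    # healthy -> susceptible
    h_back: float  # susceptible -> healthy (flow-back)
    gamma: float   # susceptible -> infected
    kappa: float   # infected -> quarantined
    alpha: float   # quarantined -> recovered
    mu: float      # recovered -> healthy


def rem_step(state: State, rates: Rates) -> State:
    """Apply one discrete update of the five populations, as in Equation (4)."""
    h, s, i, q, r = state
    # Proportions moved between neighbouring states during one epoch.
    hs = h * rates.beta
    sh = s * rates.h_back
    si = s * rates.gamma
    iq = i * rates.kappa
    qr = q * rates.alpha
    rh = r * rates.mu
    return (
        h + sh + rh - hs,
        s + hs - sh - si,
        i + si - iq,
        q + iq - qr,
        r + qr - rh,
    )


if __name__ == "__main__":
    # Hypothetical rates and an all-healthy start of 100 nodes.
    rates = Rates(beta=0.5, h_back=0.1, gamma=0.2, kappa=1.0, alpha=0.25, mu=0.5)
    state = (100.0, 0.0, 0.0, 0.0, 0.0)
    for _ in range(3):
        state = rem_step(state, rates)
        print(state, sum(state))
```

Because every term that leaves one state enters exactly one other state, the total population is unchanged by the update, which gives a quick sanity check on any implementation.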

Let us now consider the instantaneous increment at time $t$ in the susceptible population illustrated in Figure 3. At time $t$, the proportion of the healthy population moving to the susceptible population is given as

\[
\mathit{hs}(t) = h(t)\, \beta(t)
\]

The increment $\Delta s(t)$ in the susceptible population at time $t$ is (Figure 3)

\[
\Delta s(t) = h(t)\, \beta(t)\, \delta_s(t) \tag{6}
\]

Figure 3. Change in rates between healthy and susceptible population states of REM.

Thus, the number of susceptible nodes at time $t + \Delta t$ is the current value $s(t)$ plus the increment $\Delta s(t)$ minus the decrement (leakage) $\nabla s(t)$:

\[
s(t + \Delta t) = s(t) + \Delta s(t) - \nabla s(t) \tag{7}
\]

By setting the delta parameters $\delta_{\{s,i,q,r,h\}} = 1$ and plugging $\nabla s(t)$ from Equation (5) into Equation (7), we get

\[
s(t + \Delta t) = s(t) + \Delta s(t) - s(t)\, \gamma(t) - s(t)\, \overleftarrow{h}(t)
\]

Substituting $\Delta s(t)$ with Equation (6) and, again, setting the delta parameter $\delta_s(t) = 1$, we obtain

\[
s(t + \Delta t) = s(t) + h(t)\, \beta(t) - s(t)\, \gamma(t) - s(t)\, \overleftarrow{h}(t) \tag{8}
\]

Since

\[
\begin{aligned}
\mathit{hs}(t) &= h(t)\, \beta(t) \\
\mathit{si}(t) &= s(t)\, \gamma(t) \\
\mathit{sh}(t) &= s(t)\, \overleftarrow{h}(t)
\end{aligned}
\]

Equation (8) becomes

\[
s(t + \Delta t) = s(t) + \mathit{hs}(t) - \mathit{sh}(t) - \mathit{si}(t) = s(t) + \Delta s(t) - \nabla s(t)
\]

which verifies the solution for the susceptibility part of the equation set (4). Hence, applying the same procedure as above, the numbers in each of the remaining populations can be found to be

\[
\begin{aligned}
i(t + \Delta t) &= i(t) + \Delta i(t) - \nabla i(t) \\
q(t + \Delta t) &= q(t) + \Delta q(t) - \nabla q(t) \\
r(t + \Delta t) &= r(t) + \Delta r(t) - \nabla r(t) \\
h(t + \Delta t) &= h(t) + \Delta h(t) - \nabla h(t)
\end{aligned}
\tag{9}
\]

We can simplify this model to a classical deterministic model as

\[
\begin{aligned}
\frac{dH}{dt} &= \mu R - \beta H \\
\frac{dS}{dt} &= \beta H - \gamma S \\
\frac{dI}{dt} &= \gamma S - \kappa I \\
\frac{dQ}{dt} &= \kappa I - \alpha Q \\
\frac{dR}{dt} &= \alpha Q - \mu R
\end{aligned}
\tag{10}
\]

A corresponding Kermack–McKendrick model describing a disease propagation with finite-time immunity is defined as

\[
\begin{aligned}
\frac{dS}{dt} &= -\beta S I + \delta R \\
\frac{dI}{dt} &= \beta S I - \gamma I \\
\frac{dR}{dt} &= \gamma I - \delta R
\end{aligned}
\tag{11}
\]

As shown by Equations (3), (4), and (10), compared to existing models (e.g., Equation (11)), REM contains two additional states explicitly dealing with quarantine and recovery operations. A simulation result is shown in Figure 4, where the initial susceptibility $\beta$ and infection $\gamma$ parameters are chosen for a homogeneous network with a constant number of contacts. The recovery process involved a single repairman and random discovery of failed machines. The initial parameters were set as $\beta = 0.7999$, $\gamma = 0.000126$, $\kappa = 1.0$, $\alpha = 0.1613$, and $\mu = 0.09677$.

Figure 4. Effect of high initial susceptibility and rapid quarantine process.

4.1. Changes in populations

During a small time interval, a proportion of the healthy population with the flow rate $\beta$ flows to the susceptible group, see Figure 2. Similarly, a portion of the susceptible population will become infected with a flow rate of $\gamma$. A proportion of the infected population will be delivered to quarantine with a flow rate of $\kappa$, and a fraction of the quarantined group will flow to the recovered group with a rate of $\alpha$. Finally, the recovered nodes will be tested and delivered back to the healthy population at the rate of $\mu$. These recursive operations produce the current number of entities that are kept in five variables for expressing the states of the model, where $h(t)$ denotes the current number of healthy nodes, $s(t)$ the number of susceptible nodes, $i(t)$ the number of infected nodes, $q(t)$ the number of quarantined nodes, and $r(t)$ the number of recovered nodes. Hence, there is a small proportion of each population changed in a small time interval $\Delta t$ according to

\[
\begin{aligned}
\mathit{hs}(t) &= h(t)\, \beta(t) \\
\mathit{sh}(t) &= s(t)\, \overleftarrow{h}(t) \\
\mathit{si}(t) &= s(t)\, \gamma(t) \\
\mathit{iq}(t) &= i(t)\, \kappa(t) \\
\mathit{qr}(t) &= q(t)\, \alpha(t) \\
\mathit{rh}(t) &= r(t)\, \mu(t)
\end{aligned}
\tag{12}
\]
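To complement the flow description above, the deterministic simplification in Equation (10) can be integrated numerically and compared with its equilibrium. The Python sketch below is illustrative and rests on stated assumptions: the function names are hypothetical, a plain forward-Euler scheme stands in for whatever integrator the authors used, and the rates are held constant. The equilibrium follows from setting every derivative in Equation (10) to zero, which forces all five flows to be equal, so each population is proportional to the reciprocal of its outgoing rate.

```python
def derivatives(state, rates):
    """Right-hand side of the deterministic five-state model, Equation (10)."""
    h, s, i, q, r = state
    beta, gamma, kappa, alpha, mu = rates
    return (
        mu * r - beta * h,
        beta * h - gamma * s,
        gamma * s - kappa * i,
        kappa * i - alpha * q,
        alpha * q - mu * r,
    )


def euler(state, rates, dt, steps):
    """Forward-Euler integration (an assumed scheme, chosen for brevity)."""
    for _ in range(steps):
        d = derivatives(state, rates)
        state = tuple(x + dt * dx for x, dx in zip(state, d))
    return state


def steady_state(total, rates):
    """Equilibrium of Equation (10): populations proportional to 1 / rate."""
    inverse = [1.0 / rate for rate in rates]
    norm = sum(inverse)
    return tuple(total * value / norm for value in inverse)


if __name__ == "__main__":
    # Rates quoted for the Figure 4 simulation, in the order
    # (beta, gamma, kappa, alpha, mu); the population of 100 is assumed.
    fig4_rates = (0.7999, 0.000126, 1.0, 0.1613, 0.09677)
    print(steady_state(100.0, fig4_rates))

    # Hypothetical rates with a short transient, integrated from an
    # all-healthy start.
    demo_rates = (0.5, 0.4, 0.3, 0.2, 0.1)
    print(euler((100.0, 0.0, 0.0, 0.0, 0.0), demo_rates, 0.05, 4000))
    print(steady_state(100.0, demo_rates))
```

In this constant-rate setting the slowest rate dominates the equilibrium: the population waiting on the smallest outgoing rate holds the largest share of nodes.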

The equilibrium conditions of the rates with regard to flow-back parameters are normalized as follows:

\[
\begin{aligned}
\overleftarrow{H}(t) + \beta(t) &= 1 \\
\overleftarrow{s}(t) + \overleftarrow{h}(t) + \gamma(t) &= 1 \\
\overleftarrow{i}(t) + \kappa(t) &= 1 \\
\overleftarrow{q}(t) + \alpha(t) &= 1 \\
\overleftarrow{r}(t) + \mu(t) &= 1
\end{aligned}
\tag{13}
\]

Recall that, at any time epoch in a network, we have both susceptible and infected nodes. Some of these infected nodes have contacts with other susceptible nodes, so a suitable contact with any of the infectious nodes can infect the susceptible ones if they are not protected. Assume also that each infective node has a constant number of contacts with the susceptible nodes per epoch (e.g., per hour), that a random number of the infected hosts are quarantined, and that a fraction of the quarantined group will recover during a finite number of epochs. These events are modeled as stochastic functions (or states) in the model; see Section 5 for details. The model comprises a chain of interdependent state functions, where each function performs its tasks in the propagation chain of a state machine. Each function receives its inputs from its predecessor and sends its output data to the next function in the chain; see Figure 2 and Equation (4). As can be seen, the model is a finite state machine in which the operations are recursive and the state changes are stochastic. It can be easily verified that the output of each of the states will fluctuate during the iterative processes performed for infection, quarantine, and recovery operations.

4.2. State control parameters

In order to construct a realistic epidemic system model, we define a set of flow control parameters to study the dynamic characteristics of the propagation, i.e., the infection, annihilation, quarantine, extinction, failure, and repair characteristics of the system. Thus, to express such a stochastic system, the state rates are described by the following normalized equations, which modify the rates at each time slot $t + \Delta t$:

\[
\begin{aligned}
\beta(t + \Delta t) &= [\mathit{hs}(t)\, \delta_s(t)] \le 1 \\
\gamma(t + \Delta t) &= [\mathit{si}(t)\, \delta_i(t)] \le 1 \\
\kappa(t + \Delta t) &= [\mathit{iq}(t)\, \delta_q(t)] \le 1 \\
\alpha(t + \Delta t) &= [\mathit{qr}(t)\, \delta_r(t)] \le 1 \\
\mu(t + \Delta t) &= [\mathit{rh}(t)\, \delta_h(t)] \le 1
\end{aligned}
\tag{14}
\]

Each of these equations determines the individual flow density between consecutive states, so that changes in the states occur in a normalized form from time $t$ to $t + \Delta t$. The non-negative iid random parameters (rates) $\beta$, $\gamma$, $\kappa$, $\alpha$, and $\mu$ express the stochastic behavior of the rates at which the state transitions occur. The delta parameters $\delta_{\{s,i,q,r,h\}}$ control the flow density between the states in the chain. Table I summarizes the state transitions and related parameters of the entire model shown in Figure 2. The numbers in the states are modified by the system (state transition) probabilities described in Section 5, which further depend on the individual population size, the state transition probabilities, and some environmental factors such as user vigilance and repair service capabilities. The delta parameters are described below.

Table I. State transitions and the associated flow parameters of the recurrent states.

State space = {H, S, I, Q, R}

From → To     Jump rate              Description                  Flow ctrl.
H → S         $\beta$                Healthy to susceptible       $\delta_s$
              $\overleftarrow{h}$    Susceptible to healthy       --
              $\overleftarrow{H}$    Healthy to healthy           --
S → I         $\gamma$               Susceptible to infected      $\delta_i$
              $\overleftarrow{i}$    Infected to infected         --
              $\overleftarrow{s}$    Susceptible to susceptible   --
I → {Q, R}    $\kappa$               Infected to quarantined      $\delta_q$
              $\alpha$               Quarantined to recovery      $\delta_r$
              $\overleftarrow{q}$    Quarantined to quarantined   --
R → H         $\mu$                  Recovered to healthy         $\delta_h$
              $\overleftarrow{r}$    Recovered to recovered       --

Initial susceptibility parameter $\delta_s$: The openness of a system, characterized by the level of border protection (e.g., e-mail server protection), the number of connections to vulnerable hosts, and the average connectivity of a node with infectious nodes are the major factors that affect the initial susceptibility of a network. This parameter regulates the rate of susceptibility and epidemic growth. Generally, it is useful in determining the epidemic threshold and the potential of the incoming threats. In other words, $\delta_s$ controls the flow rate from the healthy to the susceptible state by controlling worm scans. For example, the SQL-Slammer worm has a high value for $\delta_s$.

Infection parameter $\delta_i$: The individual strength parameter $\delta_i$ defines the strength of a node, which further regulates the rate of infection flow from that node. In addition to the e-mail server protection mentioned above, individual nodes may also be strengthened or, alternatively, may already have the required immunity against some known threats. Latent time factors for infections are preferably configured with $\delta_i$. Thus, a node being visited by a worm may have a latent time period before becoming infectious. In this case $\delta_i$, which is also called the susceptible-to-infectious parameter, is the most appropriate parameter to use.

User vigilance parameter $\delta_q$: Virus detection abilities and user vigilance are modeled with the user vigilance parameter $\delta_q$.
User awareness, infection discovery techniques, and quarantine queue handling use this parameter. For example, increasing the infectious-to-quarantine rate can lead to shorter recovery times through the effective use of quarantine time. $\delta_q$ is also called the infectious-to-quarantine parameter.

Quarantine parameter $\delta_r$: The quarantine efficiency parameter $\delta_r$ controls the rate at which the infected nodes are cured (diagnosis and removal operations) and delivered to the acceptance test by the recovery function. There are various quarantine techniques; therefore, in many cases we need to analyse the consequences of quarantine delays. In short, the efficiency of the required quarantine operations is controlled with $\delta_r$, which is also called the quarantine-to-recovery parameter.

Recovery parameter $\delta_h$: The recovery parameter $\delta_h$ controls the efficiency of harnessing the recovered systems. More precisely, $\delta_h$ controls the rate at which a recovered system is patched, configured, and returned to the healthy population. For example, a delay in service availability or reduced throughput due to a shortage of repairmen can be modeled with $\delta_h$, also called the operational (or recovery-to-healthy) parameter.
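The effect of the five delta parameters on the per-epoch outflow of each state can be illustrated with the leakage expressions of Equation (5). The Python sketch below is a hedged illustration: the names `FlowControl` and `leakages` are hypothetical, the rates are passed as a plain tuple in the order (β, γ, κ, α, µ), the susceptible-to-healthy flow-back rate is an optional argument that defaults to zero, and all numbers are invented for the demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowControl:
    """Delta parameters; a value of 1.0 leaves the corresponding flow unscaled."""
    delta_s: float = 1.0  # healthy -> susceptible
    delta_i: float = 1.0  # susceptible -> infected
    delta_q: float = 1.0  # infected -> quarantined
    delta_r: float = 1.0  # quarantined -> recovered
    delta_h: float = 1.0  # recovered -> healthy


def leakages(state, rates, ctrl, h_back=0.0):
    """Per-epoch outflow of each state, following the form of Equation (5)."""
    h, s, i, q, r = state
    beta, gamma, kappa, alpha, mu = rates
    return {
        "h": h * beta * ctrl.delta_s,
        "s": s * (gamma * ctrl.delta_i + h_back),
        "i": i * kappa * ctrl.delta_q,
        "q": q * alpha * ctrl.delta_r,
        "r": r * mu * ctrl.delta_h,
    }


if __name__ == "__main__":
    # Hypothetical populations and rates.
    state = (50.0, 30.0, 40.0, 20.0, 10.0)
    rates = (0.5, 0.2, 0.5, 0.25, 0.1)
    print(leakages(state, rates, FlowControl()))
    # Lower user vigilance: fewer infected nodes reach quarantine per epoch.
    print(leakages(state, rates, FlowControl(delta_q=0.5)))
```

Scaling a single delta parameter changes only the outflow of its own state within one epoch; the downstream populations respond in later epochs through the recurrent update.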

5. EVOLUTION OF THE STATES

To simplify the reliability analysis of the affected nodes, we classify machine reliability states into two groups of processes, failure analysis and repair analysis. The failure analysis uses the "patient" (or infection) parameter tuples $(\beta, \delta_s)$ and $(\gamma, \delta_i)$, and the repair analysis uses the "care service" parameter tuples $(\kappa, \delta_q)$, $(\alpha, \delta_r)$, and $(\mu, \delta_h)$. Appropriate modeling of these parameters is important, because they are used to determine the overall state of an optimum system and to describe the protection strength of individual nodes and the quarantine and recovery efficiencies of the failed (infected) systems. The patient parameters $\delta_s$ and $\delta_i$ mainly depend on the protection strength of the environment, which can be determined either analytically or empirically, as appropriate. As presented in the following, these parameters are determined from composite probabilistic models based on the average connectivity $\langle k \rangle$ with infectious systems and on the susceptibility, infection, quarantine, and recovery rates.

We use here the notations and conventions of queueing theory. Service requests (arrivals) for repairs and repair completions (departures) are modeled as a first-come-first-served (FCFS) Markovian queue with exponentially distributed interarrival and repair times. That is, the REM model can be thought of as a typical queueing system with some specific statistical properties that are described in the following sections. Viewed at a coarse level, the infection and extinction processes take place in four consecutive phases: (1) initial triggering and susceptibility growth, (2) epidemic outbreak (infection spread), (3) discovery and delivery of infected hosts to the quarantine service, and (4) the recovery phase. Evolution of these phases assumes that at least one host is initially infected (initial triggering), so that a phase of susceptibility growth takes place as a binomially distributed process.
Following this phase the spread of worms through the susceptible population takes place as an exponentially distributed process. The infected hosts are then inspected,


discovered, quarantined, and delivered to the queue of the repair service. Following the repair process, the repaired nodes are patched, equipped with anti-virus software (vaccinated), and merged into the healthy population.

5.1. Initial susceptibility and epidemic outbreak

A host is susceptible if it maintains contact with one or more infectious hosts. The susceptible host, if not sufficiently protected, will become infectious (a transmitter), and thus a cascade of worm spread will start slowly, intensify, and annihilate over the course of time. The cascade process is modeled as a random branching process with a binomial generation function; see Refs. [39,40] for a review of branching processes. Indeed, a branching process is a Markov process that models a population in which each individual in generation $n$ produces a random number of individuals in generation $n + 1$, according to a fixed probability distribution that does not vary from individual to individual. This holds for the same type of worm attacking the same type of machines; otherwise, the probability distribution will differ across worm types and machines. Hence, the initial worm activities take place in two consecutive phases: an initial epidemic threshold and an outbreak, which follows the initial threshold time. Prior to the initial epidemic threshold there is a large number of susceptible hosts while the probability of an initial outbreak is relatively small. The spread rate of worms is mainly affected by the worm type and the number of susceptible contacts with infectious hosts. For example, in an insufficiently protected network, the Nimda worm [41] has an initial infection rate of $P(n) = (1 - n/78.6)^{-0.463}$, expressed by the number of infectious hosts, $n$, that have contacts with the newly infected host. More formally, assume that a number of susceptible individuals are selected at random and inoculated with the virus.
There is a probability $\xi_k(t)$ per time unit that the infectious node $k$ will transmit the virus to a neighboring susceptible node. Therefore, if a susceptible node $\vartheta$ has $n$ independent infectious neighbors during the time $t$, the total probability $P(\vartheta)$ that the node becomes infected during that time unit is

$$P(\vartheta) = 1 - \prod_{i=1}^{n} [1 - \xi_i(t)]$$
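As a small numerical illustration of the product formula for $P(\vartheta)$, the following sketch computes the infection probability of a susceptible node from the per-neighbor transmission probabilities. The function name `infection_probability` and the sample values are illustrative assumptions, not part of the model description.

```python
from math import prod


def infection_probability(xi):
    """P(theta) = 1 - prod_i (1 - xi_i) for independent infectious neighbors.

    `xi` is a list of per-time-unit transmission probabilities, one per
    infectious neighbor (hypothetical input format).
    """
    for p in xi:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each transmission probability must lie in [0, 1]")
    # The node stays healthy only if every neighbor fails to transmit.
    return 1.0 - prod(1.0 - p for p in xi)


if __name__ == "__main__":
    # Three neighbors, each transmitting with probability 0.1 per time unit.
    print(infection_probability([0.1, 0.1, 0.1]))
```

With no infectious neighbors the product is empty and the probability is zero, and a single neighbor that transmits with certainty makes infection certain regardless of the others.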

5.1.1. Degree of connections and critical threshold.

Due to its growth behavior and vertex structure, the Internet bears the characteristics of a scale-free network, which is defined by the power-law degree distribution $P(k) \sim k^{-\Upsilon}$, $2 \le \Upsilon \le 3$. The degree $k$ of a node is defined as the number of links adjacent to it, i.e., the number of connections to other vertices. That is, the probability that a node has connections to $k$ other vertices is defined by the power-law probability distribution $P(k)$. Scale-free networks are characterized by two major properties, growth and preferential attachment. The growth property implies that the number of nodes in a network increases in time, while the latter refers to the fact that new nodes tend to connect to existing nodes that already have many connections. Thus, the probability $P(k_i)$ that a new node will be connected to node $i$ depends on the degree $k_i$ of node $i$ and is given by [42]:

$$P(k_i) \sim \frac{k_i}{\sum_j k_j}$$

The probability that a link points to a node with $c$ connections is given by $cP(c)/\langle k \rangle$, where $\langle k \rangle$ denotes the average degree of connectivity. Thus, the average probability $\Theta(\lambda)$ of a link pointing to an infected node in a homogeneous network having the density of infected nodes $\rho_k$ is computed as

$$\Theta(\lambda) = \frac{1}{\langle k \rangle} \sum_k k P(k) \rho_k \tag{15}$$

hence, the critical epidemic threshold $\lambda_c$ is defined as

$$\frac{\langle k^2 \rangle}{\langle k \rangle} \lambda_c = \frac{1}{\langle k \rangle} \sum_k k P(k) \lambda_c k = 1 \;\Rightarrow\; \lambda_c = \frac{\langle k \rangle}{\langle k^2 \rangle} \tag{16}$$

For homogeneous networks, in which the connectivity fluctuations are negligible, the total prevalence $\rho(t)$ is defined in Ref. [43] as the density of infected nodes present at time $t$:

$$\frac{d\rho(t)}{dt} = -\rho(t) + \lambda \langle k \rangle \rho(t)[1 - \rho(t)] \tag{17}$$

5.1.2. Reproductive number.

A fundamental parameter in epidemiology is the basic reproductive number $R_0$ [44], which in most epidemic models has a specific value, the epidemic threshold, above which epidemics are possible, but below which epidemics cannot occur [45]. The reproductive ratio $R_0$ is defined as the expected number of secondary infections of an initial infectious individual in a completely susceptible host population, and is related to the likelihood and extent of an epidemic. Under the assumption of a homogeneously mixing population, if an infected individual is in contact with $\langle k \rangle$ other individuals, the basic reproductive number is defined as

$$R_0 = \frac{\lambda \langle k \rangle}{\mu}$$

where $\lambda$ is the spreading rate, defined as the probability that a susceptible individual in contact with an infectious individual will contract the disease, and $\mu$ is the removal rate of infected individuals, either to the susceptible or the recovered states. It is easy to understand that any epidemic will spread across a susceptible fraction of the population only for $R_0 > 1$. In this case the epidemic is able to generate a number of infected individuals larger than those which are removed, leading to an increase of the infected individuals $i(t)$ at time $t$ following the exponential form $i(t) \simeq i_0 e^{t/T_d}$. Here $i_0$ is the initial density of infected individuals and $T_d$ is the typical outbreak time expressed as [10]

$$T_d = \frac{1}{\mu (R_0 - 1)}$$

5.2. Epidemic spreading

The epidemic threshold is crossed when at least one contact of an infectious node becomes infected and in turn makes contact with other susceptible nodes. A branching process will take place after the initial epidemic outbreak. The probability that each of the fixed number of contacts or viruses causes $k$ infections is given by $p_k(t)$. Prior to the epidemic outbreak, the branching model ($\xi$) is a random variable having a binomial distribution with parameters $p$ and $n$, so that

$$P_\xi\{k \mid n, p\} = \binom{n}{k} p^k q^{(n-k)}, \quad q = 1 - p, \quad k = 0, \ldots, n \tag{18}$$

Let $\xi(t)$ be the total number of infectious hosts at time $t$, representing a Markov process (continuous-time Markov chain). Suppose we have exactly $k$ infectious hosts initially at time $t = 0$, and let $\xi_i(t)$ be the number of infections generated by the $i$th infectious host after a time $t$. Then $\xi(t) = \xi_1(t) + \ldots + \xi_k(t)$, where the random variables $\xi_1(t), \ldots, \xi_k(t)$ are independent and have the same probability distribution $P\{\xi_i(t) = n\} = p_n(t)$, $n = 0, 1, 2, \ldots$ Let $p_{kn}(t)$ be the probability of $k$ infectious hosts causing $n$ new infections after time $t$, so that the numbers $p_{kn}(t)$ are the transition probabilities of the Markov process $\xi(t)$. For the analysis of this Markov process, we assume that we have a finite (or countable) state space $I = \{0, 1, 2, \ldots\}$. If we are in a certain state (e.g., $j \in I$) at time $t$ in a Markov process, then we can compute the probability of being in a different state $k \in I$ at time $t + \Delta t$, expressed as

$$p_k(\Delta t) = P\{\xi(t + \Delta t) = k \mid \xi(t) = j\}$$

Since we are interested in the evolution of the worm spread for $t > 0$ distributed over small time steps, we need to determine the value for $p_k$ at the origin. From here we can find the rate of transition ($\gamma$) to the infected state. Assuming that $p_k$ is differentiable at 0, its derivative $p'_k(0)$ can be obtained as

$$p'_k(0) = \lim_{\Delta t \to 0} \frac{P\{\xi(t + \Delta t) = k \mid \xi(t) = j\}}{\Delta t} = \lim_{\Delta t \to 0} \frac{p_k(\Delta t) - p_k(0)}{\Delta t - 0} = \gamma_{jk}$$


We can write this as

$$P\{\xi(t + \Delta t) = k \mid \xi(t) = j\} = \gamma_{jk} \Delta t + o(\Delta t) \tag{19}$$

where $o(\Delta t)$ denotes an infinitesimal of higher order than $\Delta t$ having the property $\lim_{\Delta t \to 0} o(\Delta t)/\Delta t = 0$. We have the Markov property stating that if we know the state $\xi(t)$ then all additional information about $\xi$ at times prior to $t$ is irrelevant for the determination of future states. That is, for all $k \ne j$, $t_0 < t_1 < \ldots < t_n < t$ and $x_0, \ldots, x_n \in I$ we have

$$P\{\xi(t + \Delta t) = k \mid \xi(t) = j, \xi(t_i) = x_i\} = P\{\xi(t + \Delta t) = k \mid \xi(t) = j\} = \gamma_{jk} \Delta t + o(\Delta t), \quad i = 1, 2, \ldots \tag{20}$$

Let us define $\gamma_{ij}$ as the rate at which the infection process enters state $j$ from state $i$, which is also termed the transition intensity from $i$ to $j$. We can define a Q-matrix, $Q = (\gamma_{ij} : i, j \in I)$, containing all information about the transitions within the state space of infection $I$. The Q-matrix of a Markov process has the following properties:
• all diagonal entries $\gamma_{ii}$ are non-positive, stating no change in the current state,
• all non-diagonal entries $\gamma_{ij}$, $i \ne j$, are non-negative, stating a transition from the current state,
• the sum over the entries in each row is zero.

The probability of the infection process undergoing a change of states in a small time interval $\Delta t$ given by Equation (22) can be verified as follows. During random worm scans, transitions occur randomly in such a way that the probability of a single infection in the time interval $(t, t + \Delta t]$ is $p_1(\Delta t) = \gamma \Delta t + o(\Delta t)$, and the probability of producing more than one infection is $o(\Delta t)$. That is, the probability of a single infectious host causing one new susceptible (or infected but not yet discovered) host in a small time interval $\Delta t$ is $p_1(\Delta t) = \gamma \Delta t + o(\Delta t)$, and the probability of not causing any infection by that host is $p_0(\Delta t) = 1 - \gamma \Delta t + o(\Delta t)$.

Occurrences in a Markov process during the time interval $(t, t + \Delta t]$ are independent of the occurrences before time $t$. This and the above properties lead us to a general propagation process having Poisson characteristics with rate $\gamma$, i.e., assuming $\xi(0) = 0$ and $n = 0, 1, 2, \ldots$, we have

$$p_n(t) = P\{\xi(t) = n\} = e^{-\gamma t} \frac{(\gamma t)^n}{n!} \tag{21}$$

where $\xi(t)$ is Poisson distributed with parameter $\gamma t$.

Following this brief introduction, we can now build the Q-matrix containing the required transition rates for the infection process. Assume that $p_k = P\{i(t + \Delta t) = k \mid i(t) = j\}$, the probability of $k$ infections in an infection event occurring as a Markov process, is defined as

$$p_k = \begin{cases} \gamma \Delta t + o(\Delta t) & \text{if } k = j + 1 \\ 0 + o(\Delta t) & \text{if } k > j + 1 \\ 0 & \text{if } k < j \end{cases}$$

Thus, we have the Q-matrix

$$Q = \begin{pmatrix} -\gamma & \gamma & 0 & 0 & \cdots \\ 0 & -\gamma & \gamma & 0 & \cdots \\ 0 & 0 & -\gamma & \gamma & \cdots \\ \vdots & \vdots & \vdots & \ddots & \ddots \end{pmatrix}$$

Setting $i(0) = 0$, an analytical solution will lead us to the general propagation formula (21), which is a Poisson process with rate $\gamma$:

$$p_n(t) = P\{i(t) = n\} = e^{-\gamma t} \frac{(\gamma t)^n}{n!}, \quad \text{for } n = 0, 1, 2, \ldots \tag{22}$$

where $\gamma t = E\,i(t)$ gives the average number of nodes found infected in $t$ time units.

5.3. Infection growth as a branching process

Following the initial epidemic outbreak, a worm reproduction process (branching) will start, which can be described as a pure birth process having the following properties:
• a single infection with probability $p = \gamma \Delta t + o(\Delta t)$,
• more than one infection with probability $o(\Delta t)$,
• zero infections with probability $q = 1 - \gamma \Delta t + o(\Delta t)$.

Let $i(\Delta t) = \gamma \Delta t + o(\Delta t)$ be the probability that a single infectious host causes an infection in $\Delta t$, let $i(t) = n$ be the number of infections at time $t$, and let $q = 1 - i(\Delta t)$. Then the probability of no new infections at time $t + \Delta t$ is

$$P\{i(t + \Delta t) = n \mid i(t) = n\} = q^n = [1 - \gamma \Delta t + o(\Delta t)]^n = 1 - n \gamma \Delta t + o(\Delta t) \tag{23}$$

The probability of one infection at time $t + \Delta t$, given $i(t) = n$, is

$$P\{i(t + \Delta t) = n + 1 \mid i(t) = n\} = n\, i(\Delta t)\, q^{n-1} = n[\gamma \Delta t + o(\Delta t)] q^{n-1} = n \gamma \Delta t [1 - \gamma \Delta t + o(\Delta t)]^{n-1} = n \gamma \Delta t + o(\Delta t) \tag{24}$$


Finally, the probability of more than one ($k$) infections at time $t + \Delta t$, given $i(t) = n$, is

$$P\{i(t + \Delta t) = n + k \mid i(t) = n\} = \binom{n}{k} i(\Delta t)^k q^{n-k} + o(\Delta t) = \binom{n}{k} [\gamma \Delta t + o(\Delta t)]^k q^{n-k} + o(\Delta t) = \binom{n}{k} \gamma^k (\Delta t)^k [1 - \gamma \Delta t + o(\Delta t)]^{n-k} + o(\Delta t) = o(\Delta t) \tag{25}$$

Hence, $i$ is a Markov process having the following Q-matrix

$$Q = \begin{pmatrix} -\gamma & \gamma & 0 & 0 & \cdots \\ 0 & -2\gamma & 2\gamma & 0 & \cdots \\ 0 & 0 & -3\gamma & 3\gamma & \cdots \\ \vdots & \vdots & \vdots & \ddots & \ddots \end{pmatrix}$$

Assuming $\gamma_1 = -\gamma$ and $\gamma_2 = \gamma$, and $i(0) = 0$, it can be readily verified that the infection will evolve as a branching process described by

$$p_n(t) = P\{i(t) = n\} = e^{-\gamma t} (1 - e^{-\gamma t})^{n-1}, \quad \forall (n \ge 1) \tag{26}$$

As can be seen, $i(t)$ is a geometrically distributed random process with parameter $e^{-\gamma t}$ as $t \to \infty$. We can derive Equation (26) by use of the resolvent method as follows. Let $s$ stand for the Laplace parameter needed for the approximation of exponential integrals. To proceed, we need to invert

$$sI - Q = \begin{pmatrix} s + \gamma & -\gamma & 0 & 0 & \cdots \\ 0 & s + 2\gamma & -2\gamma & 0 & \cdots \\ 0 & 0 & s + 3\gamma & -3\gamma & \cdots \\ \vdots & \vdots & \vdots & \ddots & \ddots \end{pmatrix}$$

The coefficients $r_{1n}$, $n = 1, 2, \ldots$ of the inverse matrix satisfy

$$r_{11}(s + \gamma) = 1, \quad \text{and} \quad r_{1,n-1}[-(n-1)\gamma] + r_{1,n}(s + n\gamma) = 0, \quad \text{for } n \ge 2 \tag{27}$$

By solving this we get

$$r_{1n} = \frac{(n-1)\gamma}{s + n\gamma}\, r_{1,n-1} = \gamma^{n-1} (n-1)! \prod_{k=1}^{n} \frac{1}{s + k\gamma}$$

We can use partial fractions to find $a_k$, $k = 1, \ldots, n$ such that

$$r_{1n} = \sum_{k=1}^{n} \frac{a_k}{s + k\gamma}$$

Thus, we obtain

$$\gamma^{(n-1)} (n-1)! = \sum_{k=1}^{n} a_k \prod_{j \ne k} (s + j\gamma)$$

Substituting $s = -\gamma k$ gives

$$\gamma^{(n-1)} (n-1)! = a_k \prod_{j \ne k} [\gamma (j - k)]$$

Hence,

$$a_k = \frac{(n-1)!}{\prod_{j \ne k} (j - k)} = \binom{n-1}{k-1} (-1)^{k-1}$$

from which we have

$$r_{1n} = \sum_{k=1}^{n} \binom{n-1}{k-1} (-1)^{k-1} \frac{1}{s + k\gamma}$$

Inversion of the Laplace transform will give the probability of one infection giving rise to $n$ offspring, $p_{1n}(t)$:

$$p_{1n}(t) = \sum_{k=1}^{n} \binom{n-1}{k-1} (-1)^{k-1} e^{-k\gamma t} = e^{-\gamma t} \sum_{k=0}^{n-1} \binom{n-1}{k} (-1)^{k} e^{-k\gamma t}$$

$$p_n(t) = e^{-\gamma t} (1 - e^{-\gamma t})^{n-1} \tag{28}$$

A similar result is obtained in Ref. [25], which determines the total progeny of the branching process as a Borel-Tanner distribution, i.e.,

$$P\{I = k\} = \frac{I_0}{k\,(k - I_0)!} (k\gamma)^{(k - I_0)} e^{-k\gamma} \tag{29}$$

where $I_0$ denotes the number of initially infected hosts, $P\{I = k\}$ denotes the probability that the total number of hosts infected is $k$ (varying as $k = I_0, I_0 + 1, \ldots$), and $\gamma$ denotes the mean Poisson rate.

The results of a branching process expressed by Equation (28) are shown in Figure 5. The branching rate $\gamma$ is always initially high, and as the branching evolves over time the probability $p_n(t)$ of producing $n$ new offspring diminishes exponentially, conforming to the behavior of the deterministic simple epidemic model. In Figure 5a, we have a constant branching rate and varying target offspring counts of 40, 60, and 100. Figure 5b shows the effect of the branching rate $\gamma$, which indeed depends on the worm scan strategy. For convenience we based our tests on uniform scanning worms.
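The closed form of Equation (28) and the alternating sum obtained from the Laplace inversion can be compared numerically. The sketch below is only an illustration under the assumption of a single initial infected host; the function names `yule_pmf` and `yule_pmf_series` are hypothetical.

```python
from math import comb, exp


def yule_pmf(n, gamma, t):
    """Closed form of Equation (28): e^{-gamma t} (1 - e^{-gamma t})^(n-1), n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    x = exp(-gamma * t)
    return x * (1.0 - x) ** (n - 1)


def yule_pmf_series(n, gamma, t):
    """Alternating-sum form: sum_k C(n-1, k-1) (-1)^(k-1) e^{-k gamma t}."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sum(
        comb(n - 1, k - 1) * (-1) ** (k - 1) * exp(-k * gamma * t)
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    gamma, t = 0.086, 20.0  # gamma taken from the Figure 5a panel title
    for n in (1, 2, 5, 8):
        print(n, yule_pmf(n, gamma, t), yule_pmf_series(n, gamma, t))
```

The alternating sum suffers from cancellation for large $n$ in floating-point arithmetic, so the closed form is the safer choice for values such as $n = 40$ to $100$.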

Figure 5. The initial worm outbreak modeled as a branching process: (a) branching with constant $\gamma = 0.086$ for $n = 40$, $60$, and $100$; (b) branching with constant $n = 40$ for $\gamma = 0.05$, $0.07$, and $0.09$. Both panels plot $p_n(t)$ against time.

5.3.1. Annihilation of the infection growth.

A single branching process can be modeled as an M/M/1 queue, which is stochastic, having exponential branching rate $\gamma$ and exponential annihilation rate $\mu$. That is, infections at a node occur with rate $\gamma$ and removal of the infections occurs with rate $\mu$. Let $i(t)$ be the number of infected nodes in the queue, each undergoing an annihilation process with rate $\mu$. Annihilation is a process where only a pair of nodes (the attacker and the victim) are involved at a time. A worm attacks a susceptible node, the two annihilate each other, and if the susceptible node gets infected it launches one or more uniform scans against a set of other nodes. The scan of other nodes will be broken during an annihilation process if the target node does not get infected. A related Markovian theorem states that if $A$ and $B$ are independent exponentially distributed random variables with rates $\varphi$ and $\phi$, respectively, then their minimum $m(A, B)$ is also exponentially distributed with rate $\varphi + \phi$, and it is independent of the event $A = m(A, B)$ and vice versa. Hence, it is stated that

$$P\{m(A, B) = A\} = \frac{\varphi}{\varphi + \phi} \quad \text{and} \quad P\{m(A, B) = B\} = \frac{\phi}{\varphi + \phi}$$

Then, the Q-matrix of this Markov process becomes

$$Q = \begin{pmatrix} -\gamma & \gamma & 0 & 0 & \cdots \\ \mu & -(\gamma + \mu) & \gamma & 0 & \cdots \\ 0 & \mu & -(\gamma + \mu) & \gamma & \cdots \\ \vdots & \vdots & \vdots & \ddots & \ddots \end{pmatrix}$$

This indicates that, when there is at least one infected host in the queue, the time to the next transition (jump) is exponential with rate $\gamma + \mu$, and the probability that the queue becomes shorter is $\mu/(\gamma + \mu)$.

Now, assuming the same single infection that is being annihilated with probability $\mu \Delta t + o(\Delta t)$ in a small time interval $\Delta t$, the probability of $n$ infections being annihilated can be found by first setting $\gamma_0 = \mu$, $\gamma_1 = -(\gamma + \mu)$, $\gamma_2 = \gamma$, $\forall (n \ge 0)$:

$$p_n(t) = (1 - \tilde{\gamma})(1 - \tilde{\mu})\, \tilde{\gamma}^{\,n-1} \tag{30}$$

where $p_0(t) = \tilde{\mu}$ and the time-dependent quantities $\tilde{\gamma}$ and $\tilde{\mu}$ are given by

$$\frac{\tilde{\mu}}{\mu} = \frac{\tilde{\gamma}}{\gamma} = \begin{cases} \dfrac{1 - e^{(\gamma - \mu)t}}{\mu - \gamma e^{(\gamma - \mu)t}} & \text{if } \gamma \ne \mu \\[2ex] \dfrac{t}{1 + \gamma t} & \text{if } \gamma = \mu \end{cases} \tag{31}$$

The results of the annihilation of a branching process are illustrated in Figure 6. As can be clearly seen from Equation (31), the annihilation is dominated by the value of the annihilation rate $\mu$ and the branching rate $\gamma$ combined with the number of offspring. As shown in Figure 6, the probability of the proliferation of infected nodes intensifies initially but damps slowly during the course of time. The total extinction probability $p_0$ of the branching process having the growth rate $\gamma$ and recovery rate $\mu$ can be easily verified to be

$$p_0 = \begin{cases} \dfrac{\mu}{\gamma} & \text{if } \mu < \gamma \\[1.5ex] 1 & \text{if } \mu > \gamma \end{cases}$$

Figure 6. The annihilation of a worm outbreak modelled as a branching process: (a) annihilation with constant $\mu = 0.001$ and $\gamma = 0.099$ for $n = 40$, $60$, and $100$; (b) annihilation with constant $n = 40$ for ($\gamma = 0.99$, $\mu = 0.001$), ($\gamma = 0.05$, $\mu = 0.005$), and ($\gamma = 0.9$, $\mu = 0.01$). Both panels plot $p_n(t)$ against time.
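The behavior of Equations (30) and (31) can be explored numerically. The sketch below assumes that these equations are the standard solution of a linear birth-death process started from a single infected host, with branching rate $\gamma$ and annihilation rate $\mu$; that reading, and the function names `scaled_rates` and `annihilation_pmf`, are assumptions made for illustration only.

```python
from math import exp, isclose


def scaled_rates(gamma, mu, t):
    """Time-dependent factors of Equation (31), returned as (gamma_t, mu_t)."""
    if isclose(gamma, mu):
        common = t / (1.0 + gamma * t)
    else:
        e = exp((gamma - mu) * t)
        common = (1.0 - e) / (mu - gamma * e)
    return gamma * common, mu * common


def annihilation_pmf(n, gamma, mu, t):
    """Equation (30): probability of n infected nodes at time t (n >= 0)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    g, m = scaled_rates(gamma, mu, t)
    if n == 0:
        return m  # p_0(t): the outbreak is extinct by time t
    return (1.0 - g) * (1.0 - m) * g ** (n - 1)


if __name__ == "__main__":
    # Rates from the Figure 6a panel title; for large t, p_0(t) approaches mu/gamma.
    print(annihilation_pmf(0, 0.099, 0.001, 500.0), 0.001 / 0.099)
```

Under this reading, $p_0(t)$ tends to $\mu/\gamma$ when $\gamma > \mu$ and to 1 when $\gamma < \mu$, and the probabilities over all $n \ge 0$ sum to one at every $t$.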

5.4. Infected-to-quarantine state

Following the discovery of the infections, infected hosts will be delivered to the quarantine service for diagnosis and disinfection operations. We assume here that we have a very large number of nodes initially with a considerably low probability of infection. Thus, the initial process uses the random variable $\kappa$ for Poisson arrivals of infected hosts, assumed both for queueing and quarantine processes:

$$P\{\xi(t) = k\} = \frac{\kappa^k}{k!} e^{-\kappa}, \quad k = 0, 1, 2, \ldots \tag{32}$$
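The assumption of many nodes with a low infection probability is what justifies the Poisson form of Equation (32). As a rough numerical sketch, the code below compares a binomial distribution over $N$ nodes, each infected with probability $p$, with a Poisson distribution of mean $\kappa = Np$. The function names and the sample values of $N$ and $p$ are illustrative assumptions.

```python
from math import comb, exp, factorial


def poisson_pmf(k, kappa):
    """Equation (32): kappa^k / k! * e^{-kappa}."""
    return kappa ** k / factorial(k) * exp(-kappa)


def binomial_pmf(k, n, p):
    """Probability of exactly k infected nodes among n independent nodes."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def max_abs_gap(n, p, k_max=20):
    """Largest pointwise difference between the two distributions for k <= k_max."""
    kappa = n * p
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, kappa))
        for k in range(k_max + 1)
    )


if __name__ == "__main__":
    # Same mean (kappa = 3) with a growing population and shrinking p.
    for n in (100, 10_000, 100_000):
        print(n, max_abs_gap(n, 3.0 / n))
```

The gap shrinks as the population grows with the mean held fixed, which is the regime described above.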

As before, $\kappa = E\xi(t)$ denotes the average number of quarantined nodes. Arrivals to the quarantine service form a Poisson process whose interarrival times are exponentially distributed with parameter $\kappa$. Suppose that the quarantine requests for infected nodes occur randomly in the course of time, and let $\xi(\Delta)$ be the number of requests occurring during the time interval $\Delta$. Then, by first determining the distribution of the random variable $\xi(\Delta)$, we can derive a probability distribution function for the quarantine process. Since the request arrivals for quarantine are independent of one another, we can split the given time slot into the non-overlapping intervals $\Delta_1, \Delta_2, \ldots, \Delta_n$ with arrival counts $\xi(\Delta_1), \xi(\Delta_2), \ldots, \xi(\Delta_n)$. The probability that at least one arrival occurs in a small time interval $\Delta t$ is $\kappa \Delta t + o(\Delta t)$, while the probability that more than one arrival occurs in $\Delta t$ is $o(\Delta t)$. Here, the parameter $\kappa$ denotes the rate of the arrivals at the quarantine service. Let $\xi(t)$ be the total number of arrivals occurring in the interval $(0, t]$. Dividing $(0, t]$ into $n$ equal parts $\Delta_1, \ldots, \Delta_n$, we get

$$\xi(t) = \sum_{k=1}^{n} \xi(\Delta_k)$$

where $\xi(\Delta_1), \ldots, \xi(\Delta_n)$ are independent random variables of which $\xi(\Delta_k)$ is the number of arrivals occurring in the interval $\Delta_k$. The probability generating function of each random variable $\xi(\Delta_k)$ is

$$G_n(z) = \left(1 - \frac{\kappa t}{n}\right) + \frac{\kappa t}{n} z + o\!\left(\frac{t}{n}\right)$$

where, again, $o(t/n)$ is a term of order higher than $t/n$. Since $\xi(\Delta_1), \ldots, \xi(\Delta_n)$ are independent random variables taking the values $0, 1, 2, \ldots$, the random variables $z^{\xi(\Delta_1)}, \ldots, z^{\xi(\Delta_n)}$, where $z$ is a fixed number, are also independent. It follows from the formula of mathematical expectation $E[\xi(\Delta_1)\xi(\Delta_2)] = E[\xi(\Delta_1)]\,E[\xi(\Delta_2)]$ and, setting $\xi_i = \xi(\Delta_i)$ as a shorthand notation, we get

$$E z^{(\xi_1 + \xi_2 + \ldots + \xi_n)} = E z^{\xi_1}\, E z^{\xi_2} \cdots E z^{\xi_n}$$


Thus we have the formula

$$G_\xi(z) = G_{\xi_1}(z)\, G_{\xi_2}(z) \cdots G_{\xi_n}(z) \tag{33}$$

expressing the generating function $G_\xi(z) = E z^{\xi}$ of the sum $\xi = \xi_1 + \ldots + \xi_n$ of the $n$ random variables $\xi_1, \ldots, \xi_n$ in terms of the generating functions $G_{\xi_k}(z) = E z^{\xi_k}$ of the $n$ separate summands. Hence, by combining Equation (33) with the above explanation, the generating function of $\xi(t)$ is

$$G(z) = [G_n(z)]^n = \left[1 - \frac{\kappa t}{n} + \frac{\kappa t}{n} z + o\!\left(\frac{t}{n}\right)\right]^n = \left[1 + \frac{\kappa t (z - 1)}{n} + o\!\left(\frac{t}{n}\right)\right]^n$$

Since $G(z)$ is independent of the subintervals $\Delta_1, \ldots, \Delta_n$, we can take the limit as $n \to \infty$, achieving

$$G(z) = \lim_{n \to \infty} \left[1 + \frac{\kappa t (z - 1)}{n} + o\!\left(\frac{t}{n}\right)\right]^n = e^{\kappa t (z - 1)} \tag{34}$$

This is the generating function of a Poisson distribution with parameter $\kappa t$, so that

$$P\{q(t) = k\} = \frac{(\kappa t)^k}{k!} e^{-\kappa t}, \quad k = 0, 1, 2, \ldots \tag{35}$$

Since $\kappa t = i_0 E\xi(t)$, the parameter $\kappa$ is the average number of arrivals of the infected nodes at the quarantine queue occurring per unit time. Here, $i_0$ denotes the number of infections initially present. The quarantine process described by Equation (35) converges to a binomial process for steady state and, in general, for smaller network sizes with relatively higher probability of infections,

$$P\{q(t) = k\} = \binom{Q}{k} p(t)^k [1 - p(t)]^{Q-k}, \quad k = 0, 1, \ldots, Q \tag{36}$$

Here, $p$ denotes the probability of finding an infected node in a series of identical experiments, and $Q$ denotes the number of consecutive Bernoulli trials in which there are precisely $k$ infectious nodes found.

5.5. Transition probabilities of quarantine-to-repair state

The quarantine process deals with diagnosis and disinfection of the infected computers, leaving the repair to an overall recovery process (next section) by the repair service. Requests for recovery of the quarantined nodes are modeled as a random flow of service calls arriving at the repair center, where the incoming traffic is of the Poisson type described above, with average density $\kappa$. It generally takes a random amount of time to repair an infected node. Also considering the incoming service requests of the Poisson type, the repair center has exponential arrival rate $\alpha$ and departure rate $\mu$. The random flow of service requests arriving at the quarantine queue for repair presents an exponentially distributed arrival with density $\alpha$. Thus, $\alpha \Delta t + o(\Delta t)$ is the probability that at least one call (infected host) arrives in the small time interval $\Delta t$. We assume that the random repair time $T$ for each incoming repair request at time $t$ has an exponential distribution with parameter $\mu$, i.e.,

$$P\{T > t\} = e^{-\mu t} \tag{37}$$

The repairman (or system) has two states, free $s_0$ and busy $s_1$. Suppose that the system is in the free state at time $t_0$; then its subsequent behavior does not depend on its previous history, since the jobs arrive independently. The probability $p_{01}$ of the system going from state $s_0$ to state $s_1$ during a small interval of time $\Delta t$ is $\alpha \Delta t + o(\Delta t)$. Hence, the rate of the transition from $s_0$ to $s_1$ equals $\alpha$. Now, suppose the system is in the busy state $s_1$; then the probability $p_{10}(\Delta t)$ of the system going from the busy state $s_1$ to the free state $s_0$ after a time $\Delta t$ is just the probability that the repair service will fail to last another $\Delta t$ time units. Thus, suppose that at time $t$ the service has already been in progress for exactly $\tau$ time units, then

$$p_{10}(\Delta t) = 1 - P\{T > \tau + \Delta t \mid T > \tau\} = 1 - \frac{P\{T > \tau + \Delta t\}}{P\{T > \tau\}} \tag{38}$$

Hence, regarding Equation (37), we get the transition rate out of the repair state as

$$p_{10}(\Delta t) = 1 - \frac{e^{-\mu(\tau + \Delta t)}}{e^{-\mu \tau}} = 1 - e^{-\mu \Delta t} \tag{39}$$

As already noticed, the system can be described by a Markov process satisfying the conditions

$$p_{01}(t) = 1 - p_{00}(t), \quad p_{10}(t) = 1 - p_{11}(t) \tag{40}$$

where $p_{00}(t)$ denotes the probability of staying in the free state and $p_{11}(t)$ denotes the probability of staying in the busy state. Moreover, let

$$\alpha_{00} = -\alpha, \quad \alpha_{01} = \alpha, \quad \alpha_{10} = \mu, \quad \alpha_{11} = -\mu \tag{41}$$

and

$$p_{ij}(0) = \begin{cases} 1 & \text{if } j = i \\ 0 & \text{if } j \ne i \end{cases} \tag{42}$$

Assume that the rate of the transition out of state $s_i$ is denoted by $\alpha_i$ and from state $s_i$ to state $s_j$ is denoted by $\alpha_{ij}$. That is, during such a transition, the number of entities $i$ changes to $j$. Since completions of the service times are exponentially distributed with parameter $\mu$, from Equation (39) we have

$$p_{10}(\Delta t) = 1 - e^{-\mu \Delta t} = \mu \Delta t + o(\Delta t) \tag{43}$$
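The memoryless step from Equation (38) to Equation (39), and the small-interval approximation of Equation (43), can be checked with a few lines of code. This is only a sketch; the function names `survival` and `p10` and the numerical values are illustrative assumptions.

```python
from math import exp


def survival(mu, t):
    """Equation (37): probability that a repair lasts longer than t."""
    return exp(-mu * t)


def p10(mu, dt, tau=0.0):
    """Equation (38): probability that a repair already running for tau
    time units completes within the next dt time units."""
    return 1.0 - survival(mu, tau + dt) / survival(mu, tau)


if __name__ == "__main__":
    mu = 0.5
    # Same value regardless of how long the repair has already been running.
    print(p10(mu, 0.2, tau=0.0), p10(mu, 0.2, tau=7.0))
    # For a small interval, p10(dt) / dt is close to the repair rate mu.
    print(p10(mu, 1e-6) / 1e-6)
```

The result does not depend on $\tau$, which is why the busy-to-free transition can be described by the single rate $\mu$.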

Hence, the transition probabilities $p_{ij}$ are such that

$$1 - p_{ii}(\Delta t) = \alpha_i \Delta t + o(\Delta t), \quad i = 1, 2, \ldots$$

$$p_{ij}(\Delta t) = \alpha_{ij} \Delta t + o(\Delta t), \quad j \ne i, \ \forall (i, j \ge 1) \tag{44}$$

and $\alpha_{ii} = -\alpha_i$. Then, considering the initial conditions given by (41) and Equation (42), the transition probabilities will satisfy two sets of linear differential equations, i.e., the forward (FKE) and backward (BKE) Kolmogorov equations, respectively:

$$\text{FKE}: \ p'_{ij}(t) = \sum_k p_{ik}(t)\, \alpha_{kj}, \qquad \text{BKE}: \ p'_{ij}(t) = \sum_k \alpha_{ik}\, p_{kj}(t), \quad \forall (i, j \ge 1) \tag{45}$$

Hence, using Equation (40) and definitions (41), Equation (45) will evolve as

$$p'_{00}(t) = \alpha_{00}\, p_{00}(t) + \alpha_{10}\, p_{01}(t) = -\alpha p_{00}(t) + \mu [1 - p_{00}(t)]$$

$$p'_{11}(t) = \alpha_{01}\, p_{10}(t) + \alpha_{11}\, p_{11}(t) = \alpha [1 - p_{11}(t)] - \mu p_{11}(t) \tag{46}$$

i.e., again from Equations (40), (41), and (46) we have

$$\mu = p'_{00}(t) + (\alpha + \mu)\, p_{00}(t), \qquad \alpha = p'_{11}(t) + (\alpha + \mu)\, p_{11}(t) \tag{47}$$

Assuming the initial conditions $p_{00}(0) = p_{11}(0) = 1$, we can solve Equation (47) to give the state probabilities as

$$p_{00}(t) = \left(1 - \frac{\mu}{\alpha + \mu}\right) e^{-(\alpha + \mu)t} + \frac{\mu}{\alpha + \mu}, \qquad p_{11}(t) = \left(1 - \frac{\alpha}{\alpha + \mu}\right) e^{-(\alpha + \mu)t} + \frac{\alpha}{\alpha + \mu} \tag{48}$$

Defining

$$\sigma = \alpha + \mu, \quad \zeta = \frac{\mu}{\alpha + \mu}, \quad \rho = \frac{\alpha}{\alpha + \mu} \tag{49}$$

we write the state transition probabilities as

$$p_{00}(t) = (1 - \zeta) e^{-\sigma t} + \zeta$$

$$p_{11}(t) = (1 - \rho) e^{-\sigma t} + \rho$$

$$p_{01}(t) = 1 - [(1 - \zeta) e^{-\sigma t} + \zeta] = 1 - p_{00}(t)$$

$$p_{10}(t) = 1 - [(1 - \rho) e^{-\sigma t} + \rho] = 1 - p_{11}(t) \tag{50}$$

5.5.1. Steady state of the quarantine queue.

We assume that, due to having a single repairman, the quarantine process works as a birth-death (Markov) process, where birth and death represent the arrival at and departure from the quarantine service, respectively. It should be noted that conditions and expressions can be easily derived for the case of multiple repairmen based on the Markovian approach. The process is specified by birth rates $\alpha$ (entering quarantine) and death rates $\mu_r$ (leaving quarantine while entering recovery). Let $p_n$ denote the equilibrium probability of state $n$. Since state (number in the system) transitions follow a Markovian process, their flow rates must balance in a steady state, similarly for each pair of states, as

$$\alpha p_{n-1} = \mu_r p_n, \qquad \alpha p_n = \mu_r p_{n+1}$$

so that

$$p_n = \frac{\alpha}{\mu_r}\, p_{n-1} \tag{51}$$

The probability of having $n$ machines under disinfection is denoted by $p_n$. The recurrence formula of moving from $p_{n-1}$ to $p_n$ can be used to determine the steady state probability space:

$$p_1 = \frac{\alpha}{\mu_r}\, p_0, \qquad p_2 = \frac{\alpha}{\mu_r}\, p_1 = \left(\frac{\alpha}{\mu_r}\right)^2 p_0, \qquad \ldots, \qquad p_n = \left(\frac{\alpha}{\mu_r}\right)^n p_0 \tag{52}$$

Since the steady state probabilities must sum to 1 as

$$1 = \sum_{n=0}^{\infty} p_n = \sum_{n=0}^{\infty} \left(\frac{\alpha}{\mu_r}\right)^n p_0 = p_0 \sum_{n=0}^{\infty} \left(\frac{\alpha}{\mu_r}\right)^n = \frac{p_0}{1 - \frac{\alpha}{\mu_r}} \tag{53}$$

and thus,

$$p_0 = 1 - \frac{\alpha}{\mu_r}$$

which is the proportion of the time the system is idle. Thus, for $n > 0$ we obtain the probability, $p_n$, that there are $n$ machines under quarantine (queue + disinfection) in the system as

$$p_n = \left(\frac{\alpha}{\mu_r}\right)^n p_0 = \left(\frac{\alpha}{\mu_r}\right)^n \left(1 - \frac{\alpha}{\mu_r}\right)$$

5.6. Recovered-to-healthy state

Assume that we have now many hosts in the recovery queue awaiting repair (i.e., OS patch and update) operations. Following the disinfection, some necessary patches and updates of the operating system and other applications 231

Internet epidemiology

S. Kondakci and C. Dincer

running on that system must be performed before harnessing the host. Total repair times are exponentially distributed random time intervals, assuming that each host waits for T random time units and requires another random time unit v to be repaired, then, P{T ≥ v} = e−µh v will be used to determine the probability of the total time required for the recovery (repair) operation. The times between repairs (sojourn times), S, is a Poisson process with the exponential distribution parameter µh , which is indeed the average ﬂow rate from recovered to healthy state. Thus, the continuous random variable S has the probability density function

machine i and its repair time are exponentially distributed random variables with rates λ and µ, respectively. This is the classical machine repairmen problem, [46--48]. Assume that each repair occurs as a Markovian process at rate µ min[N(t), R(t)], where N(t) is the number of machines failed and awaiting repair at time t, and R(t) = s is the number of repairmen on duty. Thus, this simple birth--death Markov process, wherein jumps in state N(t) occur at exponentially distributed time intervals: P{N(t + t) = i + 1|N(t) = i} = (n−i)λt + o(t) = λi t + o(t) P{N(t + t) = i−1|N(t) = i} = min(i, s)µt + o(t) = µi t + o(t)

fS (t) = µh e−µh t

(55)

and the mean time between repairs is E(S) =

1 µh

Utilization of the repair system deﬁned as the proportion of time the repair system is busy, is given as ρ = 1−p0 , where p0 is the proportion of time the system is idle. Assuming the steady state, the arrival rate at the repair service, µr , is equal to departure (or repair) rate, µh . Thus, from the queueing theory, we have the following balance equations leading to rate-in = rate-out in the steady state. µr = 0p0 + µh (1−p0 ) µr p0 = 1− µh µr ρ= µh

p10 (t) = 1−p11 (t) = 1−e−µh t = µh t + o(t) (54)

This type of a random process can be modeled as a birth-death process as a special case of Markov chains where the states represent the current size of the related population and the transitions are limited to births and deaths. In the REM, the arrival of disinfected machines at the repair service is represented by a birth and a repaired + updated machine is denoted by a death process. The transition rates of the general birth--death are given by: pij =

λ i

µi

µ i + λi i

j = i + 1(birth = arrival) j = i−1(death = departure) j = 0 (no change = under repair) i = j = 0(0state is absorbing)

where λi and µi denote the ith birth and death rates, respectively. In order for system to reach i + 1 from i for the ﬁrst time it must either do so on the ﬁrst transition out of state i, or else drop back to i−1, return to i and try again. The arrival rate from quarantine, µr , and departure rate to healthy, µh , parameters affect the ﬁnal repair rate of the repair facilities. Obviously, the nodes are repairable systems with multiple states, up, down, and repair. Consider a network with m machines, which are maintained by s repairmen. The time to failure (up time) of 232

where λ_i = µ_r (rate to recovery) and µ_i = µ_h (rate to health) are the general arrival and departure transition rates in state i. Further reading can be found in Refs. [49,50], where [49] contains a comprehensive treatment of stochastic models related to system performance and reliability, and [50] presents a Markov model for multistate repairable systems. Consequently, the transition rate to the healthy state given by Eq. (55) is in accordance with Equations (37) through (40): (56)

5.6.1. Throughput of repair facilities

The capacity of the recovery system depends mainly on the number of repairmen; whether a single repairman or multiple repairmen are assigned dominates the overall efficiency of the recovery service. For the case of a single repairman and n repair requests, we have the overall throughput ratio as

$$\rho_n = \left(\frac{\mu_r}{\mu_h}\right)^n = \rho^n \qquad (57)$$

and, therefore, from queueing theory, the probability of having n machines under repair, denoted by p_n, is

$$p_n = \rho^n p_0, \quad \text{for } n = 1, 2, \ldots \qquad (58)$$

where, implicitly,

$$p_0 = \frac{1}{1 + \sum_{n=1}^{\infty} \rho^n} = \left(\sum_{n=0}^{\infty} \rho^n\right)^{-1} = \left(\frac{1}{1 - \rho}\right)^{-1} = (1 - \rho) \qquad (59)$$
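The single-repairman result can be checked numerically. The sketch below, under the assumption ρ = µ_r/µ_h < 1 that the geometric series in Eq. (59) requires, evaluates p_0 = 1 − ρ and p_n = ρ^n p_0 from Eq. (58). The function name and the rates µ_r = 2 and µ_h = 4 are hypothetical and serve only as an example.

```python
def single_repairman_probs(mu_r, mu_h, n_max):
    """Steady-state probabilities p_0..p_n_max for one repairman.

    Uses rho = mu_r / mu_h, p_0 = 1 - rho (Eq. 59) and p_n = rho**n * p_0 (Eq. 58).
    """
    rho = mu_r / mu_h
    if not 0.0 <= rho < 1.0:
        raise ValueError("a steady state requires mu_r < mu_h")
    p0 = 1.0 - rho
    return [rho**n * p0 for n in range(n_max + 1)]


# Hypothetical rates: arrivals at 2 per time unit, repairs at 4 per time unit.
probs = single_repairman_probs(mu_r=2.0, mu_h=4.0, n_max=3)
utilization = 1.0 - probs[0]  # rho = 1 - p_0
print(probs, utilization)
```

With these values the utilization equals ρ = 0.5, in line with ρ = 1 − p_0 from the balance equations.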


Thus, the steady-state probability for the state n is

$$p_n = \rho^n p_0 = \rho^n (1 - \rho), \quad \text{for } n = 1, 2, \ldots \qquad (60)$$

Regarding the case of multiple repairmen, the overall throughput ratio factor is

$$\rho_n = \frac{(\mu_r/\mu_h)^n}{n!}, \quad \text{for } n = 0, 1, \ldots, s \qquad (61)$$

where s is the number of repairmen available, and

$$\rho_n = \frac{(\mu_r/\mu_h)^s}{s!} \left(\frac{\mu_r}{s\mu_h}\right)^{n-s} = \frac{(\mu_r/\mu_h)^n}{s!\, s^{n-s}}, \quad n = s, s + 1, \ldots \qquad (62)$$

Assuming µ_r < sµ_h, which is often a desired case, then

$$p_0 = \left[\sum_{n=0}^{s-1} \frac{(\mu_r/\mu_h)^n}{n!} + \frac{(\mu_r/\mu_h)^s}{s!\,[1 - (\mu_r/s\mu_h)]}\right]^{-1} \qquad (63)$$

Thus, the probability p_n is

$$p_n = \begin{cases} \dfrac{(\mu_r/\mu_h)^n}{n!}\, p_0 & \text{if } 0 \le n \le s \\[2ex] \dfrac{(\mu_r/\mu_h)^n}{s!\, s^{n-s}}\, p_0 & \text{if } n \ge s \end{cases}$$

Based on the Erlang distribution and setting ρ_s = µ_r/sµ_h, we get the expected number of machines in the repair system (queue + repair) for a multiple-repair facility as

$$L = \frac{p_0 (\mu_r/\mu_h)^s \rho_s}{s!\,(1 - \rho_s)^2} + \frac{\mu_r}{\mu_h} \qquad (64)$$

Obviously, the first term,

$$L_q = \frac{p_0 (\mu_r/\mu_h)^s \rho_s}{s!\,(1 - \rho_s)^2} \qquad (65)$$

gives the expected queue length waiting for repair. Figure 7 shows the simulation results of three experiments using random arrival and recovery times with a repair capacity of four repairmen. Initially we have M = 100 machines, N = 98 of which are up and running, so that two repairmen are idle and two begin recovering the failed machines immediately at time t = 0. The mean arrivals at the recovery queue are represented by λ, and the mean recovery time is denoted by µ time units.

[Figure 7: plot titled "M = 100, N = 98, Repairmen = 4"; x-axis: Time; three curves labeled λ = 0.5, µ = 8; λ = 1, µ = 10; and λ = 1, µ = 4.]

Figure 7. A service facility with four repairmen, where the infection and repair rates are random.
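The multiple-repairmen formulas in Equations (63) to (65) can be evaluated directly. The following is a minimal sketch, assuming the stability condition µ_r < sµ_h stated above; the function name and the sample rates are hypothetical and are not the parameters behind Figure 7.

```python
from math import factorial


def multi_repairman_metrics(mu_r, mu_h, s):
    """Return (p_0, L_q, L) for a repair facility with s repairmen.

    p_0 follows Eq. (63), L_q follows Eq. (65), and L = L_q + mu_r / mu_h
    follows Eq. (64), with rho_s = mu_r / (s * mu_h).
    """
    load = mu_r / mu_h          # offered load mu_r / mu_h
    rho_s = load / s            # per-repairman utilization
    if rho_s >= 1.0:
        raise ValueError("a steady state requires mu_r < s * mu_h")
    head = sum(load**n / factorial(n) for n in range(s))
    tail = load**s / (factorial(s) * (1.0 - rho_s))
    p0 = 1.0 / (head + tail)
    lq = p0 * load**s * rho_s / (factorial(s) * (1.0 - rho_s) ** 2)
    return p0, lq, lq + load


# Hypothetical example: two repairmen, arrivals and repairs both at rate 1.
p0, lq, l = multi_repairman_metrics(mu_r=1.0, mu_h=1.0, s=2)
print(f"p0 = {p0:.4f}, Lq = {lq:.4f}, L = {l:.4f}")
```

Setting s = 1 reduces the computation to the single-repairman case, where p_0 = 1 − ρ, which gives a quick consistency check between the two sets of formulas.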

6. EXPERIMENTS

We have set up and run two sets of experiments, each configured differently to represent systems with constant and random flow rates, respectively. In this combined failure and repair model, failed-machine arrivals at the quarantine service occur in a random fashion, i.e., arrivals enter only at discrete times 0, 1, . . . , k, and service completions occur only at such times. Following the earlier discussion, changes in the flow densities can briefly be expressed as

H → S : β(t + 1) = β(t) × δs(t)
S → I : γ(t + 1) = γ(t) × δi(t)
I → Q : κ(t + 1) = κ(t) × δq(t)
Q → R : α(t + 1) = α(t) × δr(t)
R → H : µ(t + 1) = µ(t) × δh(t)

The evolution of the propagation and extinction processes of random scan worms is mathematically justified in Section 5. The results obtained by the experiments and simulations also conform to the theoretical analysis given therein.

Figures 8 and 9 illustrate results of the experiments based on constant arrival and departure rates within each state of the recurrent model. The rates of change for a network of 100 nodes using the parameters with the initial values of β = 0.21, $\overleftarrow{s}$ = 0.03, γ = 0.76, κ = 0.62, α = 0.83 and µ = 0.88 are shown in Figure 8a. Figure 8b illustrates the effect of increased susceptibility and constant latent times for the same experiment, but with the infection, quarantine, and recovery processes of each state changed to behave in a slightly random manner. The parameters for this experiment are β = 0.81, $\overleftarrow{s}$ = 0.03, γ = 0.76, κ = 0.62, α = 0.83 and µ = 0.88. Changing the infection, quarantine, and recovery parameters causes significant changes in the states. In particular, the inclusion of delay times in passages (latent times) leads to a cascade of latencies between the five states.

[Figure 8: two panels, (a) "Lower Susceptibility & Zero Latent Times" and (b) "Higher Susceptibility & Random Process Times"; x-axis: TIME; y-axis: Number of Machines; curves: Susceptible, Infected, Quarantined, Recovered.]

Figure 8. The effect of constant arrival and departure rates; (a) lower susceptibility, (b) higher susceptibility, random infection, quarantine and recovery processes for a network of 100 nodes.

The result of the experiment with initial parameters β = 0.86, $\overleftarrow{s}$ = 0.028, γ = 0.76, κ = 0.63, α = 0.83, and µ = 0.88 is depicted in Figure 9. Here, we examine the effect of increased latent time before a machine becomes infectious, i.e., we have increased the infectious-to-quarantine time and the quarantine-to-recovery delays to monitor the overall effect.

[Figure 9: plot titled "Random Passage (Latent) Times"; x-axis: TIME; y-axis: Number of Machines; curves: Susceptible, Infected, Quarantined, Recovered.]

Figure 9. The result of inserted latent time for infection, and effect of higher delays for quarantine and recovery services; β = 0.86, $\overleftarrow{s}$ = 0.028, γ = 0.76, κ = 0.63, α = 0.83 and µ = 0.88.
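The multiplicative flow-density updates listed at the start of this section can be sketched as a single discrete-time step. The snippet below is only an illustration: the dictionary keys, the `step` function, and the choice of all change factors δ equal to 1 as a stand-in for the constant-rate case are assumptions, while the initial rates are the values quoted for Figure 8a.

```python
# Each flow density is paired with the change factor of its transition.
TRANSITIONS = {
    "beta": "delta_s",   # H -> S
    "gamma": "delta_i",  # S -> I
    "kappa": "delta_q",  # I -> Q
    "alpha": "delta_r",  # Q -> R
    "mu": "delta_h",     # R -> H
}


def step(rates, deltas):
    """One update x(t + 1) = x(t) * delta(t) for all five flow densities."""
    return {name: rates[name] * deltas[d] for name, d in TRANSITIONS.items()}


# Initial values quoted for Figure 8a.
rates = {"beta": 0.21, "gamma": 0.76, "kappa": 0.62, "alpha": 0.83, "mu": 0.88}

# Assumed constant-rate case: every change factor equals 1.
constant = {d: 1.0 for d in TRANSITIONS.values()}

history = [rates]
for _ in range(3):
    history.append(step(history[-1], constant))
print(history[-1])
```

Replacing the constant factors with values drawn at each time step would turn the same loop into the randomized variant of the experiments.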

6.1. Modified recurrent model

All the experiments thus far have assumed constant mean arrival and departure rates, regardless of the latent time delays in the infection process and fluctuations in quarantine and repair times. First of all, a node does not become infectious (a transmitter) immediately after an attack has occurred; it takes a random time, depending on the usage of the computer and the infection type, for a node to become a transmitter. Likewise, the repair facilities have a random throughput, depending on the repairman's abilities and the extent of the damage to the system under repair. It is more realistic to have random arrival and departure rates; thus, we modify the system so that the related parameters are uniformly random. We must incorporate the random latent time periods and random service rates into the modified model. Figure 10 illustrates the output of the modified model with the random parameters incorporated.

[Figure 10: plot titled "Random Latency and Passage Times"; x-axis: TIME; y-axis: Number of Machines; curves: Infected (I), Quarantined (Q), Recovered (R).]

Figure 10. Random rates for a network of 100 nodes of REM using random latent time parameters and random process behaviors.

It should be noted that, for the above parameters, the basic REM shown in Figure 8 behaves in a significantly different way compared to the advanced (randomized) model shown in Figure 10. However, after a long run of experiments with a large number of hosts, a steady state with similar shapes will be reached, in which the latencies between the states have a significant effect. As shown in Figure 10, delays caused by the latent time during the infection period, user vigilance, and reduced recovery capacity lead to increased delays spreading over a longer time interval. Unfortunately, the lack of real data makes it difficult to compare the simulation results to real-life scenarios.
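The randomization described in this subsection can be sketched as follows. The paper states only that the related parameters are uniformly random, so the ±10% spread around each base rate, the maximum latent delay of 5 time steps, and the function names are assumptions made for illustration; the base rates are the values quoted for Figure 9.

```python
import random


def randomized_rates(base, spread, rng):
    """Draw each rate uniformly from [v * (1 - spread), v * (1 + spread)],
    capped at 1.0 so that it remains a valid rate."""
    drawn = {}
    for name, value in base.items():
        low, high = value * (1.0 - spread), value * (1.0 + spread)
        drawn[name] = min(1.0, rng.uniform(low, high))
    return drawn


def random_latency(max_delay, rng):
    """Uniform integer latent time, in time steps, between 0 and max_delay."""
    return rng.randint(0, max_delay)


rng = random.Random(42)  # fixed seed so that the sketch is reproducible

# Base rates quoted for Figure 9.
base = {"beta": 0.86, "gamma": 0.76, "kappa": 0.63, "alpha": 0.83, "mu": 0.88}

sample = randomized_rates(base, spread=0.1, rng=rng)
delay = random_latency(max_delay=5, rng=rng)
print(sample, delay)
```

Drawing a fresh sample at every time step, and holding a machine in its current state for `delay` steps before it moves on, is one way to reproduce the kind of latency cascade the text describes.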

6.2. Optimization parameters

It can be readily observed that the infection and susceptibility rates in this propagation and extinction model play the central role in the growth of repair requests, while the quarantine and recovery rates affect the final recovery process. Often, once a machine in a network is infected, the infection will spread immediately if the susceptibility rate is high enough, and hence requests for machine repair will increase. If the repair service (quarantine + recovery) has a shortage of resources, then the system may lock at some point. The recurrent system is typically controlled by two joint sets of parameters:

1. Susceptibility and infection parameters, δs and δi.
2. Discovery, quarantine, and recovery parameters, δq, δr, and δh.

Susceptibility depends mainly on network topology, degree of connection, and frequency of machine usage. Since the usage of machines cannot be controlled for individuals, susceptibility can only be reduced by using virtually isolated networks and by training the users. Thus, the associated parameter δs can be controlled by two additional coefficients: topology (e.g., the number of hosts affected by each other) and user vigilance. The susceptible-to-infected parameter δi depends on δs, the protection strength of the hosts, and the strength of the attacker. The infected-to-quarantine parameter δq, the quarantine-to-recovered parameter δr, and the recovered-to-healthy parameter δh depend on a combination of user vigilance and the capabilities of the repair facilities and network administrators. These dependencies are reflected as a total delay in the recovery process, which is also illustrated in Figure 10.

7. CONCLUSIONS

There exist numerous worm propagation models based on classical epidemic models, which are similar in context and mostly deterministic. Few of the recent models go deeper into the details of stochastic approaches. We presented here a purely stochastic model covering a broad aspect of Internet epidemiology, in which the propagation and recovery models are discussed in detail. Each unique propagation model should have a corresponding recovery model that aligns with the characteristics of the propagation in order to achieve better results. In contrast to bare worm propagation modeling, we need to focus more on the accompanying recovery modeling. Here, we have presented two sets of unique stochastic models: a worm propagation model and a recovery model associated with it. Both the propagation and recovery processes were shown to be recurrent. It is important to determine the worm propagation rate in a dynamic context. In many cases, all nodes in a network can be considered susceptible if one is a transmitter. Most of the classical epidemic models rely on this assumption; however, in a realistic case, this might be critically misleading. Our model makes a clear distinction based on the protection level of the nodes combined with the strength of attackers, the recurrence parameters, and the infection, spread, and recovery rates. Therefore, we have presented a more detailed, realistic model consisting of the most fundamental parameters. Among them, the rates of susceptibility, quarantine, and infection, and the capabilities of the recovery facilities, played fundamental roles in configuring the overall propagation, spread, and extinction rates. By modifying these parameters, one can capture the dynamics of a realistic epidemic picture of almost any population. In addition to the analysis presented here, further studies on the presented model are needed.

It was observed that short infection periods may be achieved by fast quarantine and recovery techniques or by inherent system immunity. Extremely short infection periods are not considered realistic infections. Therefore, it is important to determine whether the infection periods are abnormally short before fitting the data to an experimental distribution. Heterogeneity of victim systems is an important issue to consider, which can be studied by modeling the behaviors of the infection and recovery states with modified random parameters. This analysis is important for building a proper model with realistic latent time periods and worm characteristics. Furthermore, the model presented must be expanded to incorporate new births (recently added nodes), nonrecovered (dead) nodes, and agents exported to external networks (e.g., infected messages). Due to the limited scope and size of the paper, we were unable to present the entire work here. Ideally, a holistic recurrent model for the entire Internet deserves a separate treatment.


REFERENCES

1. Rohloff KR, Başar T. Deterministic and stochastic models for the detection of random constant scanning worms. ACM Transactions on Modeling and Computer Simulation 2008; 18(2): 1–24, DOI: http://doi.acm.org/10.1145/1346325.1346329.
2. Kondakci S. Epidemic state analysis of computers under malware attacks. Simulation Modelling Practice and Theory 2008; 16(5): 571–584, DOI: 10.1016/j.simpat.2008.02.011.
3. Rohloff K, Basar T. Stochastic behavior of random constant scanning worms. In Proceedings of the 14th International Conference on Computer Communications and Networks (ICCCN 2005), 2005; 339–344, DOI: 10.1109/ICCCN.2005.1523881.
4. Nicol DM. The impact of stochastic variance on worm propagation and detection. In WORM ’06: Proceedings of the 4th ACM workshop on Recurring malcode. ACM: New York, NY, USA, 2006; 57–64, DOI: http://doi.acm.org/10.1145/1179542.1179555.
5. Anderson H, Britton T. Stochastic Epidemic Models and Their Statistical Analysis, Lecture Notes in Statistics. Springer: New York, 2000.
6. Avlonitis M, Magkos E, Stefanidakis M, Chrissikopoulos V. A novel stochastic approach for modeling random scanning worms. PCI ’09: 13th Panhellenic Conference on Informatics, 2009; 176–179, DOI: 10.1109/PCI.2009.20.
7. Zou CC, Towsley D, Gong W. On the performance of internet worm scanning strategies. Performance Evaluation 2006; 63(7): 700–723, DOI: 10.1016/j.peva.2005.07.032.
8. Mukhopadhyay B, Bhattacharyya R. Existence of epidemic waves in a disease transmission model with two-habitat population. Int. J. Syst. Sci. 2007; 38(9): 699–707, DOI: http://dx.doi.org/10.1080/00207720701596417.
9. d’Onofrio A. Biomathematical analysis and extension of the new class of epidemic models proposed by Satsuma et al. (2004). Applied Mathematics and Computation 2005; 170(1): 125–134, DOI: 10.1016/j.amc.2004.10.083.
10. Barthélemy M, Barrat A, Pastor-Satorras R, Vespignani A. Dynamical patterns of epidemic outbreaks in complex heterogeneous networks. Journal of Theoretical Biology 2005; 235(2): 275–288, DOI: 10.1016/j.jtbi.2005.01.011.
11. Liljenstam M, Nicol DM, Berk VH, Gray RS. Simulating realistic network worm traffic for worm warning system design and testing. In WORM ’03: Proceedings of the 2003 ACM workshop on Rapid Malcode. ACM Press: New York, NY, USA, 2003; 24–33, DOI: http://doi.acm.org/10.1145/948187.948193.
12. Moore D, Shannon C, Claffy K. Code-red: a case study on the spread and victims of an internet worm. In IMW ’02: Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurment. ACM Press: New York, NY, USA, 2002; 273–284, DOI: http://doi.acm.org/10.1145/637201.637244.
13. Moore D, Paxson V, Savage S, Shannon C, Staniford S, Weaver N. Inside the slammer worm. IEEE Security & Privacy 2003; 1(4): 33–39, DOI: http://dx.doi.org/10.1109/MSECP.2003.1219056.
14. Sharif MI, Riley GF, Lee W. Comparative study between analytical models and packet-level worm simulations. In PADS ’05: Proceedings of the 19th Workshop on Principles of Advanced and Distributed Simulation. IEEE Computer Society: Washington, DC, USA, 2005; 88–98, DOI: http://dx.doi.org/10.1109/PADS.2005.5.
15. Daley D, Gani J. Epidemic Modeling: An Introduction. Cambridge University Press: Cambridge, U.K., 1999.
16. Brauer F, Castillo-Chavez C. Mathematical Models in Population Biology and Epidemiology. Springer: Berlin, 2001.
17. Diekmann O, Heesterbeek J. Mathematical Epidemiology of Infectious Disease, Wiley Series in Mathematical and Computational Biology. Wiley: Chichester, 2000.
18. Hethcote H. The mathematics of infectious disease. SIAM Review 2000; 42(4): 599–653.
19. Qing S, Wen W. A survey and trends on internet worms. Computers & Security 2005; 24(4): 334–346, DOI: 10.1016/j.cose.2004.10.001.
20. Ellis D. Worm anatomy and model. In WORM ’03: Proceedings of the 2003 ACM workshop on Rapid Malcode. ACM Press: New York, NY, USA, 2003; 42–50, DOI: http://doi.acm.org/10.1145/948187.948196.
21. Wong C, Bielski S, McCune JM, Wang C. A study of mass-mailing worms. In WORM ’04: Proceedings of the 2004 ACM workshop on Rapid malcode. ACM Press: New York, NY, USA, 2004; 1–10, DOI: http://doi.acm.org/10.1145/1029618.1029620.
22. Weaver N, Paxson V, Staniford S, Cunningham R. A taxonomy of computer worms. In WORM ’03: Proceedings of the 2003 ACM workshop on Rapid Malcode. ACM Press: New York, NY, USA, 2003; 11–18, DOI: http://doi.acm.org/10.1145/948187.948190.
23. Shannon C, Moore D. The spread of the witty worm. IEEE Security & Privacy 2004; 2(4): 46–50, DOI: http://dx.doi.org/10.1109/MSP.2004.59.
24. Li P, Salour M, Su X. A survey of internet worm detection and containment. IEEE Communications Surveys & Tutorials 2008; 10(1): 20–35, DOI: 10.1109/COMST.2008.4483668.
25. Sellke SH, Shroff NB, Bagchi S. Modeling and automated containment of worms. IEEE Transactions on Dependable and Secure Computing April–June 2008; 5(2): 71–86, DOI: 10.1109/TDSC.2007.70230.
26. Wei S, Mirkovic J, Swany M. Distributed worm simulation with a realistic internet model. PADS 2005: Workshop on Principles of Advanced and Distributed Simulation, 2005; 71–79, DOI: 10.1109/PADS.2005.7.
27. Zou CC, Towsley D, Gong W. Modeling and simulation study of the propagation and defense of internet e-mail worms. IEEE Transactions on Dependable and Secure Computing 2007; 4(2): 105–118, DOI: http://dx.doi.org/10.1109/TDSC.2007.1001.
28. Liu Z, Lee D. Coping with instant messaging worms: statistical modeling and analysis. LANMAN 2007: 15th IEEE Workshop on Local & Metropolitan Area Networks, 2007; 194–199, DOI: 10.1109/LANMAN.2007.4295998.
29. Hatahet S, Challal Y, Bouabdallah A. Bittorrent worm sensor network: P2p worms detection and containment. 17th Euromicro International Conference on Parallel, distributed and network-based processing, 2009; 293–300, DOI: 10.1109/PDP.2009.61.
30. Feng C, Qin Z, Cuthbet L, Tokarchuk L. Propagation model of active worms in p2p networks. ICYCS 2008: The 9th International Conference for Young Computer Scientists, 2008; 1908–1912, DOI: 10.1109/ICYCS.2008.237.
31. Kotenko I. Framework for integrated proactive network worm detection and response. 17th Euromicro International Conference on Parallel, distributed and network-based processing, 2009; 379–386, DOI: 10.1109/PDP.2009.52.
32. Yamada Y, Katoh T, Bista B, Takata T. A new approach to early detection of an unknown worm. AINAW ’07: 21st International Conference on Advanced Information Networking and Applications Workshops, vol. 1, 2007; 194–198, DOI: 10.1109/AINAW.2007.33.
33. Costa M, Crowcroft J, Castro M, et al. Stopping internet epidemics. International Zurich Seminar on Communications, 2006; 86–89, DOI: 10.1109/IZS.2006.1649086.
34. Chen S, Tang Y. Slowing down internet worms. In ICDCS ’04: Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS’04). IEEE Computer Society: Washington, DC, USA, 2004; 312–319.
35. Xiong J. Act: attachment chain tracing scheme for email virus detection and control. In WORM ’04: Proceedings of the 2004 ACM workshop on Rapid malcode. ACM Press: New York, NY, USA, 2004; 11–22, DOI: http://doi.acm.org/10.1145/1029618.1029621.
36. Wong C, Wang C, Song D, Bielski S, Ganger GR. Dynamic quarantine of internet worms. In DSN ’04: Proceedings of the 2004 International Conference on Dependable Systems and Networks (DSN’04). IEEE Computer Society: Washington, DC, USA, 2004; 73–82.
37. Weaver N, Hamadeh I, Kesidis G, Paxson V. Preliminary results using scale-down to explore worm dynamics. In WORM ’04: Proceedings of the 2004 ACM workshop on Rapid malcode. ACM Press: New York, NY, USA, 2004; 65–72, DOI: http://doi.acm.org/10.1145/1029618.1029628.
38. Porras P, Briesemeister L, Skinner K, Levitt K, Rowe J, Ting YCA. A hybrid quarantine defense. In WORM ’04: Proceedings of the 2004 ACM workshop on Rapid malcode. ACM Press: New York, NY, USA, 2004; 73–82, DOI: http://doi.acm.org/10.1145/1029618.1029630.
39. Athreya K, Ney P. Branching processes 1972. URL, citeseer.ist.psu.edu/athreya99branching.html.
40. Kalinkin AV. Final probabilities for a branching process with interaction of particles and an epidemic process. Theory of Probability and its Applications 1999; 43(4): 633–640, DOI: 10.1137/S0040585X97977203.
41. Gmb HA. Description of w32/nimda (w32/nimda.eml) - malware March 2008. URL, www.avira.com/en/threats/section.
42. Albert RZ, Barabási AL. Statistical mechanics of complex networks. Reviews of Modern Physics 2002; 74(1): 47–97, DOI: 10.1103/RevModPhys.74.47.
43. Pastor-Satorras R, Vespignani A. Epidemics and Immunization in Scale-free Networks, Wiley-VCH, Berlin, 2004; DOI: 10.1007/978-3-642-10625-5 26.
44. Aparicio JP, Pascual M. Building epidemiological models from R0: an implicit treatment of transmission in networks. Proceedings of the Royal Society B: Biological Sciences, vol. 274 (1609), PubMed Central, 2007; 505–512, DOI: 10.1098/rspb.2006.0057.
45. Volz E, Meyers LA. Epidemic thresholds in dynamic contact networks. Journal of the Royal Society Interface 2009; 6: 233–241, DOI: 10.1098/rsif.2008.0218.
46. Archer A, Williamson DP. Faster approximation algorithms for the minimum latency problem. In SODA ’03: Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2003; 88–96.
47. Kogan Y, Choudhury G. Two problems in internet reliability: new questions for old models. SIGMETRICS Performance Evaluation Review 2004; 32(2): 9–11, DOI: http://doi.acm.org/10.1145/1035334.1035339.
48. Blum A, Chalasani P, Coppersmith D, Pulleyblank B, Raghavan P, Sudan M. The minimum latency problem. In STOC ’94: Proceedings of the twenty-sixth annual ACM symposium on Theory of computing. ACM: New York, NY, USA, 1994; 163–171, DOI: http://doi.acm.org/10.1145/195058.195125.
49. Buzacott JA, Shanthikumar JG. Stochastic Models of Manufacturing Systems. Prentice Hall PTR: Englewood Cliffs, NJ, 07632, 1993.
50. Cui L, Li H, Li J. Markov repairable systems with history-dependent up and down states. Stochastic Models 2007; 23(4): 665–681, DOI: 10.1080/15326340701645983.