Improving Equipment Maintainability via Alarm Threshold Optimization

Gary M. Weiss
Operations Technology Center
Network and Computing Services
AT&T
[email protected]

The Operations Technology Center (OTC) is responsible for the systems used to manage and maintain the AT&T long distance network. In this paper we describe an effort within the OTC to improve the maintainability of AT&T’s 4ESS switches, by optimizing the thresholding rules within the operation support system responsible for maintaining these switches. This optimization task required us to first identify the purpose of the thresholds, so that we could specify objective criteria for evaluating these thresholds. We then exhaustively searched the space of possible thresholds in order to determine the “best” strategies. One of our key discoveries is that the thresholding strategies, no matter what the threshold value, are relatively ineffective.

The values of n, t, d, and s are stored in a database table. While the analysts can modify this table, experience has shown that these values are never modified—even though they were originally set arbitrarily, without any detailed analysis. Given all of this, we felt that a careful analysis of these threshold rules was in order. The goal of our analysis was to find good, if not “optimal”, values of n and s. This can be considered a simple form of machine learning, where past history (in the form of alarm logs) is used to tune expert system rules. Bell Atlantic used a similar approach to tune numeric parameters in their expert system for diagnosing problems in the local loop of the telecommunication network (Merz, Pazzani & Danyluk, 1996).

Introduction

Relevant Background Information

The Operations Technology Center (OTC) is responsible for the systems used to manage and maintain the AT&T long distance network. This responsibility covers network transport, switching, and signaling services. In this paper we describe an effort within the OTC to improve the maintainability of the approximately 140 4ESS switches in the AT&T network by optimizing the thresholding rules within the operation support system (OSS) responsible for maintaining these switches. The 4ESS switches are maintained by a specially tailored OSS. Various OSSs have been developed over the lifetime of the 4ESS switches to assume this responsibility. Most recently, the 4ESS-ES (4ESS Expert System) had this responsibility, although it was to have recently been replaced by ANSWER-4ESS, a similar system implemented in the R++ language (Weiss, Ros, & Singhal, 1998). The 4ESS switches send alarms to the OSS upon detecting an anomalous condition. In response, the OSS decides whether the alarm warrants further attention and, if so, generates an alert and sends it to one of the two AT&T technical control centers, where it is examined by a human analyst. Thus, the key function of the OSS is to perform alarm filtering. The analyst can ignore the alert, take remote action, or direct a technician co-located with the switch to take manual action. The decision on when to generate an alert is controlled by the expert system component of the OSS, which contains several hundred heuristic rules acquired over many years. Experience has shown, however, that a high percentage of the alerts come from a single thresholding rule, which takes the form:

This section will try to answer the following two questions: “Why have thresholding rules?” and “How are these rules currently used?”. The answer to the first question will provide us with criteria for evaluating the effectiveness of different threshold strategies and the answer to the second question will allow us to gain practical insight into the problem before beginning a detailed technical analysis. Since almost all the thresholding rules are for interrupt alarms, this paper will focus on interrupt alarm thresholding.

Abstract

if n alarms of type t occur on device d within s seconds then generate an alert
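To make the rule concrete, the following is a minimal sketch of how such a rule could be evaluated over a stream of alarms. It is not the 4ESS OSS code: the class name `ThresholdRule`, the method `observe`, and the device label are hypothetical, and the choice to restart counting after an alert is an assumption, since the paper does not say how the OSS behaves once the threshold has been hit.

```python
from collections import defaultdict, deque


class ThresholdRule:
    """Fire an alert when n alarms of one type hit one device within s seconds."""

    def __init__(self, n, s):
        self.n = n  # threshold count
        self.s = s  # threshold duration, in seconds
        # Recent alarm timestamps, kept separately per (type, device) pair.
        self.recent = defaultdict(deque)

    def observe(self, alarm_type, device, timestamp):
        """Record one alarm; return True if the rule generates an alert."""
        window = self.recent[(alarm_type, device)]
        window.append(timestamp)
        # Drop alarms that fall outside the s-second window ending now.
        while timestamp - window[0] > self.s:
            window.popleft()
        if len(window) >= self.n:
            window.clear()  # assumption: counting restarts after an alert
            return True
        return False


# Hypothetical usage: 3 interrupts in 8 hours on one (made-up) device.
rule = ThresholdRule(n=3, s=8 * 3600)
alerts = [rule.observe("interrupt", "pack-17", t) for t in (0, 100, 200, 90000)]
```

In this sketch the third alarm triggers the alert, and the fourth, arriving more than a day later, starts a fresh count.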

Why Have Interrupt Thresholding Rules?

Interrupts indicate that an anomalous event has occurred. One common cause of interrupts is a faulty connection between a circuit pack and the slot it is plugged into (this is especially a problem with aging circuit packs). An interrupt causes a circuit pack to run internal diagnostics. If the diagnostics fail, the circuit pack goes out-of-service and the 4ESS sends a DGN-FAIL (diagnostic failure) alarm to the 4ESS OSS; otherwise, the circuit pack returns to service and an interrupt alarm is sent to the OSS. If a circuit pack fails and a standby is not available, then a catastrophic failure of the 4ESS, called a phase, can occur. However, since an interrupt causes a circuit pack to be taken out of service temporarily in order to run diagnostics, it is still possible to have a phase due to interrupts on multiple devices, even if none of the devices actually fails. Thus, there are two reasons to pay attention to interrupt alarms: they may indicate a (service-affecting) failure is imminent or that service may be affected due to a large number of interrupts. Consequently we would like to be able to predict failures based on interrupt alarms as well as predict future interrupts based on past ones. In this paper we will show how interrupt thresholds perform at each of these prediction tasks.

It is worth noting that a previous effort attempted to predict phases directly (Weiss, Eddy & Weiss, 1998). However, that effort was handicapped by the extreme rarity of such catastrophic failures. Some research has also been conducted into learning to predict device failures from patterns of alarms (Weiss & Hirsh, 1998).

Data Analysis

In order to analyze our alarm data, we built a tool to accept the alarm data and compute a variety of information (some of which will be displayed in subsequent figures). We will first show how varying the threshold values affects the number of alerts that will be generated. We will then evaluate the performance of the thresholding strategy at predicting future interrupts and future device failures.

How are Interrupt Thresholding Rules Used?

The author observed and interviewed six analysts over a two-day visit to the Denver Technical Control Center, and this led to the following two main conclusions:

• Most analysts take no action upon observing an interrupt threshold alert. Rather, they wait until multiple alerts occur on a device before taking action. Thus, they “threshold” on the threshold alerts.

• Each analyst behaves differently; there are no clear and consistent guidelines for when to take action.

Effect of Threshold Values on Number of Alerts

In order to determine the impact of different thresholding strategies, we incorporated the thresholding code from the 4ESS OSS into our analysis tool and then fed in the alarm data. Table 1 shows the impact that different thresholds have on the number of alerts that are generated.

These conclusions indicate that the thresholds are not functioning as desired: they neither free the analysts from “bookkeeping” activities (i.e., manually keeping track of alarms) nor indicate that action is required. The analysts also indicated that the reliance on interrupt thresholds might be historically motivated and possibly unjustified in the current environment. In the past, behavior had been dictated by the interrupt-ratio metric, which measures the number of interrupts that occur per 100,000 calls. The goal was to keep this metric below a value of 0.05. This may have driven the wrong behavior, since this metric is not of direct business value (i.e., it does not measure costs in terms of quality of service or maintenance costs). Furthermore, this ratio was developed many years ago, when there were far more analysts in the two technical control centers and several times more technicians located on-site with the 4ESS switches.

It is important to understand what action is taken when the alert is not ignored. In this case, the first step is to determine whether there is a bad connection between the circuit pack and its slot. This is accomplished by having the on-site technician move the circuit pack to a different slot to determine whether the problem (i.e., the interrupts) “follows” the circuit pack; if it does, then the circuit pack is removed so that it can be sent to the factory. This process of rotating the circuit pack can be very time-intensive. Furthermore, factory testing often cannot reproduce the problem (due to its intermittent nature), so the bad circuit pack may ultimately be returned to service. This costly and sometimes ineffective process provides additional motivation for ensuring that the threshold values are set appropriately.

Duration \ Count      1      2      3      4      5      6      7      8      9
1 min             19070   2582    580    264    152    103     66     49     35
5 min             19070   3126    852    423    253    183    133    106     78
15 min            19070   3655   1089    546    321    227    158    127     97
30 min            19070   4001   1292    688    380    270    196    149    116
1 hr              19070   4319   1510    848    485    331    236    172    135
8 hr              19070   5471   2376   1361    899    615    445    335    261
24 hr             19070   6179   2983   1801   1210    840    630    481    378

Table 1: Threshold Simulation Table

Each threshold in Table 1 is identified by a threshold count (specified in the top row) and a threshold duration (specified in the left-most column). The threshold is hit and an alert generated if the threshold count is met or exceeded within the threshold duration. Almost all of the threshold values in the 4ESS OSS are set to 3 interrupts in 8 hours, which means that in the two-week period 2376 interrupt alerts were generated. This table helped us understand how changing the threshold values would affect the number of alerts generated. We were surprised to see that the duration did not have as much of an effect as we expected: a threshold of 2 interrupts in 1 minute would have resulted in more alerts than the current threshold of 3 in 8 hours.
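A simulation of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions, not the analysis tool described in the paper: the function names `count_alerts` and `simulate` are hypothetical, the alarm log is made up, and restarting the count after each alert is an assumption about the OSS behavior.

```python
from collections import defaultdict, deque


def count_alerts(alarms, n, s):
    """Count alerts for the rule 'n interrupts on one device within s seconds'.

    alarms: iterable of (device, timestamp_in_seconds), sorted by timestamp.
    """
    recent = defaultdict(deque)  # per-device timestamps still inside the window
    alerts = 0
    for device, ts in alarms:
        window = recent[device]
        window.append(ts)
        while ts - window[0] > s:
            window.popleft()
        if len(window) >= n:
            alerts += 1
            window.clear()  # assumption: counting restarts after each alert
    return alerts


def simulate(alarms, counts, durations):
    """Return {(duration_label, count): number_of_alerts} for every pairing."""
    alarms = sorted(alarms, key=lambda a: a[1])
    return {
        (label, n): count_alerts(alarms, n, seconds)
        for label, seconds in durations.items()
        for n in counts
    }


# Made-up alarm log, for illustration only.
log = [("A", 0), ("A", 30), ("A", 50), ("B", 60), ("A", 4000), ("B", 7000)]
grid = simulate(log, counts=[1, 2, 3], durations={"1 min": 60, "8 hr": 8 * 3600})
```

With a threshold count of 1, every interrupt produces an alert regardless of the duration, which is consistent with the constant first column of Table 1.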

Predicting Future Interrupts

In order to verify that interrupts are meaningful and might indicate a real problem (and therefore allow us to predict future interrupts), we computed statistics to verify that interrupts are not randomly distributed over time. We had our analysis tool assign each interrupt alarm a sequence number and then determined the mean time between interrupts based on their position in the sequence. Sequence numbers were assigned as follows: the first interrupt alarm was assigned a value of 1 and successive interrupts were assigned the next higher integer value until a gap of duration g was encountered; at that time the sequence number was reset to 1. Figure 1 shows, for each sequence position, the mean time in minutes to the next interrupt, given a value of 8 hours for g. Figure 2 shows the absolute number of interrupts assigned to each sequence number.
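The numbering scheme can be sketched as follows. This is a hedged illustration rather than the tool's actual code: the function names are hypothetical, the timestamps are assumed to belong to a single device and to be sorted, and excluding the wait that ends a sequence (a wait longer than g) from the averages is an assumption the paper does not confirm.

```python
from collections import defaultdict


def sequence_positions(timestamps, gap):
    """Assign sequence numbers to one device's sorted interrupt timestamps.

    Numbering starts at 1 and restarts at 1 after a gap longer than `gap`.
    """
    positions = []
    for i, ts in enumerate(timestamps):
        if i == 0 or ts - timestamps[i - 1] > gap:
            positions.append(1)
        else:
            positions.append(positions[-1] + 1)
    return positions


def mean_time_to_next(timestamps, gap):
    """Map each sequence position to the mean time until the next interrupt."""
    positions = sequence_positions(timestamps, gap)
    waits = defaultdict(list)
    for i in range(len(timestamps) - 1):
        wait = timestamps[i + 1] - timestamps[i]
        if wait <= gap:  # assumption: a wait that ends the sequence is not averaged
            waits[positions[i]].append(wait)
    return {pos: sum(w) / len(w) for pos, w in sorted(waits.items())}


# Made-up timestamps in minutes, with g = 8 hours = 480 minutes.
times = [0, 60, 90, 1000, 1010]
positions = sequence_positions(times, gap=480)
means = mean_time_to_next(times, gap=480)
```

Here the 910-minute gap after the third interrupt resets the numbering, so the fourth interrupt is again at position 1.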

The Alarm Data

The data set contains 148,886 alarms collected over a two-week period from 75 4ESS switches. Of these alarms, 19070, or 13%, are interrupt alarms. Device failures are recorded by diagnostic failure alarms, which are included in the data set. There are 1045 distinct device failures recorded in the data set; redundant failure alarms that occur before the device first goes back into service have been removed.


figures are of more practical use since they base predictions of future behavior on thresholds, which are what the current OSS is set up to measure.


Predicting Device Failures


We now evaluate the efficacy of predicting device failures based on interrupt thresholding. We first show how interrupt alarms temporally relate to device failures. Figure 4a shows the number of interrupts that occur within the 48-hour period prior to each device failure and Figure 4b shows the number that occur within the one hour period prior to the failure.

Figure 1: Mean time between interrupts in sequence (x-axis: position in sequence; y-axis: mean time to next interrupt, in minutes)

Figure 2: Number of interrupts by sequence position (x-axis: position in sequence; y-axis: number of interrupts). Bar values as extracted, in order of appearance: 10406, 3139, 1339, 821, 580, 419, 317, 254.

Figure 4a: Interrupt distribution 2 days prior to failure (x-axis: number of hours before failure; y-axis: number of interrupts)

Figure 1 demonstrates that, for the most part, as the sequence number increases, the mean time to the next interrupt decreases. This indicates that interrupts are symptoms of a real underlying problem. Figure 2 shows that the number of interrupts at each sequence position drops off relatively rapidly, so that long sequences of interrupts are rare. Figure 3 more precisely quantifies the value of interrupts at predicting future interrupts. For the various sequence positions, it shows the probability that 3 interrupts will be received in the next 1-hour and 8-hour periods. We see that this probability increases monotonically.


Figure 3: Future likelihood of interrupts (x-axis: position in sequence; series: “3 in 1 hr” and “3 in 8 hr”)

Figures similar to the one shown above have been generated but with “past” thresholds on the x-axis rather than sequence positions. That is, these figures show the probability of seeing n interrupts in the next d seconds given that n’ interrupts have been observed in the last d’ seconds. These

Figure 4b: Interrupt distribution 1 hour prior to failure (x-axis: number of minutes before failure; y-axis: number of interrupts)

Figure 4a shows that there is a very striking concentration of interrupts within the one-hour period prior to the failure. Specifically, almost eleven hundred interrupt alarms occur within one hour of a failure (i.e., on the same device). Figure 4b shows that most of these alarms actually occur within four minutes of the failure. While these results suggest the predictive value of interrupt alarms, they also suggest that predicting failures more than 4 minutes ahead of time may not be possible.

Before we can measure the effectiveness of interrupt thresholds at predicting device failures, we need to first define what is meant by a “valid” prediction. A valid prediction of a device failure is one that occurs within the prediction period associated with that failure, where this period is delimited by a “warning time” and a “monitoring time”. The warning time is the amount of lead-time required for the prediction to be useful and the monitoring time is the maximum amount of time prior to the failure for which a prediction should be considered valid. Note that a monitoring time of 10 years is not reasonable, since most devices will always fail within this time period. Given this formulation, we can precisely evaluate how well a threshold strategy does at predicting failures. Since failures are very rare and given that we can act selectively (i.e., we can decide to replace a circuit pack or not), precision and recall are more appropriate evaluation measures than predictive accuracy. Figure 5 shows how various interrupt threshold values do at predicting device failures given a 20-second warning time and an 8-hour monitoring time.
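The evaluation can be illustrated with a short sketch. The paper does not spell out its exact formulas, so the definitions here are assumptions: an alert counts as a valid prediction if it falls between the monitoring time and the warning time before a failure on the same device, precision is the fraction of alerts that are valid, and recall is the fraction of failures with at least one valid alert. The function name `evaluate` and the sample data are hypothetical.

```python
def evaluate(alerts, failures, warning, monitoring):
    """Return (precision, recall) for alerts used as failure predictions.

    alerts, failures: lists of (device, timestamp_in_seconds).
    warning, monitoring: lead-time bounds in seconds, warning <= monitoring.
    """
    def is_valid(alert, failure):
        (a_dev, a_ts), (f_dev, f_ts) = alert, failure
        # Same device, no earlier than the monitoring time before the
        # failure and no later than the warning time before it.
        return a_dev == f_dev and f_ts - monitoring <= a_ts <= f_ts - warning

    good_alerts = sum(any(is_valid(a, f) for f in failures) for a in alerts)
    caught = sum(any(is_valid(a, f) for a in alerts) for f in failures)
    precision = good_alerts / len(alerts) if alerts else 0.0
    recall = caught / len(failures) if failures else 0.0
    return precision, recall


# Made-up alerts and failures, for illustration only.
alerts = [("A", 100), ("A", 995), ("B", 500), ("C", 50)]
failures = [("A", 1000), ("B", 40000), ("C", 200000)]
precision, recall = evaluate(alerts, failures, warning=20, monitoring=8 * 3600)
```

In this example only the first alert on device A is valid: the second arrives less than 20 seconds before the failure, and the alerts on B and C come more than 8 hours ahead of their failures.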

Conclusions

Our analysis showed that while interrupt thresholds may be somewhat effective at predicting future interrupts (based on Figure 3), they are poor at predicting actual failures. This holds true no matter what the actual threshold value. Unfortunately, because specific cost-benefit information is not available, and because we have no way of predicting the likelihood of phases based on the number of interrupts that occur, it is not possible to pick the optimal threshold value. Nonetheless, we have much more quantitative information than was previously available, and feel confident that the reliance on the existing thresholding strategy is unproductive. Rather than recommending that we totally eliminate the use of thresholds (which would likely run into great resistance), we have decided to take a slightly more conservative strategy. Our recommendation is to increase the thresholds significantly, to 6 interrupts in 1 hour. While this will not improve the ability to predict failures, it will reduce the number of generated alerts by a factor of about 7 (from 2376 to 331 in Table 1).

Unfortunately, within days of making our recommendation, a decision was made to migrate (i.e., outsource) the maintenance of the 4ESS switch from our internal OSS to one supplied by an external vendor. Even though this migration will take some time to complete, it was decided that no additional effort should be spent on the old OSS. Consequently our recommendations will not be implemented. Nonetheless, we feel that the issues addressed in this paper, and the lessons we learned, are of clear practical value and have widespread applicability within AT&T.

Figure 5: Efficacy of thresholds at predicting failures (x-axis: recall; y-axis: precision; one curve per threshold duration, 4 hr and 1 day; points labeled by threshold, e.g. “3 in 4 hrs” and “6 in 1 day”)

Each point on the two curves represents a specific threshold value (each curve corresponds to a single threshold duration and the points on each curve specify the threshold count). Depending on the importance of precision versus recall, we can determine the best threshold strategy. Many strategies were evaluated, but only the two best are shown. In some cases one curve is always above another and in these cases we can say that one strategy always outperforms the other. One conclusion is that the existing threshold value of 3 interrupts in 8 hours is not an ideal threshold value (although it is not too far from optimal). The recall never reaches 60% for the curves in Figure 5 because for more than 40% of the cases, there isn’t a single interrupt alarm within the prediction period of the failure; this alone proves that interrupt alarms are limited in their ability to predict failures. Overall, based on very low precision values (less than 20%), these results indicate that thresholding is not a very effective predictor of failures, and perhaps should play a much less important role than it does in our current systems.

Different values of the monitoring and warning times will lead to different results. Our experiments have shown that increasing the monitoring time leads to moderately better predictions and increasing the warning time leads to significantly poorer predictions. Unfortunately, we have not been able to get absolute confirmation that a warning time of 20 seconds is appropriate; if the value turns out to be higher, then the thresholding rules will perform even more poorly at predicting failures than is indicated in Figure 5.

References

Merz, C. J., Pazzani, M. J., and Danyluk, A. P. 1996. Tuning Numeric Parameters to Troubleshoot a Telephone-Network Loop. IEEE Expert, 11(1), 44-49.

Weiss, G. M., and Hirsh, H. 1998. Learning to Predict Rare Events in Event Sequences. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA.

Weiss, G. M., Ros, J., and Singhal, A. 1998. ANSWER: Network Monitoring using Object-oriented Rules. In Proceedings of the Tenth Conference on Innovative Applications of Artificial Intelligence, AAAI Press.

Weiss, G. M., Eddy, J., and Weiss, S. 1998. Intelligent Telecommunication Technologies. In Knowledge-Based Intelligent Techniques in Industry (chapter 8), L.C. Jain, ed., CRC Press.
