INTERNATIONAL JOURNAL OF NETWORK MANAGEMENT Int. J. Network Mgmt 2005; 15: 321–334 Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/nem.576

Toward best maintenance practices in communications network management

By Faouzi Kamoun*,†

Best maintenance practices in communications network management are benchmarking standards that, if carefully implemented, will enhance the integrity and reliability of communications networks and reduce their maintenance costs. This paper defines best maintenance practices in communications network management within a concise framework encompassing measurable performance-level goals as well as the methods and procedures needed to achieve these goals. The best maintenance practice recommendations of this paper cover many segments of communications networks and services, including wireline and wireless networks as well as multiple network spaces and technologies. The paper also outlines certain recent trends in network maintenance and provides specific recommendations for action to be taken to ensure that best network maintenance practices are implemented and maintained.

Copyright © 2005 John Wiley & Sons, Ltd.

Introduction

Traditionally, network operators and administrators have focused their attention primarily on the operational, administrative and provisioning aspects of network management, while deeming the maintenance aspect of their communications networks a necessary burden. As a result, several studies (e.g. reference 1) have confirmed that the severe traffic congestion, poor performance and unacceptable availability exhibited by many communications networks are often due to poor network maintenance. Yet everyone agrees that, being the backbone of corporations, communication networks need to be kept operating at peak performance around the clock, 24 hours a day, 7 days a week.

In fact, as enterprises and customers count on the corporate network’s availability, reliability and quality of service, any compromise in these areas will lead to both decreased revenues and increased costs. As an example, the Telecommunications Industry Association reported that in 2001 US enterprises spent around $34 billion on the maintenance and repair of data and voice gear [2].

The past few years have witnessed a paradigm shift with respect to the general attitude of corporations toward communications network maintenance. Many organizations have begun to realize the need to integrate their network maintenance planning and operations with their corporate and business strategies, to ensure that these maintenance functions are aligned with the mission of the organization. Some of the more important factors that have contributed to this change in attitude toward communications network maintenance include:

• ever-increasing pressures to cut down on rising costs of network operations, administration, maintenance and provisioning (OAM&P), due to increasing global competition;
• the greater role a corporate network plays in supporting corporate and business strategies, fuelled in turn by the greater role the Internet, intranet and extranet play in daily business processes;
• the rising opportunity cost of not delivering reliable communications services, which can force companies to lose many online customers and jeopardize relationships with suppliers and business partners;
• additional complexities introduced by (1) fast-moving and intricate networking technologies, (2) new applications with stringent security requirements, (3) heterogeneous network environments employing multivendor equipment, and (4) global telecommunications systems with dispersed yet integrated equipment components;
• the steadily increasing strain on network devices due to a larger number of users and the emergence of real-time and bandwidth-hungry applications such as streaming video, graphics and converged multimedia services.

In light of the above factors, it becomes evident that strict adherence to concrete, attainable best practices is a fundamental requirement for effective communications network maintenance. In fact, in the absence of well-documented and institutionalized best practices, the organization’s network maintenance effectiveness will be in limbo, lacking a specific sense of direction. It is the aim of this paper to explore the most important academic and empirical issues that can help achieve best maintenance practices for communications networks.

The remainder of this paper is organized as follows: in the next section we define best practices in communications network management in the context of benchmarking standards and 13 associated methods and strategies. In the third section we revisit these 13 methods and for each of them we explore certain best practices specifically vectored toward the maintenance of communications networks. In the fourth section we provide recommendations for what needs to be done to ensure that best practices are followed. The final section provides a summary of the paper.

Faouzi Kamoun holds B.Eng, M.A.Sc, and Ph.D degrees in Electrical and Computer Engineering from Concordia University, Canada, and an MBA degree in Management from McGill University, Canada. He joined the College of Information Technology of Dubai University College in September 2002, as an Assistant Professor. Previously he had been with Nortel Networks (Montreal, Canada) since 1995, where prior to his departure in 2002, he was a Senior Technical Advisor in the Hi-CAP Optical Networks Division. His research interests are in the management, modeling and performance analysis of next generation communication networks. He was the recipient of Nortel Networks CEO top-talent awards in 2000 and 2001, the Concordia University Graduate Fellowship from 1991 to 1993, and the Electrical Engineering medal for most outstanding graduating student at Concordia University in 1988.

*Correspondence to: F. Kamoun, College of Information Technology, Dubai University College, PO Box 14143, Dubai, United Arab Emirates.
†E-mail: [email protected]

Best Maintenance Practices Defined

While the term ‘best practices’ can mean different things to different people, there is a general consensus that this term involves tracking the practices of other companies in order to compile a list of benchmarking standards that constitute stretch goals. The most important thing to keep in mind in this regard is that, to be meaningful, best practices should have three main characteristics. First, well-articulated best practices should be precise and measurable. Failure to comply with this requirement will prevent companies from assessing their progress in attaining them. A second characteristic of well-articulated best practices is that they should be challenging, but realistic and preferably provable. A third characteristic of well-drafted best practices is that they should specify time constraints. These deadlines act as catalysts for motivation and put a sense of urgency into the process of achieving best practices. Further, these best practices do not become effective until implemented, and thus the transition towards them requires careful planning, considerable effort and, most importantly, a strong commitment from top management and other levels of the organization.

To compile a list of benchmarking maintenance standards for communications networks, one has to address two questions: namely what needs to be improved and how to initiate these improvements. To address the first vital question, we can define 13 generic yet integrated functional areas that constitute the major facets of best practice in communications network maintenance. These areas, previously defined in the context of manufacturing plant maintenance (see, for example, references 3–6), will be used herein within the context of communications network management. These 13 functional areas, depicted in Figure 1, provide a general framework under which we can:

• evaluate the current status of network maintenance practices and policies;
• define benchmarking standards for best network maintenance practices;
• diagnose the current status of network maintenance with respect to the above goals;
• determine the difference between the current status and the goals (the ‘gap’).

Once the gap is identified, one can use these 13 functional areas as a general framework to address the second important (how) question by finding those operators best suited to closing the gap.

Methods and Strategies for Best Maintenance Practices

In this section, based on the 13 functional areas depicted in Figure 1, we outline a number of network maintenance practices that can be used as benchmarks.

Figure 1. Thirteen functional areas covering maintenance methods and strategies: continuous improvement; scheduling, cooperation and coordination; predictive maintenance procedures; post-mortem and root-cause analysis; preventive maintenance procedures; managing contractors; computerized maintenance management systems (CMMS); financial optimization; inventory control, asset management and procurement; organizational structure and maintenance policies; maintenance training and skills development; work flow and control; leadership/executive support.

—Network Maintenance Training and Skills Development—

Continuous training and retraining of network maintenance technicians should be an integral part of strategic planning for the maintenance department. First, the current knowledge level and skills of network maintenance department employees must be assessed through written and practical tests. Then the required skills for world-class network maintenance must be identified. Finally, the gap between required skills and available skills must be evaluated to sketch a roadmap for the training needed to close the gap. Some of the key maintenance training skills that need special reinforcement are highlighted in Table 1.

Table 1. Selected maintenance training skill requirements

Type of maintenance training: systems training; technical training; equipment-specific training; process training

Maintenance skills to be developed:
• Fluid communication and coordination among all personnel involved in maintenance operations
• Familiarity with the efficient use of a computerized maintenance management system (CMMS)
• Technical education related to the deployed technology, its risks and benefits
• A thorough understanding of the business impact of service interruption due to equipment failure
• Enhanced analytical and problem-solving skills via a common approach methodology
• Teamwork training, including communications/leadership skills and workload management
• A thorough understanding of equipment hardware and software configurations, and associated redundancy (in power supplies, timing units, processors, switching fabrics, interfaces)
• Quick and systematic trouble detection, fault isolation and recovery
• Familiarity with the policies and procedures related to backup of critical databases in removable storage devices
• Familiarity with the equipment supplier documentation related to maintenance procedures, including software upgrades and reconfigurations
• Familiarity with proper usage of network test equipment for performance monitoring and fault isolation
• Full awareness of internal conventions used to label equipment, cables, patch panels and inventory storage areas
• Full knowledge of maintenance safety procedures and internally adopted escalation processes

Network operators and service providers should establish ‘partnerships’ with their telecom equipment suppliers to develop and refine training programmes for their products. These training programmes must be revised and updated as new software or hardware features are introduced and should cover both local and remote maintenance operations. The training programme should also cover the use of documentation related to maintenance and troubleshooting procedures as provided by the equipment supplier. Further, companies that follow best practices in network maintenance often require a minimum level of training courses and field maintenance experience before allowing personnel to carry out maintenance activities on their networks [7]. This is particularly important when a new technology is being introduced.

As a recent trend in network maintenance, periodic simulated exercises have become an integral part of the maintenance personnel’s continuous training programme [7]. These maintenance simulation exercises, conducted without prior notice, can also help to assess the readiness of the maintenance department for coping with emergency situations. A post-mortem meeting with the maintenance team can bring to light lessons learned during the simulated exercise and can help outline new measures for improvement.

Another trend is training for error-provoking factors in network maintenance. This includes educating maintenance staff about limitations of human performance and short-term memory, the impact of pressure and stress, and types of maintenance-induced failures due to human error [8].


—Leadership and Executive Support—

The transition toward implementing best maintenance practices in communications network management does not happen overnight. Rather, it requires time, effort, planning and, above all, strong commitment from all levels of the organization, starting with top management. In particular, strong leadership that eloquently articulates firm commitment and support for world-class maintenance standards is a vital prerequisite to the success of any shift toward network maintenance excellence. Once this organizational shift is endorsed by the maintenance organization, a plan of action should be developed in consultation with the maintenance personnel, and then submitted to top management for review. The final approved plan must be promptly initiated, executed and continuously audited and monitored for control.

Strong leadership is also a prerequisite for the adoption of a comprehensive and quality-oriented maintenance strategy, based on such approaches as reliability centred maintenance (RCM) or planned maintenance optimization (PMO) [9]. Further, besides its role in providing direction and focus, a strong executive sponsorship enables the network maintenance programme to overcome potential resource barriers that may impede its successful implementation.

—Organizational Structure and Maintenance Policies—

Designing proper organizational structure and control is vital for creating and sustaining a world-class network maintenance programme, since this structure (1) helps coordinate maintenance activities among network operations centre (NOC) personnel, reliability engineers and maintenance staff, (2) provides network maintenance staff incentives for meeting world-class maintenance standards and (3) ensures proliferation of the maintenance philosophy within the entire organization. A good organizational structure also reflects ‘ownership’ and ‘accountability’ of maintenance roles, and should ideally separate maintenance planning from the maintenance execution function to preclude conflicts of interest.

Equally essential to the organizational structure are well-documented maintenance policies, including those related to disaster recovery planning and return/repair services. These policies should cover regulatory behaviour, safety procedures, network escalation rules and maintenance personnel job definition/requirements, among others. Maintenance policies should also outline well-articulated, accurate naming conventions for devices, interface ports, patch panels and racks.

These policies should also stress that network maintenance is not the sole responsibility of the network maintenance department and that all network stakeholders (e.g. users, network installers, technicians, network operations personnel, administrators and building maintenance staff) have some ‘ownership’ and involvement in enhancing and maintaining the reliability of the equipment. For example, network installers should perform careful inspection and unit/system testing of networking equipment and cables before system turn-up. Network technicians should report any observed abnormal behaviour such as uncovered shelves, noisy processors and abnormally overheated/vibrating equipment by completing the appropriate maintenance work order. Further, the building’s maintenance personnel should periodically inspect the main power distribution system, along with the heating, ventilation and air-conditioning systems of the facility.

—Managing Contractors—

An important consideration in implementing best network maintenance practices is to decide whether network maintenance should be developed ‘in-house’ or whether it should be outsourced. Outsourcing to a third party can be arranged with a specialized maintenance firm or with the equipment supplier itself. A strategic (as opposed to ad hoc) outsourcing of maintenance services can potentially help in reducing staff requirements and provides relief from the burden of recruiting, training and retaining a network maintenance workforce. It also enables organizations to refocus their resources on strategic business activities, thus reducing operating costs and increasing revenues. The final decision should be based on many factors, such as potential cost savings, strategic importance of the network maintenance function, quality of outsourced maintenance services, switching costs, the calibre of in-house staff and the overall stability of the communications network environment.

Nonetheless, when opting to outsource maintenance activities to an external service provider, it is important to check whether this service provider (1) has a proven track record, backed by external referrals, (2) has past experience in maintaining multivendor equipment on a global scale, (3) has a strong financial profile and (4) has the resources and capabilities to perform secure remote monitoring and fulfil network availability service requirements.

When planning to contract network maintenance to a potential equipment supplier, it is important to bear in mind that good management of the maintenance contractors starts by asking the right questions in the request for proposal (RFP). As a precaution, bidders should be requested to outline their own proposal for the preventive and corrective maintenance of the proposed network. The proposal should also highlight the mechanisms used to coordinate corrective maintenance tasks and the average response time. Further, network operators can either opt to outsource all maintenance tasks to equipment suppliers, or request 24/7 back-end support service for the in-house maintenance staff.

As part of the RFP, equipment suppliers should also be asked to indicate their willingness to comply with certain other important maintenance-related requirements, such as:

• guaranteeing a predefined maximum mean time to repair (MTTR);
• dispatching personnel to perform corrective maintenance upon request;
• providing a detailed description of the proposed online technical support;
• providing all necessary tools and spare parts required to correct any fault;
• specifying and describing the networking infrastructure covered by the maintenance contract (equipment, UPS system, wirings, timing sources, power distribution panels, cable distribution panels, racks, A/C systems, etc.);
• specifying the guaranteed maximum spare-parts delivery time;
• describing the mechanisms used by the repair and replacement service for all hardware faults.

It has been common practice for equipment vendors to increase contractual maintenance fees by up to 15% for old telecom equipment, as a natural move to decrease their service costs and encourage customers to upgrade to newer products. In this case, network operators may either opt to contract another third-party firm to reduce their network maintenance cost, or absorb the fee increase and carry on with the manufacturer for the sake of better service and minimal disruption to existing modes of operation.

—Financial Optimization—

When targeting financially optimized maintenance decisions, the maintenance organization should take into account the fact that seamless network operation is often unachievable and that, beyond a certain point, maintenance costs can escalate out of proportion compared with service quality gains [10]. As a result, best-in-class organizations use sophisticated data-gathering and statistical techniques to collect and process accurate data concerning their networking assets. This includes information related to mean time between failures (MTBF), mean time to repair (MTTR), end-of-life (EOL), historical performance monitoring counts, service level agreement (SLA) parametric values and downtime cost. The resulting information is then used to generate financially optimized assessments and forecasts outlining the appropriate frequency of scheduled backups, the equipment obsolescence level, timetables for asset replacement or upgrades, and the appropriate number of spare parts to be kept in the inventory. The same information can also be used to forecast the right size for the maintenance organization and to ensure that the maintenance backlog does not reach an unmanageable size.

Pareto analysis can also be used to reduce maintenance costs or improve equipment availability by focusing on those failure codes most frequently responsible for downtime. Very recently, logarithmic scatterplots and jack-knife diagrams [11] have been proposed as better alternatives to classify faults, determine downtime priorities, facilitate root cause analysis and prepare maintenance budgets. Another common practice for decreasing maintenance costs and improving maintenance effectiveness is the adoption of centralized maintenance organizations that rely on extensive use of automated remote reporting, monitoring and control (see, for example, reference 12).
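As an illustration of the Pareto analysis mentioned in this section, the following minimal Python sketch ranks failure codes by accumulated downtime and keeps those that together account for a chosen share of the total. The failure codes, the downtime figures, the 80% threshold and the function name `pareto_failure_codes` are all hypothetical and serve only to show the idea; a real analysis would draw on outage records from the organization's own systems.

```python
from collections import defaultdict


def pareto_failure_codes(outages, threshold=0.8):
    """Return (code, downtime, cumulative share) for the failure codes that
    together account for at least `threshold` of total downtime."""
    downtime_by_code = defaultdict(float)
    for code, minutes in outages:
        downtime_by_code[code] += minutes

    total = sum(downtime_by_code.values())
    # Rank failure codes from most to least downtime.
    ranked = sorted(downtime_by_code.items(), key=lambda kv: kv[1], reverse=True)

    selected, cumulative = [], 0.0
    for code, minutes in ranked:
        # Stop once the codes already selected reach the threshold.
        if total == 0 or cumulative / total >= threshold:
            break
        cumulative += minutes
        selected.append((code, minutes, cumulative / total))
    return selected


# Hypothetical outage log: (failure code, downtime in minutes).
outages = [
    ("PWR", 120), ("FIBER_CUT", 300), ("SW_UPGRADE", 45),
    ("FIBER_CUT", 180), ("CARD_FAIL", 60), ("PWR", 30), ("CONFIG", 15),
]
top_codes = pareto_failure_codes(outages)
for code, minutes, share in top_codes:
    print(f"{code:12s} {minutes:6.0f} min  cumulative {share:.0%}")
```

With this sample data, two of the five failure codes account for more than 80% of the recorded downtime, which is the kind of concentration a Pareto analysis is meant to expose.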

—Inventory Control, Asset Management and Procurement—

Efficient inventory and procurement plans aim to provide the right number of spare components, in the right place at the right time. Since a very common cause of extended outage periods is the unavailability of spare circuit packs or test equipment needed to restore traffic, it is essential to establish a well-managed and thoroughly documented process to (1) track the location of all spare units, (2) allocate, secure, ship and deploy spare equipment, and (3) speed up the order of spare equipment from suppliers when it is not locally available. In addition, to optimize the usage of spare parts in emergency situations, it is a good practice to consider allocating ‘hot spares’ for critical units that represent single points of failure in the network. Recall that hot spares are circuit packs which are plugged into network elements, as opposed to being stored in cabinets [7].

Another world-class maintenance practice is to try as much as possible to shift the maintenance inventory cost to equipment suppliers, by adopting a just-in-time (JIT) inventory policy through negotiated basic order agreements with suppliers. These agreements should also cover guaranteed lead time for certain critical maintenance inventory items. Auditing and real-time monitoring of the warehouse inventory levels must be performed regularly to identify excess maintenance inventory items and ensure adequate spare levels for emergencies. Spare parts must be bar-coded and entered into the computerized maintenance management system (CMMS) so that maintenance personnel can easily locate them. The CMMS should be capable of automatically triggering purchase reorders based on spare units supply lead-time and current stock levels. In addition, access to the warehouse should be secure and authorized only for designated personnel in the warehouse department. Faulty cables and equipment should be carefully tagged and, if possible, sent for repair. Non-repairable cables should be scrapped and destroyed to avoid having them redeployed by mistake.

—Work Flow and Control—

The use of a work order system for the initiation, planning, scheduling, executing, recording and tracking of network maintenance activities is an integral part of best maintenance practices in world-class organizations [4]. More important than the work order system itself is the need to use it comprehensively to record and keep track of all maintenance activities. The maintenance procedure captured in the work order system should cover the following items:

• safety issues/precautions related to maintenance tasks;
• a list of required tools and equipment;
• detailed maintenance steps based on best practices;
• a pre-check procedure to ensure that the equipment is in a known state and is ready for maintenance, as well as a post-check procedure to confirm that maintenance has been successfully completed;
• rollback (undo) procedures if applicable;
• level of service interruption (if any) resulting from maintenance activity;
• any required supporting documents (such as trouble clearing and module replacement or alarm reference guides);
• contact details of the supplier’s remote technical support service, for emergencies;
• estimated maintenance window duration and total efforts required.

Work orders should also be reviewed for approval, prioritized according to their urgency, and then scheduled and assigned based on available resources.

Though most network maintenance departments strive to implement best practices, some of them neglect the importance of fully documenting these practices. This negligence can be ascribed to the belief that maintenance personnel have accumulated the technical expertise and experience to successfully perform these best practices with minimal documentation. This situation can become a liability if experienced network maintenance engineers move to another company or retire. For this reason, it is essential that detailed work orders and maintenance procedures capture the accumulated maintenance experience as well as best practices. This will ensure continuity in delivering world-class network maintenance services.

Numerous studies (e.g. reference 13) have shown that maintenance-induced failures due to human error are inevitable. This important fact should be borne in mind when documenting maintenance task instructions. Reason and Hobbs [13] classified maintenance-induced failures as those related to recognition failures, memory failures, skill-based slips, rule-based mistakes, knowledge-based errors and deliberate violations. In light of these factors, carrying out a proactive risk assessment to identify the likelihood of human error and refining the maintenance task instructions accordingly are strongly recommended.

A common pitfall that often results in outages is the reluctance of maintenance technicians to follow documented instructions step by step, and their reliance instead on a customized, abridged version of the recommended maintenance procedure. Such short-cuts should be discouraged and flagged as unacceptable practices. Additional tips for writing effective maintenance work instructions include (1) keeping in mind the person who will follow the instructions, (2) grouping complex and logically related tasks into coherent phases, (3) using simple and consistent language, (4) communicating risk-avoidance steps clearly and strongly, and (5) using graphic illustrations whenever applicable [8].
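The work order contents listed in this section can be pictured as a simple record with a completeness check. The sketch below is only one possible representation: the class `WorkOrder`, its field names, the `urgency` scale (1 = most urgent) and the rule that incomplete or unapproved orders are not scheduled are assumptions made for illustration, not features of any particular work order system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkOrder:
    """Hypothetical work-order record; field names are illustrative only."""
    title: str
    urgency: int                      # assumed scale: 1 = most urgent
    safety_precautions: List[str]
    tools: List[str]
    steps: List[str]
    pre_checks: List[str]
    post_checks: List[str]
    rollback_steps: List[str] = field(default_factory=list)  # if applicable
    service_impact: str = "none"
    supporting_docs: List[str] = field(default_factory=list)
    supplier_support_contact: str = ""
    window_minutes: int = 0
    approved: bool = False

    def missing_items(self) -> List[str]:
        """Return the names of mandatory sections that are still empty."""
        mandatory = {
            "safety_precautions": self.safety_precautions,
            "tools": self.tools,
            "steps": self.steps,
            "pre_checks": self.pre_checks,
            "post_checks": self.post_checks,
            "supplier_support_contact": self.supplier_support_contact,
            "window_minutes": self.window_minutes,
        }
        return [name for name, value in mandatory.items() if not value]


def schedule(orders: List[WorkOrder]) -> List[WorkOrder]:
    """Keep approved, fully documented orders and sort them most urgent first."""
    ready = [o for o in orders if o.approved and not o.missing_items()]
    return sorted(ready, key=lambda o: o.urgency)


card_swap = WorkOrder(
    title="Replace faulty line card",
    urgency=1,
    safety_precautions=["Wear an anti-static wrist strap"],
    tools=["Spare line card", "Optical power meter"],
    steps=["Switch traffic to the protection path", "Swap the card"],
    pre_checks=["Confirm protection path is alarm-free"],
    post_checks=["Confirm new card is in service"],
    rollback_steps=["Reinsert the original card"],
    supplier_support_contact="supplier hotline (placeholder)",
    window_minutes=60,
    approved=True,
)
undocumented = WorkOrder(
    title="Software upgrade",
    urgency=2,
    safety_precautions=[],
    tools=[],
    steps=["Load new release"],
    pre_checks=[],
    post_checks=[],
    approved=True,
)

print(undocumented.missing_items())
print([o.title for o in schedule([undocumented, card_swap])])
```

Rejecting incomplete orders at scheduling time is one way to enforce the point made above that the work order system must be used comprehensively rather than relying on undocumented expertise.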

—Computerized Maintenance Management System (CMMS)—

The efficient and comprehensive use of a CMMS is an essential prerequisite for properly planning, tracking, controlling and evaluating network maintenance transactions. When fed with meaningful and accurate data, the CMMS can play a key role in optimizing work-order management, including enhanced planning, scheduling and cost control. However, maintenance organizations should not count on the CMMS to resolve poor maintenance processes. They should instead strive to improve these processes before automating them via a CMMS.

Some of the critical success factors for a CMMS include ease of use, a moderate learning curve, management support and built-in graphics capabilities [14]. In addition, some of the obstacles that may hinder the efficient implementation of a CMMS include lack of CMMS goals and planned outcomes, lack of integration, lack of a comprehensive maintenance strategy, inaccurate data and lack of accountability. Some recommendations to overcome these obstacles can be found in reference 14.

—Post-Mortem and Root-Cause Analysis—

Network performance degradation and failures can often be rooted in human errors or in self-induced equipment failure. In either case, for quality assurance, it is imperative that once corrective maintenance is completed a post-mortem analysis is performed to investigate the root cause of the failure and suggest improvements. These findings and recommendations should be fully documented, and shared with maintenance personnel and equipment suppliers, if applicable. For example, a root-cause analysis of the abnormal behaviour of recently installed data routers may reveal potential reliability issues with a specific interface. The findings of the analysis should then be communicated to the supplier for further investigation and should be followed up for potential maintenance patches, updates or workarounds. Post-mortem analysis can also help the network maintenance organization to elaborate a list of ‘dos’ and ‘don’ts’ for maintenance.

To optimize the learning process from past maintenance-related failures due to human error, it is essential to foster an organizational culture where maintenance personnel are encouraged to freely report maintenance-induced mistakes and failures, whatever their severity. Indeed, it has been found [15] that a high level of failure reporting is one of the main characteristics of world-class maintenance organizations [8].

—Continuous Improvement—

Carrying out internal audits, setting performance indicators and benchmarking standards for best network maintenance practices are all critical steps toward exploring opportunities for improvement. World-class maintenance organizations constantly question the status quo as they continually endeavour to find ways to streamline their maintenance operations, usually by effecting small, incremental changes. Further, the maintenance organization needs to be kept abreast of recent developments in network maintenance in order to assess their potential adoption. These include, for instance, migration toward integrated web-based maintenance tools, customer service-oriented and business-oriented management, and emerging management architectures such as the ITU-T model for telecommunications management network (TMN) [16], which aims to support integrated management across multiple public service networks such as SONET, ATM, Frame Relay, mobile and telephone networks.

—Scheduling, Cooperation and Coordination— Most network maintenance tasks require careful coordination and cooperation among the network operating centre (NOC) personnel, maintenance engineers and warehouse personnel. For this reason, it is important that the maintenance personnel be acquainted with the communication processes, maintenance guidelines and escalation policies. Preventive and reactive network maintenance tasks should be prioritized based on their urgency, which is often embedded in the severity of the associated maintenance alarms, as shown in Table 2.

Since an equipment failure can generate additional unwanted maintenance alarms, which in turn can trigger unnecessary maintenance activities, there should be ways of properly interpreting, screening and preventing these unwanted alarms. These would include the proper use of event correlation techniques, or simply referring to the troubleshooting section of the manuals that came with the equipment. Network maintenance tasks should be carefully scheduled during appropriate maintenance windows when potential traffic disruption would have minimal effect on users. Further, end users should be notified of scheduled maintenance activities before they go into effect. An e-commerce site, for instance, should have a banner to notify customers of planned maintenance activities and warn them about possibly degraded or unavailable services. It is also good practice to assign a flight director to manage maintenance notices to make sure that these are communicated on time. The flight director will also be responsible for the planning, scheduling, communication and control over the progress of the maintenance task at hand.17 Cooperation and coordination in the context of international maintenance organizations can be a real challenge if these operations are not properly planned. For this reason, various standard international documents (e.g. ITU-T M.7518) recommend the appointment of a ‘technical service’ authority within the central administration. The mandate of this authority would include (1) making international agreements between administrations on technical and engineering aspects of maintenance, (2) allocating responsibilities to maintenance units within the administration, (3) outlining maintenance policies and overseeing their implementation, and (4) advising other administrations of planned service interruptions in their own countries due to maintenance activities.
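The alarm screening discussed in this subsection can be illustrated with a very simple form of event correlation: when a device fails, alarms raised by the devices that depend on it are treated as symptoms rather than as separate faults. The Python sketch below is only an illustration under stated assumptions. The `upstream` map (each device and the device it depends on) and the device names are hypothetical, and the function is not taken from any particular NMS product.

```python
def root_cause_alarms(alarming, upstream):
    """Return the alarming devices that have no alarming ancestor.

    alarming: set of device names currently in alarm
    upstream: dict mapping a device to the device it depends on
              (hypothetical topology, assumed to be known in advance)
    """
    roots = set()
    for device in alarming:
        parent = upstream.get(device)
        seen = {device}
        suppressed = False
        # Walk toward the network core; stop at the top or on a cycle.
        while parent is not None and parent not in seen:
            if parent in alarming:
                suppressed = True  # alarm is explained by an upstream fault
                break
            seen.add(parent)
            parent = upstream.get(parent)
        if not suppressed:
            roots.add(device)
    return roots


# Hypothetical example: one distribution router feeding two access switches.
upstream = {
    "access-sw-1": "dist-rtr-1",
    "access-sw-2": "dist-rtr-1",
    "dist-rtr-1": "core-rtr-1",
}
alarming = {"access-sw-1", "access-sw-2", "dist-rtr-1"}
print(sorted(root_cause_alarms(alarming, upstream)))  # ['dist-rtr-1']
```

In this example only the distribution router would generate a maintenance activity; the two access-switch alarms are screened out as consequences of the same failure.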

Maintenance alarm type                 Type of action
Prompt maintenance alarm (PMA)         Immediate maintenance activity is usually required
Deferred maintenance alarm (DMA)       Immediate action is not required
Maintenance event information (MEI)    Non-service affecting event notification with no immediate action required

Table 2. Maintenance alarm information12
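The prioritization rule expressed by Table 2 can be sketched as a small priority queue in which prompt alarms are served before deferred alarms, and deferred alarms before informational events. This is a minimal sketch, not a prescribed scheduling method: the numeric ranking of the three alarm types and the task descriptions are assumptions made for illustration.

```python
import heapq
from enum import IntEnum


class AlarmType(IntEnum):
    """Alarm types of Table 2; a lower value means more urgent (assumed ranking)."""
    PMA = 0  # prompt maintenance alarm
    DMA = 1  # deferred maintenance alarm
    MEI = 2  # maintenance event information


def order_tasks(tasks):
    """Order (alarm_type, description) pairs by urgency.

    Tasks with the same alarm type keep their arrival order, because the
    arrival sequence number is used as a tie-breaker in the heap.
    """
    heap = [(alarm, seq, desc) for seq, (alarm, desc) in enumerate(tasks)]
    heapq.heapify(heap)
    ordered = []
    while heap:
        alarm, _, desc = heapq.heappop(heap)
        ordered.append((alarm.name, desc))
    return ordered


# Hypothetical task list, in order of arrival.
tasks = [
    (AlarmType.MEI, "Record fan-speed change"),
    (AlarmType.PMA, "Replace failed line card"),
    (AlarmType.DMA, "Swap degraded power supply"),
]
for name, desc in order_tasks(tasks):
    print(name, desc)
```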

—Predictive Maintenance Procedures— Part of the maintenance organization efforts should be vectored toward the use of predictive tools to help in monitoring the ‘health’ of the communications equipment. For this purpose, the NOC should make full use of management system capabilities imbedded in the communications and network management protocols. For instance, for data networks, the Simple Network Management Protocol (SNMP) allows users to probe managed devices and retrieve various performance statistics. Historical performance data stored in the NMS database can be analysed to determine performance trends, predict potential future degradations and schedule preventive maintenance tasks. It is generally recommended19 that historical records should be retained for at least 12 months to allow the identification of persistent degraded and faulty conditions. A well-crafted predictive maintenance process, whether based on statistical or analytical identification methods, should cover three concurrent levels of supervision. These are the supervision process for anomalies (short period), the defect supervisory process (medium period) and the malfunction supervisory process (long period).12 The network engineer should identify those critical devices that need special performance monitoring and provision thresholds on such items as CRC error rate, packet-discard rates, fan status and selected test-point temperatures. Once these thresholds are set, the NMS can report any threshold-crossing alerts to the network engineer. This type of early detection via threshold-crossing alerts helps maintenance engineers plan proactive maintenance activities to avoid potential problems or improve performance. It is also crucial that the network engineer be familiar with the various enterprise-specific SNMP traps supported by equipment vendors. These traps, when enabled, enhance the process of fault monitoring, detection and notification. 
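The use of historical performance data to predict future degradation can be illustrated with a simple linear trend. The sketch below fits a least-squares line to past samples of a monitored metric and estimates how many days remain before a provisioned threshold would be crossed. It is a sketch under stated assumptions rather than a recommended forecasting method: the metric, the sample values, the threshold and the function names are all hypothetical, and a real NMS may use more elaborate statistical models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) samples of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(days, values, threshold):
    """Estimate days from the last sample until the trend reaches the threshold.

    Returns None when the metric is flat or improving, and 0.0 when the
    trend line is already at or above the threshold.
    """
    slope, intercept = fit_line(days, values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical daily CRC error counts on one interface.
days = [0, 1, 2, 3, 4]
crc_errors = [10, 12, 14, 16, 18]
print(days_until_threshold(days, crc_errors, threshold=30))  # 6.0
```

An estimate of this kind could be used to schedule a preventive maintenance task inside a planned maintenance window before the threshold-crossing alert is actually raised.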
System log (syslog) messages can also be gathered, filtered and subsequently analysed for the purpose of early fault detection and easier troubleshooting. Gathered performance-related data can be analysed and compared with preset standards or baseline values to identify major discrepancies. Remote monitoring (RMON) probes can also be used to monitor extensive performance data on selected Ethernet segments, while allowing more frequent probing as well as minimizing network load. Using RMON alarms and events, a managed device can monitor itself for rising and falling thresholds, sending the appropriate SNMP trap for proactive management.20 A more recent trend is the use of an event management system (EMS) in large networks to correlate various network events coming from syslogs, traps and log files for more effective monitoring at the NOC. Another trend is the use of network simulation tools to optimize predictive maintenance decisions in real time.21 For time division multiplex (TDM) networks based on SONET/SDH physical layers, extensive maintenance information is embedded in the overhead portion of the SONET/SDH frames. In addition, for all-optical networks, maintenance capability is embedded in the optical supervisory channel (OSC). In either case, it is good maintenance practice to make full use of the performance monitoring capabilities provided by the OSC or the SONET/SDH overheads and provision threshold-crossing alerts for critical performance measures. Examples of such performance measures include total power received, laser bias current, code violations, errored seconds, severely errored seconds and unavailable seconds. For optical telecommunications networks, alarm systems should be deployed to measure degradation of the received optical signal before serious outages occur. Examples of such alarm systems include embedded optical time-domain reflectometers (OTDRs), power measurement systems, polarization mode dispersion (PMD) measurement units and bit error rate (BER) measurement systems. For instance, an OTDR-based monitoring system can periodically measure the reflected loss of a fibre and compare this to a baseline value. The system goes into alarm when the measured reflection exceeds the provisioned threshold value.
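The rising and falling thresholds mentioned above are usually applied with hysteresis, so that a metric hovering around a single limit does not produce a stream of repeated alerts. The following Python sketch illustrates that idea in simplified form. It is not an implementation of the RMON alarm group: the function name, the sample values and the choice to arm only the rising alert at start-up are assumptions made for illustration.

```python
def threshold_events(samples, rising, falling):
    """Return (index, kind) events using rising/falling thresholds with hysteresis.

    A 'rising' event fires when a sample reaches the rising threshold, and
    cannot fire again until a 'falling' event has occurred, and vice versa.
    """
    if falling > rising:
        raise ValueError("falling threshold must not exceed rising threshold")
    events = []
    rising_armed = True    # assumption: only the rising alert is armed at start
    falling_armed = False
    for index, value in enumerate(samples):
        if value >= rising and rising_armed:
            events.append((index, "rising"))
            rising_armed, falling_armed = False, True
        elif value <= falling and falling_armed:
            events.append((index, "falling"))
            rising_armed, falling_armed = True, False
    return events


# Hypothetical utilisation samples (per cent) on one Ethernet segment.
samples = [10, 55, 60, 52, 58, 30, 20, 57]
print(threshold_events(samples, rising=50, falling=35))
# [(1, 'rising'), (5, 'falling'), (7, 'rising')]
```

The samples at indices 2 to 4 stay above the rising threshold but raise no further alerts, which is the behaviour that keeps the NOC from being flooded while a condition persists.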
It is also a good maintenance practice to monitor the condition of the protective elements of the fibre-packaging shield and the splice enclosures for better pre-warning of fibre malfunction.22 For international telephone circuits, routine end-to-end maintenance measurements related to overall loss, noise power, distortions, round-trip delay and echo should be taken periodically. Any significant deterioration in performance from the original line-up values can indicate potential faults and thus may require readjustment. ITU-T standard M.61023 outlines specific guidelines for the recommended periodicity of these routine maintenance measurements. In addition to periodic measurements, service quality observations that aim to assess the quality of telephone calls should be conducted either manually (by an observer without a data-recording machine), semi-automatically (with an automatic data-recording machine) or automatically (without an observer).19 A recent trend in predictive maintenance of distributed telecommunication systems is the application of expert systems, operations research methods, distributed sensors and machine learning for mechanized trouble analysis and dispatch.24–26 For instance, today, it is common practice for some equipment providers to offer their customers remote expert monitoring and maintenance services, based on machine-to-machine connections that rely on AI systems. In reference 27 the authors analysed data from more than 450,000 trouble tickets on 68,000 systems in service and reported that networks that experienced a major problem had approximately 65% fewer outages when being remotely monitored using expert systems and AI algorithms.

—Preventive Maintenance Procedures— Instituting a genuine and effective preventive network maintenance programme is a key step towards achieving world-class maintenance status. First, the network maintenance organization needs to review and be familiar with the manufacturer’s recommended preventive maintenance procedures. In addition, the organization should use its past experience to build additional maintenance tasks that it deems necessary. Technical documents from many international standards bodies (e.g. ANSI,28 Telcordia,29 ITU,30 IETF31) provide a wealth of information and recommendations concerning maintenance philosophy, concepts and methods, which should ideally be incorporated into the organizational preventive maintenance programme. The remainder of this section outlines a number of preventive network maintenance practices for world-class network maintenance organizations.

Boundary checks—Visual inspections of cable ways, air-conditioning/ventilation areas, connectors and electrical systems should be scheduled on a regular basis. For telecom equipment power systems, it is good practice to use smart controllers to continuously monitor the UPS system and batteries.7 It is also important to ensure that the area where equipment is housed fulfils environmental (e.g. humidity, temperature) requirements for the respective devices and that these are monitored and alarmed. A building management system can be used to automatically monitor the critical elements of the network infrastructure. These include UPS systems, electrical outlets with branch circuits and feeders, cooling devices and rack temperature, among others. Today, off-the-shelf environmental appliances can be acquired for remote management of access and environmental conditions. This remote activity covers sensor-based access monitoring, shock and vibration monitoring, temperature and humidity monitoring, remote equipment rebooting/power recycling, and customizable input contacts and output relays. Further, since network synchronization is crucial for the proper performance of digital communication networks, both the central office primary reference source (PRS) and the timing signal generator (TSG) should be continuously monitored. The synchronization equipment should also be provisioned to generate autonomous alerts and events to the central NMS.32 If the maintenance task requires shutdown of some network devices, it is important for the maintenance organization to be familiar with the boot sequence and the dependency of a particular device on other servers during boot-up and shutdown.17 Maintaining an ordered shutdown and reboot list for all devices/servers is a good maintenance practice. 
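The ordered shutdown and reboot list recommended above can be derived from a record of which devices depend on which others at boot time. The sketch below uses a topological sort from the Python standard library (`graphlib`, available from Python 3.9). It is a minimal sketch under stated assumptions: the device names and the dependency map are hypothetical, and a real list would also need to capture timing and manual steps.

```python
from graphlib import TopologicalSorter


def boot_and_shutdown_order(depends_on):
    """Derive boot and shutdown orders from a dependency map.

    depends_on: dict mapping each device to the set of devices that must
                already be running before it boots (hypothetical data).
    A circular dependency raises graphlib.CycleError.
    """
    boot = list(TopologicalSorter(depends_on).static_order())
    shutdown = list(reversed(boot))  # dependants go down before their servers
    return boot, shutdown


# Hypothetical dependencies among three servers.
depends_on = {
    "dns-server": set(),
    "file-server": {"dns-server"},
    "nms-server": {"dns-server", "file-server"},
}
boot, shutdown = boot_and_shutdown_order(depends_on)
print("boot:    ", boot)
print("shutdown:", shutdown)
```

Keeping the dependency map under version control and regenerating the list after every network change is one way to make sure the list stays current; that workflow is a suggestion, not something the cited references prescribe.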
Logistics checks—At the start of the maintenance window, it would be good practice to disable or discourage local and remote system access and provide notifications to users that the system is in the middle of a maintenance window. The voice-mail of the helpdesk should announce that the system is in a maintenance window and should notify when the service will be re-established.17 Care should also be taken to ensure that all equipment is properly labelled and that these labels are not placed on removable covers.7 In addition, for submarine telecommunications systems, since cable maintenance cost dominates the total network maintenance cost, it is important to update marine maps when a new cable is installed and to distribute these maps among seamen. Shore-based radar should also be used to monitor and warn ships crossing the cable route.

Safety checks—It is crucial that maintenance personnel adhere to strict regulations regarding safe work practices. As an example, recently the National Institute for Occupational Safety and Health (NIOSH) requested assistance in averting the high risk of fatal falls of workers involved in maintenance of wireless telecommunications towers.33 Investigation of seven deaths resulting from falls during construction and maintenance of telecommunications towers revealed that lack of awareness of the serious safety hazards related to the task, and unavailability of well-documented safety procedures were the main causes of the fatalities.

Software backups—Since data corruption and loss of critical equipment databases can have devastating consequences on network performance, it is important to outline policies and procedures for scheduled backups of these databases onto removable media, such as tapes and disks. While backup procedures should be fully documented, validated and automated to minimize human errors, backup policies should highlight the conventions used for labelling, handling and storing the removable media. Further, keeping a well-managed inventory of backup tapes is essential to performing file restoration in a timely manner. A backup software system can be deployed to maintain an online file-by-file inventory of backup tapes and to instruct the maintenance personnel to destroy a tape that has been reused too many times. Backup tapes should be stored in safe, locked, secure areas. To prevent loss or damage in case of natural disasters, it is also good practice to keep an off-site storage facility for the backup media.

Diagnostic checks—An intricate task in preventive network maintenance is to look for silent equipment failures. These are ‘dormant’ failures that could cause outages when activated. For instance, switching fabrics in optical switches are normally duplicated, with a primary and a secondary switch. If the standby switch fails and this failure goes undetected, then any subsequent failure of the primary switch may cause an outage. For protected SONET/SDH equipment, it is good maintenance practice to schedule the ‘exerciser’ request to run on a regular basis to warn against silent failures of the switching modules. Recall that the exerciser request uses bits 1–4 of the K1 SONET/SDH overhead byte to exercise the protection functionality of switching fabrics in a way that does not affect service. Further, for dense wavelength division multiplexing (DWDM) optical networks, it is good preventive maintenance practice to use optical spectrum analysers (OSAs) to detect potential non-linear degradation effects in the fibre plant. Embedded remote fibre test systems (RFTS) can also be used to continuously monitor the physical status of the outside fibre plant.34,35 As part of regular preventive maintenance, automatic test routines, such as LED tests, should also be run on the equipment.36 A more recent trend in network preventive maintenance is the use of the geographic information system (GIS), coupled with special database management systems and multimedia technologies, for better documentation management, and more effective fault location and network maintenance (see, for example, references 37, 38).

Recovery and fine-tuning procedures—Following network performance degradation or failure, affected devices should be remotely instructed to automatically initiate restoration mechanisms, such as reconfigurations and traffic rerouting.12 Prior care must be taken to ensure that the alternate backup route for traffic restoration has similar characteristics (e.g. end-to-end delay) to those of the original path. In addition, ‘offline’ network optimization strategies should be explored to further enhance network bandwidth efficiency. These optimization procedures include post-restoration strategies,39 load balancing40 and link defragmentation.41

Monitoring and Control

The successful implementation of best predictive and preventive network maintenance practices will generally undergo various stages, including the planning phase, compilation of best practices, documentation, and finally the dissemination and implementation phase. This whole process is both incremental and iterative and should be carefully monitored and controlled to ensure that best practices are both followed and implemented. Internal audits can play a key role both in measuring progress and in monitoring deviations from best maintenance practices. Further, for these audits to be efficiently implemented, and whenever applicable, validation test cases should be written against predictive and preventive best practices and executed accordingly. A sample abridged test case is illustrated in Figure 2.

Best Practice #18: Visual inspections of air-conditioning and ventilation areas should be scheduled on a regular basis
Type: Preventive
Consequences of not following best practice: Air-conditioning/ventilation systems can potentially be contaminated and clogged, reducing or eliminating air-flow. This can reduce the end-of-life expectation of electronic/optical components by up to 20–40% and trigger equipment failures.
Test Procedure:
• Take a look at the site’s air-conditioning and ventilation areas and check their condition (including the compressors and fuses).
• Go through the inspection report and check that the recorded inspection dates are set according to the planned schedule.
Test Result: Pass / Fail
Additional notes/actions:

Figure 2. A sample abridged test case
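Validation test cases of the kind shown in Figure 2 lend themselves to a simple structured record, which makes it easier to summarize audit results across many best practices. The sketch below is one possible representation and is not part of the original framework: the class name, the field names and the summary function are assumptions, and the second and third test cases are placeholders rather than real best practices.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BestPracticeTestCase:
    """One audit test case, modelled loosely on the fields of Figure 2."""
    number: int
    statement: str
    practice_type: str                      # e.g. "Preventive" or "Predictive"
    procedure: List[str] = field(default_factory=list)
    passed: Optional[bool] = None           # None means not yet executed
    notes: str = ""


def audit_summary(cases):
    """Summarize executed test cases: count, pass rate and failed numbers."""
    executed = [c for c in cases if c.passed is not None]
    failed = [c.number for c in executed if not c.passed]
    rate = (len(executed) - len(failed)) / len(executed) if executed else None
    return {"executed": len(executed), "pass_rate": rate, "failed": failed}


cases = [
    BestPracticeTestCase(
        number=18,
        statement="Visual inspections of air-conditioning and ventilation "
                  "areas should be scheduled on a regular basis",
        practice_type="Preventive",
        procedure=[
            "Inspect the air-conditioning and ventilation areas",
            "Check inspection dates against the planned schedule",
        ],
        passed=True,
    ),
    # Placeholder entries for illustration only.
    BestPracticeTestCase(19, "placeholder practice", "Preventive", passed=False),
    BestPracticeTestCase(20, "placeholder practice", "Predictive"),
]
print(audit_summary(cases))
# {'executed': 2, 'pass_rate': 0.5, 'failed': [19]}
```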

Conclusion

In this paper, we have attempted to define best practice in communications network management in the context of benchmarking standards and have indicated 13 associated methods and strategies. For each of these 13 functional areas, we have outlined best practices specifically vectored toward the maintenance of communications networks. By scrutinizing the important academic and empirical issues related to communications network maintenance, it is hoped that this work will trigger more investigations and case studies related to this important topic in network management.

Acknowledgement

The author would like to thank Dr Milton Knutson for proofreading the final version of this manuscript.

References

1. ITU Buenos Aires Action Plan, Work programme of the Telecommunication Development Sector for the period of 1994–1998. Programme 7. Improvement of Maintenance. http://www.itu.int/ITU-D/bdtint/baap/sec1_07.html [8 November 2004].
2. http://www.tia.org.uk [8 November 2004].
3. Best Maintenance Practices in Facility Management, White paper, Life Cycle Engineering, Inc., 2001. http://www.reliabilityweb.com/excerpts/excerpts/best_practices_fm.htm [8 November 2004].
4. World-Class Status with Best Practices in Maintenance Management, White paper, Invensys Systems, Inc., 2002. http://www.invensysips.com/products/avantis/downloads/World-Class%20Status%20with%20Best%20Practices.pdf [8 November 2004].
5. Hiatt BC. Best Practices in Maintenance: A 13 Step Program in Establishing a World Class Maintenance Organization, White paper, Anesta Corp., 2003. http://www.tpmonline.com/articles_on_total_productive_maintenance/management/13steps.htm [8 November 2004].
6. Smith R, Hawkins B. Benchmarks of maintenance organisation effectiveness. Maintenance Journal, Engineering Information Transfer Pty Ltd 2003; 16(3): 32–35.
7. Network Reliability and Interoperability Council V: Network Reliability Best Practices, Final Report. http://www.nric.org/pubs/ [8 October 2004].
8. Dunn S. Managing human error in maintenance. Maintenance Journal, Engineering Information Transfer Pty Ltd 2004; 17(3): 12–17.
9. Turner S. PM Optimisation maintenance analysis of the future. In International Conference of Maintenance Societies (ICOMS), Melbourne, 2001.
10. Maintenance: International Telephone circuits: Maintenance Methods, ITU/CCITT Recommendation M.730, 1993.
11. Knights PF. Downtime priorities, jack-knife diagrams, and the business cycle. Maintenance Journal, Engineering Information Transfer Pty Ltd 2004; 17(2): 15–22.
12. Maintenance Philosophy for Telecommunication Networks, ITU/CCITT Recommendation M.20, 10/1992.
13. Reason J, Hobbs A. Managing Maintenance Error. Ashgate: Aldershot, UK, 2003.
14. O’Hanlon T. CMMS best practices. Maintenance Journal, Engineering Information Transfer Pty Ltd 2004; 16(3): 19–22.
15. Weick KE, Sutcliffe KM. Managing the Unexpected: Assuring High Performance in an Age of Complexity. Jossey-Bass: San Francisco, 2001.
16. Udupa DK. TMN: Telecommunications Management Network (1st edn). McGraw-Hill: New York, 1999.
17. Limoncelli TA, Hogan C. The Practice of System and Network Administration. Addison-Wesley: Reading, MA, 2002.
18. Maintenance: Introduction and General Principles of Maintenance and Maintenance Organization. Technical Service, ITU/CCITT Recommendation M.75, 10/1992.
19. Maintenance: International Telephone circuits: Maintenance Methods, ITU/CCITT Recommendation M.733, 1993.
20. Leinwand A, Conroy KF. Network Management: A Practical Perspective (2nd edn). Addison-Wesley: Reading, MA, 1996.
21. Warren G et al. Network simulation enhancing network management in real-time. ACM Transactions on Modeling and Computer Simulation 2004; 14(2): 196–203.
22. Chamberlain J. Fiber optic physical layer preventive maintenance: an answer to increased reliability and plant longevity. In National Fiber Optic Engineers Conference (NFOEC), 22–23 September 1997; 99–109.
23. Maintenance: International Telephone circuits: Periodicity of Maintenance Measurements on Circuits Maintenance. ITU-T/CCITT Recommendation M.610, 1993.
24. Liebowitz J (ed.). Expert System Applications to Telecommunications. Wiley: Chichester, 1988.
25. Ballard D. Artificial Intelligence for Communications Resource Management. White paper, Reticular Systems Inc., 1996; 1–9.
26. Leckie C. Experience and trends in AI for network monitoring and diagnosis. In Proceedings of the IJCAI95 Workshop on AI in Distributed Information Networks, Montreal, Canada, 1995.
27. Zhang P, Landwehr J, Serban M. Quantifying the value of remote maintenance. CNET IT Papers, March 2004. http://itpapers.news.com/search.aspx?kw=avaya%2520and%2520voip&dtid=1 [8 November 2004].
28. http://webstore.ansi.org/ansidocstore/dept.asp?dept_id=20 [8 November 2004].
29. http://telecom-info.telcordia.com/site-cgi/ido/ [8 November 2004].
30. http://www.itu.int/library/ [8 November 2004].
31. http://www.ietf.org/rfc.html [8 November 2004].
32. Howe C. Managing the sync network: a critical resource. In National Fiber Optic Engineers Conference (NFOEC), 14–17 September 1998.
33. Preventing injuries and deaths from fall during construction and maintenance of telecommunication towers. DHHS (National Institute for Occupational Safety and Health) Publication No. 2001-156, July 2001. http://www.cdc.gov/niosh/2001156.html [8 November 2004].
34. Marsh JA. DWDM system testing: deployment and maintenance issues. In National Fiber Optic Engineers Conference (NFOEC), 26–30 September 1999.
35. Coker D. New business case for integrated network monitoring. In National Fiber Optic Engineers Conference (NFOEC), 29–31 August 2000.
36. Network maintenance alarm and control for network elements: generic requirements. TELCORDIA GR-474-CORE 1997; Issue 1.
37. Petkovic MS et al. The role of GIS in telecommunication network maintenance. In Proceedings of Third Joint European Conference and Exhibition on Geographical Information, JEC-GI’97, Vienna, 1997; 1269–1278.
38. Godin L. GIS in Telecommunications Management. ESRI Press: Redlands, CA, 2001.
39. Kawamura R. Architectures for ATM networks survivability. IEEE Communications Surveys 1998; 1(1).
40. Watts J, Taylor S. A practical approach to dynamic load balancing. IEEE Transactions on Parallel and Distributed Systems 1998; 9(3).
41. Acharya S et al. MobiPack: optimal Hitless SONET defragmentation in near-optimal cost. In IEEE Infocom 2004, Hong Kong.
