Applying Maintenance Methodologies
Support and maintenance are two of the core tasks that network engineers perform. The objective of network maintenance is to keep the network available with minimum service disruption and at acceptable performance levels. Network maintenance includes regularly scheduled tasks such as making backups and upgrading devices or software. Network support includes following up on interrupt-driven tasks, such as responding to device and link failures and to users who need help. Structured network maintenance provides a guideline that you can follow to maximize network uptime and minimize unplanned outages, but the exact techniques you should use are governed by your company’s policies and procedures and by your own experience and preferences. You must evaluate the commonly practiced models and methodologies used for network maintenance and identify the benefits that these models bring to your organization. You must also select the generalized maintenance models and planning tools that best fit your organization.
Maintenance Models and Methodologies
A typical network engineer’s job description usually includes elements such as installing, implementing, maintaining, and supporting network equipment. The exact set of tasks performed by network engineers might differ between organizations. Depending on the size and type of organization, some or all of the following are likely to be included in that set:
-
Tasks related to device installation and maintenance: Includes tasks such as installing devices and software, and creating and backing up configurations and software
-
Tasks related to failure response: Includes tasks such as supporting users that experience network problems, troubleshooting device or link failures, replacing equipment, and restoring backups
-
Tasks related to network performance: Includes tasks such as capacity planning, performance tuning, and usage monitoring
-
Tasks related to business procedures: Includes tasks such as documenting, compliance auditing, and service level agreement (SLA) management
-
Tasks related to security: Includes tasks such as following and implementing security procedures and security auditing
Network engineers must not only understand their own organization’s definition of network maintenance and the tasks it includes, but they must also comprehend the policies and procedures that govern how those tasks are performed. In many smaller networks, the process is largely interrupt driven. For example, when users have problems, you start helping them, or when applications experience performance problems, you upgrade links or equipment. Another example is that a company’s network engineer reviews and improves the security of the network only when security concerns or incidents are reported. Although this is obviously the most basic method of performing network maintenance, it clearly has some disadvantages, including the following:
-
Tasks that are beneficial to the long-term health of the network might be ignored, postponed, or forgotten.
-
Tasks might not be executed in order of priority or urgency, but instead in the order they were requested.
-
The network might experience more downtime than necessary because problems are not prevented.
You cannot avoid interrupt-driven work entirely because failures will happen and you cannot plan them. However, you can reduce the amount of incident-driven (interrupt-driven) work by proactively monitoring and managing systems.
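The difference between handling tasks in the order they were requested and handling them in order of priority can be illustrated with a short sketch. The task names and the numeric priority scale (lower number means more urgent) are assumptions made for this example only.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                      # assumed scale: 1 = most urgent
    arrival: int                       # order in which the request came in
    name: str = field(compare=False)   # not used when comparing tasks


def fifo_order(tasks):
    """Interrupt-driven handling: work on tasks in the order they arrived."""
    return [t.name for t in sorted(tasks, key=lambda t: t.arrival)]


def priority_order(tasks):
    """Structured handling: most urgent first, arrival order breaks ties."""
    heap = list(tasks)
    heapq.heapify(heap)
    return [heapq.heappop(heap).name for _ in range(len(heap))]


# Hypothetical requests, for illustration only.
tasks = [
    Task(priority=3, arrival=1, name="move printer"),
    Task(priority=1, arrival=2, name="core link down"),
    Task(priority=2, arrival=3, name="patch firewall"),
]

print(fifo_order(tasks))      # ['move printer', 'core link down', 'patch firewall']
print(priority_order(tasks))  # ['core link down', 'patch firewall', 'move printer']
```

In the first ordering, the link failure waits behind a printer move simply because it was reported later.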
The alternative to the interrupt-driven model of maintenance is structured network maintenance. Structured network maintenance predefines and plans many of the processes and procedures. This proactive approach not only reduces the frequency and quantity of user, application, and business problems, it also makes the responses to incidents more efficient. The structured approach to network maintenance has some clear benefits over the interrupt-driven approach, including the following:
-
Reduced network downtime: By discovering and preventing problems before they happen, you can prevent or at least minimize network downtime. You should strive to maximize mean time between failures (MTBF). Even if you cannot prevent problems, you can reduce the amount of time it takes to fix them by following proper procedures and using adequate tools. You should strive to minimize mean time to repair (MTTR). Maximizing MTBF and minimizing MTTR translate to lower financial damage and higher user satisfaction.
-
More cost-effectiveness: Performance monitoring and capacity planning allow you to make adequate budgeting decisions for current and future networking needs. Choosing proper equipment and using it to capacity means a better price/performance ratio over the lifetime of your equipment. Lower maintenance costs and less network downtime also help to reduce the price/performance ratio.
-
Better alignment with business objectives: Within the structured network maintenance framework, instead of prioritizing tasks and assigning budgets based on incidents, time and resources are allocated to processes based on their importance to the business. For example, upgrades and major maintenance jobs are not scheduled during critical business hours.
-
Higher network security: Attention to network security is part of structured network maintenance. If prevention techniques do not stop a breach or attack, detection mechanisms help to identify and contain it, and support staff are notified through logs and alarms. Monitoring allows you to observe network vulnerabilities and needs and to justify plans for strengthening network security.
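The effect of MTBF and MTTR on downtime can be made concrete with the commonly used steady-state availability formula, availability = MTBF / (MTBF + MTTR). The following is a minimal sketch under that assumption. The figures are invented for illustration and do not come from any real device.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time a component is expected to be up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours, mttr_hours, hours_per_year=8760):
    """Expected downtime per year, derived from the same two figures."""
    unavailable = mttr_hours / (mtbf_hours + mttr_hours)
    return unavailable * hours_per_year


# Illustrative figures only: one failure roughly every 999 hours.
print(availability(999, 1))              # 0.999
print(annual_downtime_hours(999, 1))     # roughly 8.76 hours per year
print(annual_downtime_hours(999, 0.5))   # halving MTTR roughly halves downtime
```

The sketch shows why both goals matter: raising MTBF and lowering MTTR each reduce the expected downtime.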
Several well-known network maintenance methodologies have been defined by a variety of organizations, including the International Organization for Standardization (ISO), the International Telecommunication Union Telecommunication Standardization Sector (ITU-T), and Cisco Systems. Network support engineers must study these models and incorporate the elements that fit the needs of their environment. Four examples of well-known network maintenance models and methodologies are as follows:
-
IT Infrastructure Library (ITIL): This framework for IT service management describes best practices that help in providing high-quality IT services that are aligned with business needs and processes.
-
FCAPS: This model, defined by ISO, divides network management tasks into five categories: fault management, configuration management, accounting management, performance management, and security management. The term FCAPS is derived from the first letter of each management category.
-
Telecommunications Management Network (TMN): The ITU-T integrated and refined the FCAPS model to define a conceptual framework for the management of telecommunications networks. The framework describes establishing a management network that interfaces with a telecommunications network at several different points for the purpose of manual and automated maintenance tasks.
-
Cisco Lifecycle Services: This approach is a model that helps businesses to successfully deploy, operate, and optimize Cisco technologies in their network. This model is sometimes also referred to as the Prepare, Plan, Design, Implement, Operate, and Optimize (PPDIOO) model, based on the names of the six phases of the network lifecycle. Network maintenance tasks are usually considered part of the operate and optimize phases of the cycle.
Determining Procedures and Tools to Support Maintenance Models
Those who decide to perform structured network maintenance either select one of the recommended network maintenance models or build a custom model that meets their particular needs by taking elements from different models. For example, a company that chooses the FCAPS model will have five main management tasks on its hands:
-
Fault management: Fault management is the domain where network problems are discovered and corrected. Although some of this effort is inevitably event driven, the focus here is on preventive maintenance. Proper steps are taken to prevent breakdowns and past incidents from recurring; hence, network downtime is minimized.
-
Configuration management: Configuration management is concerned with tasks such as installation, identification, and configuration of hardware (including components such as line cards, modules, memory, and power supplies) and services. Configuration management also includes software and firmware management, change control, inventory management, plus monitoring and managing the deployment status of devices.
-
Accounting management: Accounting management focuses on how to optimally distribute resources among enterprise subscribers. This helps to minimize the cost of operations by making the most effective use of the systems available. Cost distribution and billing of departments/users are also accounting management tasks.
-
Performance management: Performance management is about managing the overall performance of the enterprise network. The focus here is on maximizing throughput, identifying bottlenecks, and forming plans to enhance performance.
-
Security management: Security management is responsible for ensuring confidentiality, integrity, and availability (CIA). The network must be protected against unauthorized access and physical and electronic sabotage. Security management ensures CIA through authentication, authorization, and accounting (AAA), plus other techniques such as encryption, network perimeter protection, intrusion detection/prevention, and security monitoring and reporting.
Upon selection of a network maintenance model, you must translate the theoretical model to practical procedures that structure the network maintenance processes for your network. Figure 1-1 shows an example where four procedures are defined for the configuration management element of the FCAPS model.
After you have defined your processes and procedures, it becomes much easier to see what functionalities and tools you need to have in your network management toolkit to support these processes. As a result, you can select an efficient and cost-effective network management and support toolkit that offers those tools and hopefully meets your budgetary constraints. Figure 1-1 shows a network management toolkit (in the rightmost column) that offers four tools, one for each of the defined procedures (in the middle column). It is noteworthy that an interrupt-driven network maintenance approach usually leads to a fragmented network management toolkit. The reason is that tools are acquired on an on-demand basis to deal with a particular need, instead of considering the toolkit as a whole and building it to support all the network maintenance processes.
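The idea of checking a toolkit against the defined procedures can be sketched as a simple coverage check. The procedure and tool names below are hypothetical placeholders chosen for this example. They are not taken from Figure 1-1.

```python
def uncovered_procedures(procedures, toolkit):
    """Return the procedures that no tool in the toolkit supports.

    procedures: iterable of procedure names
    toolkit:    mapping of tool name -> list of procedures it supports
    """
    covered = {p for supported in toolkit.values() for p in supported}
    return sorted(p for p in procedures if p not in covered)


# Hypothetical procedures for the configuration management element.
procedures = [
    "change control",
    "configuration backup",
    "inventory management",
    "software management",
]

# Hypothetical toolkit assembled piecemeal, as in an interrupt-driven approach.
toolkit = {
    "tool_a": ["configuration backup"],
    "tool_b": ["software management", "configuration backup"],
}

print(uncovered_procedures(procedures, toolkit))
# ['change control', 'inventory management']
```

A toolkit built one purchase at a time tends to show both gaps and overlaps in a check like this one.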
Maintenance Processes and Procedures
Network maintenance involves many tasks. Some of these tasks are nearly universal, whereas others might be deployed by only some organizations or performed in unique ways. Processes such as maintenance planning, change control, documentation, disaster recovery, and network monitoring are common elements of all network maintenance plans. To establish procedures that fit an organization’s needs best, network engineers need to do the following:
-
Identify essential network maintenance tasks.
-
Recognize and describe the advantages of scheduled maintenance.
-
Evaluate the key decision factors that affect change control procedures to create procedures that fit the organization’s needs.
-
Describe the essential elements of network documentation and its function.
-
Plan for efficient disaster recovery.
-
Describe the importance of network monitoring and performance measurement as an integral element of a proactive network maintenance strategy.
Network Maintenance Task Identification
Regardless of the network maintenance model and methodology you choose or the size of your network, certain tasks must be included in your network maintenance plan. The amount of resources, time, and money you spend on these tasks will vary, however, depending on the size and type of your organization. All network maintenance plans need to include procedures to perform the following tasks:
-
Accommodating adds, moves, and changes: Networks are always undergoing changes. As people move and offices are changed and restructured, network devices such as computers, printers, and servers might need to be moved, and configuration and cabling changes might be necessary. These adds, moves, and changes are a normal part of network maintenance.
-
Installation and configuration of new devices: This task includes adding ports, link capacity, network devices, and so on. Implementation of new technologies or installation and configuration of new devices is handled by internal staff, by a different group within your organization, or by an external party.
-
Replacement of failed devices: Whether replacement of failed devices is done through service contracts or done in house by support engineers, it is an important network maintenance task.
-
Backup of device configurations and software: This task is linked to the task of replacing failed devices. Without good backups of both software and configurations, replacing failed equipment or recovering from severe device failures will not be trouble free and might take a long time.
-
Troubleshooting link and device failures: Failures are inevitable; diagnosing and resolving failures related to network components, links, or service provider connections are essential tasks within a network engineer’s job.
-
Software upgrading or patching: Network maintenance requires that you stay informed of available software upgrades or patches and apply them when necessary. Critical performance problems or security vulnerabilities are often addressed by software upgrades or patches.
-
Network monitoring: Monitoring operation of the devices and user activity on the network is also part of a network maintenance plan. Network monitoring can be performed using simple mechanisms such as collection of router and firewall logs or by using sophisticated network monitoring applications.
-
Performance measurement and capacity planning: Because the demand for bandwidth is continually increasing, another network maintenance task is to perform at least some basic measurements to decide when it is time to upgrade links or equipment and to justify the cost of the corresponding investments. This proactive approach allows you to plan for upgrades (capacity planning) before bottlenecks form, congestion is experienced, or failures occur.
-
Writing and updating documentation: Preparing proper network documentation that describes the current state of the network for reference during implementation, administration, and troubleshooting is a mandatory network maintenance task within most organizations. Network documentation must be kept current.
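As an illustration of the basic measurements mentioned under capacity planning, the following sketch fits a straight line to monthly link utilization samples and estimates how many months remain before a threshold is reached. The linear trend, the 80 percent threshold, and the sample values are all assumptions made for this example. Real traffic growth is rarely this regular.

```python
def months_until_threshold(samples, threshold=80.0):
    """Estimate months until utilization reaches the threshold.

    samples: utilization percentages, one per month, oldest first.
    Returns None when the fitted trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (threshold - intercept) / slope  # month index where the line hits the threshold
    return max(0.0, crossing - (n - 1))         # measured from the latest sample


# Invented samples: utilization growing five points per month.
print(months_until_threshold([40, 45, 50, 55, 60]))  # 4.0
```

An estimate like this gives a rough lead time for budgeting an upgrade before the link becomes a bottleneck.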
Network Maintenance Planning
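You must build processes and procedures for performing your network maintenance tasks; this is called network maintenance planning. Network maintenance planning includes scheduling maintenance, formalizing change-control procedures, establishing network documentation procedures, establishing effective communication, defining templates, procedures, and conventions, and planning for disaster recovery.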
Scheduling Maintenance
After you have determined the tasks and processes that are part of network maintenance, you can assign priorities to them. You can also determine which of these tasks will be interrupt driven by nature (hardware failures, outages, and so on) and which tasks are parts of a long-term maintenance cycle (software patching, backups, and so on). For the long-term tasks, you will have to work out a schedule that guarantees that these tasks will be done regularly and will not get lost in the busy day-to-day work schedule. For some tasks, such as moves and changes, you can adopt a procedure that is partly interrupt driven (incoming change requests) and partly scheduled: change requests need not be handled immediately, but can wait for the next scheduled timeframe. This allows you to properly prioritize tasks but still have a predictable lead time that the requesting party knows they can count on for a change to be executed.
With scheduled maintenance, tasks that are disruptive to the network are scheduled during off-hours. You can select maintenance windows during evenings or weekends, when outages will be acceptable, thereby reducing unnecessary outages during office hours. The uptime of the network will increase as both the number of unplanned outages and their duration will be reduced. To summarize, the benefits of scheduled maintenance include the following:
-
Network downtime is reduced.
-
Long-term maintenance tasks will not be neglected or forgotten.
-
You have predictable lead times for change requests.
-
Disruptive maintenance tasks can be scheduled during assigned maintenance windows, reducing downtime during production hours.
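The predictable lead time for change requests can be sketched as a small calculation: given the time a request arrives, find the start of the next maintenance window. The weekly window on Saturday at 22:00 is an assumption chosen for this example.

```python
from datetime import datetime, timedelta


def next_window(requested_at, weekday=5, hour=22):
    """Return the start of the first weekly window at or after requested_at.

    weekday follows datetime.weekday(): Monday is 0, Saturday is 5.
    """
    candidate = requested_at.replace(hour=hour, minute=0, second=0, microsecond=0)
    days_ahead = (weekday - requested_at.weekday()) % 7
    candidate += timedelta(days=days_ahead)
    if candidate < requested_at:
        # This week's window has already started, so use next week's.
        candidate += timedelta(days=7)
    return candidate


# A request received on Wednesday morning is executed the following Saturday.
print(next_window(datetime(2024, 1, 3, 10, 0)))   # 2024-01-06 22:00:00

# A request received after the window opened waits one more week.
print(next_window(datetime(2024, 1, 6, 23, 0)))   # 2024-01-13 22:00:00
```

The requesting party can be told the execution time as soon as the request is logged.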
Formalizing Change-Control Procedures
Sometimes it is necessary to make changes to configuration, software, or hardware. Any change that you make has an associated risk due to possible mistakes, conflicts, or bugs. Before making any change, you must first determine the impact of the change on the network and balance this against the urgency of the change. If the anticipated impact is high, you might need to justify the need for the change and obtain authorization to proceed. High-impact changes are usually made during maintenance windows specifically scheduled for this purpose. On the other hand, there will also have to be a process for emergency changes. For example, if a broadcast storm occurs in your network and a link needs to be disconnected to break the loop and allow the network to stabilize, you might not be able to wait for authorization and the next maintenance window. In many companies, change control is formalized and answers the following types of questions:
-
Which types of change require authorization and who is responsible for authorizing them?
-
Which changes have to be done during a maintenance window and which changes can be done immediately?
-
What kind of preparation needs to be done before executing a change?
-
What kind of verification needs to be done to confirm that the change was effective?
-
What other actions (such as updating documentation) need to be taken after a successful change?
-
What actions should be taken when a change has unexpected results or causes problems?
-
What conditions allow skipping some of the normal change procedures and which elements of the procedures should still be followed?
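A formalized change-control policy can be expressed as a small decision function. The sketch below encodes one possible policy: emergency changes proceed immediately, high-impact changes need authorization and a maintenance window, and everything else waits for the next scheduled slot. The field names, the two impact levels, and the rule that emergency changes are reviewed after the fact are assumptions for this example. Your organization’s policy will differ.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    description: str
    impact: str              # assumed levels: "low" or "high"
    emergency: bool = False


def handling(change):
    """Return how a change should be handled under the assumed policy."""
    if change.emergency:
        # Example from the text: breaking a loop during a broadcast storm.
        return {"authorization": "after the fact", "timing": "immediate"}
    if change.impact == "high":
        return {"authorization": "required", "timing": "maintenance window"}
    return {"authorization": "not required", "timing": "next scheduled slot"}


print(handling(ChangeRequest("upgrade core switch software", impact="high")))
# {'authorization': 'required', 'timing': 'maintenance window'}

print(handling(ChangeRequest("disconnect looped link", impact="high", emergency=True)))
# {'authorization': 'after the fact', 'timing': 'immediate'}
```

Even in the emergency path, steps such as verification and documentation updates would normally still apply.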
Establishing Network Documentation Procedures
An essential part of any network maintenance is building and keeping up-to-date network documentation. Without up-to-date network documentation, it is difficult to correctly plan and implement changes, and troubleshooting is tedious and time-consuming. Usually, documentation is created as part of network design and implementation, but keeping it up-to-date is part of network maintenance. Therefore, any good change-control procedure will include updating the relevant documentation after the change is made. Documentation can be as simple as a few network drawings, equipment and software lists, and the current configurations of all devices. On the other hand, documentation can be extensive, describing all implemented features, design choices that were made, service contract numbers, change procedures, and so on. Typical elements of network documentation include the following:
-
Network drawings: Diagrams of the physical and logical structure of the network
-
Connection documentation: Lists of all relevant physical connections, such as patches, connections to service providers, and power circuits
-
Equipment lists: Lists of all devices, part numbers, serial numbers, installed software versions, software licenses (if applicable), and warranty/service information
-
IP address administration: Lists of the IP subnet scheme and all IP addresses in use
-
Configurations: A set of all current device configurations or even an archive that contains all previous configurations
-
Design documentation: Documents describing the motivation behind certain implementation choices
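IP address administration lends itself to a simple automated check. The following sketch uses Python’s standard ipaddress module to summarize how many host addresses of a subnet are in use and to flag documented addresses that fall outside the subnet. The subnet and addresses come from a documentation-only range and are invented for this example.

```python
import ipaddress


def subnet_report(subnet, used):
    """Summarize address usage for one documented subnet."""
    net = ipaddress.ip_network(subnet)
    used_addrs = {ipaddress.ip_address(a) for a in used}
    # Addresses recorded against this subnet that do not belong to it.
    outside = sorted(a for a in used_addrs if a not in net)
    hosts = list(net.hosts())
    free = [h for h in hosts if h not in used_addrs]
    return {
        "total_hosts": len(hosts),
        "in_use": len(used_addrs) - len(outside),
        "free": len(free),
        "outside_subnet": [str(a) for a in outside],
    }


print(subnet_report("192.0.2.0/29", ["192.0.2.1", "192.0.2.2", "192.0.2.9"]))
# {'total_hosts': 6, 'in_use': 2, 'free': 4, 'outside_subnet': ['192.0.2.9']}
```

A report like this can expose documentation errors, such as an address listed under the wrong subnet.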
Establishing Effective Communication
Network maintenance is usually performed by a team of people and cannot easily be divided into exclusive sets of tasks that do not affect each other. Even if you have specialists who are responsible for particular technologies or sets of devices, they will always have to communicate with team members who are responsible for different technologies or other devices. The best means of communication depends on the situation and organization, but a major consideration for choosing a communication method is how easily it is logged and shared with the network maintenance team.
Communication is vital both during troubleshooting and technical support and afterward. During troubleshooting, certain questions must be answered, such as the following:
-
Who is making changes and when?
-
How does the change affect others?
-
What are the results of tests that were done, and what conclusions can be drawn?
If actions, test results, and conclusions are not communicated between team members, the process in the hands of one team member can be disruptive to the process handled by another team member. You don’t want to create new problems while solving others.
In many cases, diagnosis and resolution must be done by several persons or during multiple sessions. In those cases, it is important to have a log of actions, tests, communication, and conclusions. These must be distributed among all those involved. With proper communication, one team member should be able to take over comfortably where another team member has left off. Communication is also required after completion of troubleshooting or making changes.
Defining Templates/Procedures/Conventions (Standardization)
When a team of people executes the same or related tasks, it is important that those tasks be performed consistently. Because people might inherently have different working methods, styles, and backgrounds, standardization makes sure that work performed by different people remains consistent. Even if two different approaches to the same task are both valid, they might yield inconsistent results. One of the ways to streamline processes and make sure that tasks are executed in a consistent manner is to define and document procedures; this is called standardization. Defining and using templates is an effective method of network documentation, and it helps in creating a consistent network maintenance process. The following are some of the types of questions answered by network conventions, templates, and best practices (standardization) documentation:
-
Are logging and debug time stamps set to local time or coordinated universal time (UTC)?
-
Should access lists end with an explicit “deny any”?
-
In an IP subnet, is the first or the last valid IP address allocated to the local gateway?
In many cases, you can configure a device in several different ways to achieve the same results. However, using different methods of achieving the same results in the same network can easily lead to confusion, especially during troubleshooting. Under pressure, valuable time can be wasted in verifying configurations that are assumed incorrect simply because they are configured differently.
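A documented convention is easier to follow when it can be checked. As a minimal sketch, the function below computes which address a gateway should have under a "first" or "last" valid host convention and compares it with the address actually configured. The subnet, the addresses, and the choice of convention are assumptions for this example.

```python
import ipaddress


def expected_gateway(subnet, convention="first"):
    """Return the gateway address implied by the convention for a subnet."""
    net = ipaddress.ip_network(subnet)
    if net.num_addresses < 4:
        raise ValueError("subnet too small for a first/last host convention")
    if convention == "first":
        return net[1]    # first valid host address
    if convention == "last":
        return net[-2]   # last valid host address, just below the broadcast
    raise ValueError("convention must be 'first' or 'last'")


def follows_convention(subnet, configured_gateway, convention="first"):
    """True when the configured gateway matches the documented convention."""
    return ipaddress.ip_address(configured_gateway) == expected_gateway(subnet, convention)


print(expected_gateway("10.1.1.0/24", "first"))                    # 10.1.1.1
print(expected_gateway("10.1.1.0/24", "last"))                     # 10.1.1.254
print(follows_convention("10.1.1.0/24", "10.1.1.254", "first"))    # False
```

A deviation found this way is not necessarily a fault, but it is exactly the kind of difference that costs time during troubleshooting.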
Planning for Disaster Recovery
Although the MTBF of certain modern network devices is claimed to be 5, 7, or 10 years or more, you must always consider the possibility of device failure. By having a plan for such occasions and knowing what to do, you can significantly reduce the amount of downtime. One way to reduce the impact of failure is to build redundancy into the network at critical points and eliminate single points of failure. A single point of failure is a device or link that has no backup and whose failure can cause major damage to your network operation. However, mainly because of budgetary limitations, it is not always possible to make every single link, component, and device redundant. Disasters, natural and otherwise, must also be taken into account. For example, a flood or fire could strike the server room. The quicker you can replace failed devices and restore functionality, the quicker your network will be running again. To replace a failed device, you need the following items:
-
Replacement hardware
-
The current software version for the device
-
The current configuration for the device
-
The tools to transfer the software and configuration to the device
-
Licenses (if applicable)
-
Knowledge of the procedures to install software, configurations, and licenses
Missing any of the listed items severely affects the time it takes to replace the device. To make sure that you have these items available when you need them, use the following guidelines:
-
Replacement hardware: You need either spare devices or a service contract with a distributor or vendor that will replace the failed hardware. Typically, this means that you need documentation of the exact hardware part numbers, serial numbers, and service contract numbers for the devices.
-
Current software: Usually devices are delivered with a particular version of software, which is not necessarily the same as the version that you were running on the failed device. Therefore, you should have a repository where you store all current software versions in use on your network.
-
Current configuration: In addition to creating backups of your configurations any time you make a change, you need to have a clear versioning system so that you know which configuration is the most recent.
-
Tools: You need to have the appropriate tools to transfer software and configurations to the new device, which you should be able to do even if the network is unavailable.
-
Licenses: If your software requires a license, you need to have that license or know the procedure to obtain a new license.
-
Knowledge: Because these procedures are used infrequently, you might not have them committed to memory. Having all necessary documentation ready, however, will save time in executing the necessary procedures and will also decrease the risk of making mistakes.
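One way to obtain the clear versioning mentioned for configuration backups is a file naming scheme that embeds a UTC time stamp, so that the most recent backup can be found by sorting names. The following is a sketch of such a scheme. The name format and the device names are assumptions for this example and are not a feature of any particular backup tool.

```python
from datetime import datetime, timezone


def backup_name(device, taken_at):
    """Build a backup file name; taken_at should be a timezone-aware datetime."""
    stamp = taken_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{device}_{stamp}.cfg"


def latest_backup(names, device):
    """Return the newest backup name for a device, or None if there is none."""
    candidates = [
        n for n in names
        # Everything before the final underscore must match the device exactly.
        if n.endswith(".cfg") and n[:-4].rsplit("_", 1)[0] == device
    ]
    # Fixed-width time stamps sort chronologically as plain strings.
    return max(candidates, default=None)


names = [
    backup_name("rtr1", datetime(2024, 1, 5, 12, 0, tzinfo=timezone.utc)),
    backup_name("rtr1", datetime(2024, 3, 1, 8, 30, tzinfo=timezone.utc)),
    backup_name("rtr1_b", datetime(2024, 6, 1, 0, 0, tzinfo=timezone.utc)),
]

print(latest_backup(names, "rtr1"))   # rtr1_20240301T083000Z.cfg
```

Using UTC in the name avoids ambiguity when team members or devices are in different time zones.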
In short, the key factors to a successful disaster recovery are defining and documenting recovery procedures and making sure you always have the necessary elements available in case a disaster strikes.
Network Monitoring and Performance Measurement
Another process that helps you transform your network maintenance process to a less interrupt-driven, more methodical approach is the implementation of network and performance monitoring. Ideally, you want to be able to spot potential issues before they develop into problems, and to be able to isolate problems faster when they occur. Gathering performance data enables you to upgrade before a resource shortage develops into a performance problem. Gathering performance data also helps in building a business case for investing in network upgrades. When you are committed to meeting the SLAs for the performance of your network, or if your service provider is guaranteeing you a certain level of service, monitoring network performance can assist you in determining whether those SLAs are met.
One essential step in network performance measurement and monitoring is choosing the variables to be monitored and measured, including interface status, interface load, CPU load, and memory usage of your devices. More sophisticated metrics, such as measurements of network delay, jitter, or packet loss, can also be included in a network monitoring and performance measurement policy. The network performance measurement and monitoring policy and the corresponding choices of metrics will differ for each organization and need to be aligned to the business requirements.
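A monitoring policy of this kind can be sketched as a set of metrics with thresholds. The example below flags metrics that exceed their limits and computes jitter as the mean absolute difference between consecutive delay samples, which is one simple definition among several. The metric names, threshold values, and samples are assumptions for this example.

```python
def mean_jitter(delays_ms):
    """Mean absolute difference between consecutive delay samples."""
    if len(delays_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
    return sum(diffs) / len(diffs)


def breaches(sample, thresholds):
    """Return the names of the metrics whose value exceeds its threshold."""
    return sorted(
        metric for metric, limit in thresholds.items()
        if sample.get(metric, 0) > limit
    )


# Hypothetical policy and measurements.
thresholds = {"cpu_load_pct": 80, "memory_used_pct": 85, "interface_load_pct": 75}
sample = {"cpu_load_pct": 91, "memory_used_pct": 60, "interface_load_pct": 85}

print(breaches(sample, thresholds))   # ['cpu_load_pct', 'interface_load_pct']
print(mean_jitter([20, 24, 22, 30]))  # about 4.67 ms
```

The thresholds themselves should come from the business requirements and SLAs, not from the tool.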