Thursday, May 26, 2011

Chapter 01: Planning Maintenance for Complex Networks (Part01)

Many modern business processes and transactions depend on high availability and reliability of an organization’s computer network and computing resources. Downtime can cause significant loss of reputation/revenue. Planning the network maintenance processes and procedures facilitates high availability and cost control. This chapter presents and evaluates commonly practiced models and methodologies for network maintenance, introduces the processes and procedures that are fundamental parts of any network maintenance methodology, and identifies and evaluates tools, applications, and resources that support network maintenance processes.

Applying Maintenance Methodologies

Support and maintenance are two of the core tasks that network engineers perform. The objective of network maintenance is to keep the network available with minimum service disruption and at acceptable performance levels. Network maintenance includes regularly scheduled tasks such as making backups and upgrading devices or software. Structured network maintenance provides a guideline that you can follow to maximize network uptime and minimize unplanned outages, but the exact techniques you should use are governed by your company’s policies and procedures and your experience and preferences. Network support includes following up on interrupt-driven tasks, such as responding to device and link failures and to users who need help. You must evaluate the commonly practiced models and methodologies used for network maintenance and identify the benefits that these models bring to your organization. You must also select generalized maintenance models and planning tools that fit your organization the best.

Maintenance Models and Methodologies

A typical network engineer’s job description usually includes elements such as installing, implementing, maintaining, and supporting network equipment. The exact set of tasks performed by network engineers might differ between organizations. Depending on the size and type of organization, some or all of the following are likely to be included in that set:

  • Tasks related to device installation and maintenance: Includes tasks such as installing devices and software, and creating and backing up configurations and software

  • Tasks related to failure response: Includes tasks such as supporting users that experience network problems, troubleshooting device or link failures, replacing equipment, and restoring backups

  • Tasks related to network performance: Includes tasks such as capacity planning, performance tuning, and usage monitoring

  • Tasks related to business procedures: Includes tasks such as documenting, compliance auditing, and service level agreement (SLA) management

  • Tasks related to security: Includes tasks such as following and implementing security procedures and security auditing

Network engineers must not only understand their own organization’s definition of network maintenance and the tasks it includes, but they must also comprehend the policies and procedures that govern how those tasks are performed. In many smaller networks, the process is largely interrupt driven. For example, when users have problems, you start helping them, or when applications experience performance problems, you upgrade links or equipment. Another example is that a company’s network engineer reviews and improves the security of the network only when security concerns or incidents are reported. Although this is obviously the most basic method of performing network maintenance, it clearly has some disadvantages, including the following:

  • Tasks that are beneficial to the long-term health of the network might be ignored, postponed, or forgotten.

  • Tasks might not be executed in order of priority or urgency, but instead in the order they were requested.

  • The network might experience more downtime than necessary because problems are not prevented.

You cannot avoid interrupt-driven work entirely because failures will happen and you cannot plan them. However, you can reduce the amount of incident-driven (interrupt-driven) work by proactively monitoring and managing systems.

The alternative to the interrupt-driven model of maintenance is structured network maintenance. Structured network maintenance predefines and plans many of the processes and procedures. This proactive approach not only reduces the frequency and quantity of user, application, and business problems, it also makes the responses to incidents more efficient. The structured approach to network maintenance has some clear benefits over the interrupt-driven approach, including the following:

  • Reduced network downtime: By discovering and preventing problems before they happen, you can prevent or at least minimize network downtime. You should strive to maximize mean time between failures (MTBF). Even if you cannot prevent problems, you can reduce the amount of time it takes to fix them by following proper procedures and using adequate tools. You should strive to minimize mean time to repair (MTTR). Maximizing MTBF and minimizing MTTR translates to lower financial damage and higher user satisfaction.

  • More cost-effectiveness: Performance monitoring and capacity planning allow you to make adequate budgeting decisions for current and future networking needs. Choosing proper equipment and using it to capacity means a better price/performance ratio over the lifetime of your equipment. Lower maintenance costs and less network downtime also help to improve the price/performance ratio.

  • Better alignment with business objectives: Within the structured network maintenance framework, instead of prioritizing tasks and assigning budgets based on incidents, time and resources are allocated to processes based on their importance to the business. For example, upgrades and major maintenance jobs are not scheduled during critical business hours.

  • Higher network security: Attention to network security is part of structured network maintenance. If prevention techniques do not stop a breach or attack, detection mechanisms help contain it, and support staff are notified through logs and alarms. Monitoring allows you to observe network vulnerabilities and needs and to justify plans for strengthening network security.
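The relationship between MTBF, MTTR, and availability can be made concrete with a small calculation. The following is a minimal sketch, assuming the common definitions MTBF = uptime / number of failures, MTTR = downtime / number of failures, and availability = MTBF / (MTBF + MTTR); the function name and the sample figures are hypothetical and not taken from the chapter.

```python
def reliability_summary(period_hours, outage_durations_hours):
    """Return (MTBF, MTTR, availability) for one observation period.

    Assumes at least one outage; with zero failures MTBF is undefined.
    """
    failures = len(outage_durations_hours)
    if failures == 0:
        raise ValueError("at least one outage is needed to compute MTBF and MTTR")
    downtime = sum(outage_durations_hours)
    uptime = period_hours - downtime
    mtbf = uptime / failures      # average running time between failures
    mttr = downtime / failures    # average time to restore service
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability


# Hypothetical month (720 hours) with three outages of 2, 1, and 3 hours.
mtbf, mttr, availability = reliability_summary(720, [2, 1, 3])
print(f"MTBF={mtbf:.0f} h, MTTR={mttr:.0f} h, availability={availability:.4%}")
```

Raising MTBF or lowering MTTR both push availability toward 100 percent, which is why the structured approach targets both.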

Several well-known network maintenance methodologies have been defined by a variety of organizations, including the International Organization for Standardization (ISO), International Telecommunication Union Telecommunication Standardization sector (ITU-T), and Cisco Systems. Network support engineers must study these models and incorporate their elements according to the needs of their environment. Four examples of well-known network maintenance methodologies are as follows:

  • IT Infrastructure Library (ITIL): This framework for IT service management describes best practices that help in providing high-quality IT services that are aligned with business needs and processes.

  • FCAPS: This model, defined by ISO, divides network management tasks into five different categories:

    • Fault management

    • Configuration management

    • Accounting management

    • Performance management

    • Security management

Note that the term FCAPS is derived from the first letter of each management category.

  • Telecommunications Management Network (TMN): The ITU-T integrated and refined the FCAPS model to define a conceptual framework for the management of telecommunications networks. The framework describes establishing a management network that interfaces with a telecommunications network at several different points for the purpose of manual and automated maintenance tasks.

  • Cisco Lifecycle Services: This approach is a model that helps businesses to successfully deploy, operate, and optimize Cisco technologies in their network. This model is sometimes also referred to as the Prepare, Plan, Design, Implement, Operate, and Optimize (PPDIOO) model, based on the names of the six phases of the network lifecycle. Network maintenance tasks are usually considered part of the operate and optimize phases of the cycle.

Determining Procedures and Tools to Support Maintenance Models

Those who decide to perform structured network maintenance either select one of the recommended network maintenance models or build a custom model that meets their particular needs by taking elements from different models. For example, if a company chooses the FCAPS model, it will have five main management tasks on its hands:

  • Fault management: Fault management is the domain where network problems are discovered and corrected. Although some of this effort is inevitably event driven, the focus here is on preventive maintenance. Proper steps are taken to prevent breakdowns and past incidents from recurring; hence, network downtime is minimized.

  • Configuration management: Configuration management is concerned with tasks such as installation, identification, and configuration of hardware (including components such as line cards, modules, memory, and power supplies) and services. Configuration management also includes software and firmware management, change control, inventory management, plus monitoring and managing the deployment status of devices.

  • Accounting management: Accounting management focuses on how to optimally distribute resources among enterprise subscribers. This helps to minimize the cost of operations by making the most effective use of the systems available. Cost distribution and billing of departments/users are also accounting management tasks.

  • Performance management: Performance management is about managing the overall performance of the enterprise network. The focus here is on maximizing throughput, identifying bottlenecks, and forming plans to enhance performance.

  • Security management: Security management is responsible for ensuring confidentiality, integrity, and availability (CIA). The network must be protected against unauthorized access and physical and electronic sabotage. Security management ensures CIA through authentication, authorization, and accounting (AAA), plus other techniques such as encryption, network perimeter protection, intrusion detection/prevention, and security monitoring and reporting.

Upon selection of a network maintenance model, you must translate the theoretical model to practical procedures that structure the network maintenance processes for your network. Figure 1-1 shows an example where four procedures are defined for the configuration management element of the FCAPS model.

Figure 1-1: Models, Procedures, and Tools

After you have defined your processes and procedures, it becomes much easier to see what functionalities and tools you need to have in your network management toolkit to support these processes. As a result, you can select an efficient and cost-effective network management and support toolkit that offers those tools and hopefully meets your budgetary constraints. Figure 1-1 shows a network management toolkit (in the rightmost column) that offers four tools, one for each of the defined procedures (in the middle column). It is noteworthy that an interrupt-driven network maintenance approach usually leads to a fragmented network management toolkit. The reason is that tools are acquired on an on-demand basis to deal with a particular need, instead of considering the toolkit as a whole and building it to support all the network maintenance processes.


Maintenance Processes and Procedures

Network maintenance involves many tasks. Some of these tasks are nearly universal, whereas others might be deployed by only some organizations or performed in unique ways. Processes such as maintenance planning, change control, documentation, disaster recovery, and network monitoring are common elements of all network maintenance plans. To establish procedures that fit an organization’s needs best, network engineers need to do the following:

  • Identify essential network maintenance tasks.

  • Recognize and describe the advantages of scheduled maintenance.

  • Evaluate the key decision factors that affect change control procedures to create procedures that fit the organization’s needs.

  • Describe the essential elements of network documentation and its function.

  • Plan for efficient disaster recovery.

  • Describe the importance of network monitoring and performance measurement as an integral element of a proactive network maintenance strategy.

Network Maintenance Task Identification

Regardless of the network maintenance model and methodology you choose or the size of your network, certain tasks must be included in your network maintenance plan. The amount of resources, time, and money you spend on these tasks will vary, however, depending on the size and type of your organization. All network maintenance plans need to include procedures to perform the following tasks:

  • Accommodating adds, moves, and changes: Networks are always undergoing changes. As people move and offices are changed and restructured, network devices such as computers, printers, and servers might need to be moved, and configuration and cabling changes might be necessary. These adds, moves, and changes are a normal part of network maintenance.

  • Installation and configuration of new devices: This task includes adding ports, link capacity, network devices, and so on. Implementation of new technologies or installation and configuration of new devices is handled by internal staff, by a different group within your organization, or by an external party.

  • Replacement of failed devices: Whether replacement of failed devices is done through service contracts or done in house by support engineers, it is an important network maintenance task.

  • Backup of device configurations and software: This task is linked to the task of replacing failed devices. Without good backups of both software and configurations, replacing failed equipment or recovering from severe device failures will not be trouble free and might take a long time.

  • Troubleshooting link and device failures: Failures are inevitable; diagnosing and resolving failures related to network components, links, or service provider connections are essential tasks within a network engineer’s job.

  • Software upgrading or patching: Network maintenance requires that you stay informed of available software upgrades or patches and use them if necessary. Critical performance problems or security vulnerabilities are often addressed by software upgrades or patches.

  • Network monitoring: Monitoring operation of the devices and user activity on the network is also part of a network maintenance plan. Network monitoring can be performed using simple mechanisms such as collection of router and firewall logs or by using sophisticated network monitoring applications.

  • Performance measurement and capacity planning: Because the demand for bandwidth is continually increasing, another network maintenance task is to perform at least some basic measurements to decide when it is time to upgrade links or equipment and to justify the cost of the corresponding investments. This proactive approach allows one to plan for upgrades (capacity planning) before bottlenecks form, congestion is experienced, or failures occur.

  • Writing and updating documentation: Preparing proper network documentation that describes the current state of the network for reference during implementation, administration, and troubleshooting is a mandatory network maintenance task within most organizations. Network documentation must be kept current.
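The capacity planning task in the list above amounts to extrapolating measured load toward a limit. The following is a minimal sketch, assuming monthly peak link utilization samples and a simple least-squares straight-line trend; the function name, the 80 percent threshold, and the sample values are hypothetical, and real traffic growth is rarely this linear.

```python
def months_until_threshold(monthly_peak_pct, threshold_pct=80.0):
    """Estimate months from the latest sample until the trend reaches the threshold.

    Fits a straight line through the samples (index = month number).
    Returns None if there are too few samples or utilization is not growing.
    """
    n = len(monthly_peak_pct)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(monthly_peak_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, monthly_peak_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or shrinking load: no upgrade date to forecast
    intercept = mean_y - slope * mean_x
    crossing_month = (threshold_pct - intercept) / slope
    # Report the distance from the most recent sample, never a negative value.
    return max(0.0, crossing_month - (n - 1))


# Hypothetical samples: peak utilization grew five points per month.
print(months_until_threshold([40, 45, 50, 55]))
```

An estimate like this gives a rough date to attach to an upgrade request, which supports the budgeting justification the task describes.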

Network Maintenance Planning

You must build processes and procedures for performing your network maintenance tasks; this is called network maintenance planning. Network maintenance planning includes the following:

Scheduling Maintenance

After you have determined the tasks and processes that are part of network maintenance, you can assign priorities to them. You can also determine which of these tasks will be interrupt driven by nature (hardware failures, outages, and so on) and which tasks are parts of a long-term maintenance cycle (software patching, backups, and so on). For the long-term tasks, you will have to work out a schedule that guarantees that these tasks will be done regularly and will not get lost in the busy day-to-day work schedule. For some tasks such as moves and changes, you can adopt a procedure that is partly interrupt driven (incoming change requests) and partly scheduled: change requests need not be handled immediately, but during the next scheduled timeframe. This allows you to properly prioritize tasks but still have a predictable lead time that the requesting party knows they can count on for a change to be executed. With scheduled maintenance, tasks that are disruptive to the network are scheduled during off-hours. You can select maintenance windows during evenings or weekends where outages will be acceptable, thereby reducing unnecessary outages during office hours. The uptime of the network will increase as both the number of unplanned outages and their duration will be reduced. To summarize, the benefits of scheduled maintenance include the following:

  • Network downtime is reduced.

  • Long-term maintenance tasks will not be neglected or forgotten.

  • You have predictable lead times for change requests.

  • Disruptive maintenance tasks can be scheduled during assigned maintenance windows, reducing downtime during production hours.
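The idea of deferring disruptive work to an assigned window can be sketched in a few lines. This is an illustration only, assuming a weekly schedule with windows on Saturday night and early Sunday morning; the window times and function names are hypothetical, not recommendations from the chapter.

```python
from datetime import datetime, timedelta

# Hypothetical weekly windows as (weekday, start_hour, end_hour); Monday is 0.
# Saturday 22:00-24:00 and Sunday 00:00-04:00.
MAINTENANCE_WINDOWS = [(5, 22, 24), (6, 0, 4)]


def in_maintenance_window(moment):
    """Return True if the given datetime falls inside an assigned window."""
    return any(
        moment.weekday() == day and start <= moment.hour < end
        for day, start, end in MAINTENANCE_WINDOWS
    )


def next_window_start(moment):
    """Return the earliest whole hour at or after `moment` inside a window."""
    candidate = moment.replace(minute=0, second=0, microsecond=0)
    if candidate < moment:
        candidate += timedelta(hours=1)
    for _ in range(7 * 24 + 1):  # one full week covers every weekly window
        if in_maintenance_window(candidate):
            return candidate
        candidate += timedelta(hours=1)
    return None  # only reached if no windows are defined


# A change request arriving on a Thursday morning waits until Saturday night.
request_time = datetime(2011, 5, 26, 10, 30)
print(in_maintenance_window(request_time), next_window_start(request_time))
```

Publishing the result of a lookup like this is one way to give requesters the predictable lead time mentioned above.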

Formalizing Change-Control Procedures

Sometimes it is necessary to make changes to configuration, software, or hardware. Any change that you make has an associated risk due to possible mistakes, conflicts, or bugs. Before making any change, you must first determine the impact of the change on the network and balance this against the urgency of the change. If the anticipated impact is high, you might need to justify the need for the change and obtain authorization to proceed. High-impact changes are usually made during maintenance windows specifically scheduled for this purpose. On the other hand, there will also have to be a process for emergency changes. For example, if a broadcast storm occurs in your network and a link needs to be disconnected to break the loop and allow the network to stabilize, you might not be able to wait for authorization and the next maintenance window. In many companies, change control is formalized and answers the following types of questions:

  • Which types of change require authorization and who is responsible for authorizing them?

  • Which changes have to be done during a maintenance window and which changes can be done immediately?

  • What kind of preparation needs to be done before executing a change?

  • What kind of verification needs to be done to confirm that the change was effective?

  • What other actions (such as updating documentation) need to be taken after a successful change?

  • What actions should be taken when a change has unexpected results or causes problems?

  • What conditions allow skipping some of the normal change procedures and which elements of the procedures should still be followed?
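A formalized change-control policy can be thought of as a small decision table that weighs impact against urgency. The sketch below is one hypothetical policy, not a standard: the field names, the two impact levels, and the resulting rules are assumptions chosen to mirror the questions above, and each organization would substitute its own answers.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    description: str
    impact: str       # "low" or "high" (hypothetical two-level scale)
    emergency: bool   # True when waiting would prolong an outage


@dataclass
class Handling:
    needs_authorization: bool
    timing: str                 # "immediately" or "maintenance window"
    update_documentation: bool  # kept True even when other steps are skipped


def decide_handling(change):
    """Apply the assumed policy to one change request."""
    if change.emergency:
        # Emergency path: act now, skip prior authorization, still document.
        return Handling(False, "immediately", True)
    if change.impact == "high":
        return Handling(True, "maintenance window", True)
    return Handling(False, "immediately", True)


loop = ChangeRequest("Disconnect link to stop broadcast storm", "high", True)
upgrade = ChangeRequest("Upgrade core switch software", "high", False)
print(decide_handling(loop))
print(decide_handling(upgrade))
```

Writing the policy down in this explicit form makes it easy to see which procedure elements survive the emergency path, here the documentation update.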

Establishing Network Documentation Procedures

An essential part of any network maintenance is building and keeping up-to-date network documentation. Without up-to-date network documentation, it is difficult to correctly plan and implement changes, and troubleshooting is tedious and time-consuming. Usually, documentation is created as part of network design and implementation, but keeping it up-to-date is part of network maintenance. Therefore, any good change-control procedure will include updating the relevant documentation after the change is made. Documentation can be as simple as a few network drawings, equipment and software lists, and the current configurations of all devices. On the other hand, documentation can be extensive, describing all implemented features, design choices that were made, service contract numbers, change procedures, and so on. Typical elements of network documentation include the following:

  • Network drawings: Diagrams of the physical and logical structure of the network

  • Connection documentation: Lists of all relevant physical connections, such as patches, connections to service providers, and power circuits

  • Equipment lists: Lists of all devices, part numbers, serial numbers, installed software versions, software licenses (if applicable), and warranty/service information

  • IP address administration: Lists of the IP subnet scheme and all IP addresses in use

  • Configurations: A set of all current device configurations or even an archive that contains all previous configurations

  • Design documentation: A document describing the motivation behind certain implementation choices

Establishing Effective Communication

Network maintenance is usually performed by a team of people and cannot easily be divided into exclusive sets of tasks that do not affect each other. Even if you have specialists who are responsible for particular technologies or sets of devices, they will always have to communicate with team members who are responsible for different technologies or other devices. The best means of communication depends on the situation and organization, but a major consideration for choosing a communication method is how easily it is logged and shared with the network maintenance team.

Communication is vital both during troubleshooting and technical support and afterward. During troubleshooting, certain questions must be answered, such as the following:

  • Who is making changes and when?

  • How does the change affect others?

  • What are the results of tests that were done, and what conclusions can be drawn?

If actions, test results, and conclusions are not communicated between team members, the process in the hands of one team member can be disruptive to the process handled by another team member. You don’t want to create new problems while solving others.

In many cases, diagnosis and resolution must be done by several persons or during multiple sessions. In those cases, it is important to have a log of actions, tests, communication, and conclusions. These must be distributed among all those involved. With proper communication, one team member should be able to comfortably take over where another team member has left off. Communication is also required after completion of troubleshooting or making changes.

Defining Templates/Procedures/Conventions (Standardization)

When a team of people executes the same or related tasks, it is important that those tasks be performed consistently. Because people might inherently have different working methods, styles, and backgrounds, standardization makes sure work performed by different people remains consistent. Even if two different approaches to the same task are both valid, they might yield inconsistent results. One of the ways to streamline processes and make sure that tasks are executed in a consistent manner is to define and document procedures; this is called standardization. Defining and using templates is an effective method of network documentation, and it helps in creating a consistent network maintenance process. The following are some of the types of questions answered by network conventions, templates, and best practices (standardization) documentation:

  • Are logging and debug time stamps set to local time or coordinated universal time (UTC)?

  • Should access lists end with an explicit “deny any”?

  • In an IP subnet, is the first or the last valid IP address allocated to the local gateway?

In many cases, you can configure a device in several different ways to achieve the same results. However, using different methods of achieving the same results in the same network can easily lead to confusion, especially during troubleshooting. Under pressure, valuable time can be wasted in verifying configurations that are assumed incorrect simply because they are configured differently.
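A documented convention is also something you can check mechanically. As a small illustration of the gateway-address question above, the following sketch uses Python’s standard `ipaddress` module to compute the first or last usable host address of a subnet and to compare a configured gateway against the chosen convention. The function names and the sample subnets are hypothetical, and the sketch assumes ordinary subnets with at least two usable hosts.

```python
import ipaddress


def conventional_gateway(subnet, convention="first"):
    """Return the gateway address a subnet should use under the convention."""
    network = ipaddress.ip_network(subnet)
    if network.num_addresses < 4:
        raise ValueError("subnet too small for a first/last host convention")
    if convention == "first":
        return network[1]    # first usable host after the network address
    if convention == "last":
        return network[-2]   # last usable host before the broadcast address
    raise ValueError("convention must be 'first' or 'last'")


def follows_convention(subnet, configured_gateway, convention="first"):
    """Return True if the configured gateway matches the documented convention."""
    expected = conventional_gateway(subnet, convention)
    return ipaddress.ip_address(configured_gateway) == expected


print(conventional_gateway("192.168.10.0/24", "first"))
print(conventional_gateway("192.168.10.0/24", "last"))
print(follows_convention("10.1.1.0/30", "10.1.1.2", "first"))
```

A check like this flags configurations that are merely different from the convention, so time under pressure is not spent wondering whether they are also wrong.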

Planning for Disaster Recovery

Although the MTBF for certain modern network devices is claimed to be 5, 7, or 10 years or more, you must always consider the possibility of device failure. By having a plan for such occasions and knowing what to do, you can significantly reduce the amount of downtime. One way to reduce the impact of failure is to build redundancy into the network at critical points and eliminate single points of failure. A single point of failure means that a single device or link does not have a backup and its failure can cause major damage to your network operation. However, mainly because of budgetary limitations, it is not always possible to make every single link, component, and device redundant. Disasters, natural and otherwise, must also be taken into account. For example, you could be struck by a disaster such as a flood or fire in the server room. The quicker you can replace failed devices and restore functionality, the quicker your network will be running again. To replace a failed device, you need the following items:

  • Replacement hardware

  • The current software version for the device

  • The current configuration for the device

  • The tools to transfer the software and configuration to the device

  • Licenses (if applicable)

  • Knowledge of the procedures to install software, configurations, and licenses

Missing any of the listed items severely affects the time it takes to replace the device. To make sure that you have these items available when you need them, follow these guidelines:

  • Replacement hardware: You either need to have spare devices or a service contract with a distributor or vendor that will replace the failed hardware. Typically, this means that you need documentation of the exact hardware part numbers, serial numbers, and service contract numbers for the devices.

  • Current software: Usually devices are delivered with a particular version of software, which is not necessarily the same as the version that you were running on the device. Therefore, you should have a repository where you store all current software versions in use on your network.

  • Current configuration: In addition to creating backups of your configurations any time you make a change, you need to have a clear versioning system so that you know which configuration is the most recent.

  • Tools: You need to have the appropriate tools to transfer software and configurations to the new device, which you should be able to do even if the network is unavailable.

  • Licenses: If your software requires a license, you need to have that license or know the procedure to obtain a new license.

  • Knowledge: Because these procedures are used infrequently, you might not have them committed to memory. Having all necessary documentation ready, however, will save time in executing the necessary procedures and will also decrease the risk of making mistakes.

In short, the key factors to a successful disaster recovery are defining and documenting recovery procedures and making sure you always have the necessary elements available in case a disaster strikes.
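The six replacement items lend themselves to a simple readiness audit per device. The following is a minimal sketch, assuming each device is described by a dictionary of yes/no flags; the key names, the `license_required` flag, and the sample record are hypothetical and only illustrate how the checklist could be tracked.

```python
# Items needed to replace a failed device, in the order listed above.
REQUIRED_ITEMS = (
    "replacement_hardware",
    "software_image",
    "configuration_backup",
    "transfer_tools",
    "licenses",
    "documented_procedures",
)


def missing_recovery_items(device_record):
    """Return the checklist items that are not yet in place for one device."""
    missing = []
    for item in REQUIRED_ITEMS:
        # Licenses count only when the device's software actually needs one.
        if item == "licenses" and not device_record.get("license_required", True):
            continue
        if not device_record.get(item, False):
            missing.append(item)
    return missing


# Hypothetical record for an access switch that needs no license.
access_switch = {
    "replacement_hardware": True,
    "software_image": True,
    "configuration_backup": False,
    "transfer_tools": True,
    "license_required": False,
    "documented_procedures": True,
}
print(missing_recovery_items(access_switch))
```

Running an audit like this periodically, rather than during an outage, is one way to find the missing backup before it costs recovery time.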

Network Monitoring and Performance Measurement

Another process that helps you transform your network maintenance process to a less interrupt-driven, more methodical approach is the implementation of network and performance monitoring. Ideally, you want to be able to spot potential issues before they develop into problems, and to be able to isolate problems faster when they occur. Gathering performance data enables you to upgrade before a resource shortage develops into a performance problem. Gathering performance data also helps in building a business case for investing in network upgrades. When you are committed to meeting the SLAs for the performance of your network, or if your service provider is guaranteeing you a certain level of service, monitoring network performance can assist you in determining whether those SLAs are met.

One essential step in network performance measurement and monitoring is choosing the variables to be monitored and measured, including interface status, interface load, CPU load, and memory usage of your devices. Also, more sophisticated metrics such as measurements of network delay, jitter, or packet loss can be included in a network monitoring and performance measurement policy. The network performance measurement and monitoring policy and the corresponding choices of metrics will differ for each organization and need to be aligned to the business requirements.
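Once the variables are chosen, a monitoring policy largely comes down to comparing samples against agreed limits. The sketch below assumes one polled sample per device, represented as a dictionary; the metric names and threshold values are hypothetical and would be set from each organization’s business requirements rather than taken from this example.

```python
# Hypothetical policy: alert when a metric exceeds its limit (percent).
THRESHOLDS = {
    "interface_load_pct": 80.0,
    "cpu_load_pct": 75.0,
    "memory_used_pct": 85.0,
}


def check_sample(sample):
    """Return a list of alert messages for one polled sample."""
    alerts = []
    # Interface status is a yes/no variable rather than a threshold.
    if not sample.get("interface_up", True):
        alerts.append("interface down")
    for metric, limit in THRESHOLDS.items():
        value = sample.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric} {value:.1f} exceeds {limit:.1f}")
    return alerts


# Hypothetical sample from an edge router.
sample = {
    "interface_up": True,
    "interface_load_pct": 91.5,
    "cpu_load_pct": 40.0,
    "memory_used_pct": 85.0,
}
print(check_sample(sample))
```

Keeping the thresholds in one table makes the policy itself easy to review and to align with SLA targets as they change.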

