Module 7: Server Cluster Maintenance and Troubleshooting

Contents
Overview
Cluster Maintenance
Troubleshooting Cluster Service
Lab A: Cluster Maintenance
Review

Instructor Notes
Presentation: 45 Minutes
Lab: 15 Minutes

This module prepares students to back up and restore a server cluster successfully. Students need to know how to use the tools available for troubleshooting server cluster problems. The module covers common Cluster service problems and possible resolutions.

After completing this module, students will be able to:
• Perform the steps to successfully back up a server cluster.
• Perform the steps to successfully restore a server cluster.
• Evict a node from a server cluster.
• Identify the tools that are necessary to troubleshoot a cluster failure.
• Interpret the entries in the cluster log.
• Identify and troubleshoot common server cluster failures: network communication problems, small computer system interface (SCSI) configuration problems, and group, resource, and quorum failures.

Materials and Preparation
This section provides the materials and preparation tasks that you need to teach this module.

Required Materials
To teach this module, you need the Microsoft® PowerPoint® file 2087A_02.ppt.

Preparation Tasks
To prepare for this module, you should:
• Read the materials for this module and anticipate questions students may ask.
• Read Q224075, Q257892, Q248998, Q172951, Q266274, Q234767, Q193890, Q245762, and “Interpreting MSCS Cluster Log” on the Student compact disc.
• Be familiar with the Resource Kit utilities.
• Practice the labs.
• Study the review questions and prepare alternative answers for discussion.

Module Strategy
Use the following strategy to present this module:
Because backing up the cluster is a key maintenance task, the first section begins with information on how to back up the cluster configuration files.
The following pages cover the complete procedure for restoring an entire cluster after a catastrophic failure. You can also use each topic as a separate procedure for performing a specific task. The troubleshooting section lists the tools available for troubleshooting Cluster service and gives common problems and suggested resolutions.

Cluster Maintenance
Cluster service is self-tuning and requires no maintenance other than daily backups.
• Backup: Backing up the system state backs up the cluster configuration files; however, you also need to back up each node’s data and operating system, as well as the cluster disks.
• Restoring the First Node: This page outlines the overall procedure for restoring a cluster and covers its first step, restoring the operating system on the first node. The remaining steps are covered in detail on the following pages.
• Restoring Cluster Disks: Cluster service uses the disk signature to identify a cluster disk. To replace a cluster disk, you must write the old disk’s signature onto the new disk.
• Restoring the Second Node: Restoring the remaining nodes of the cluster is similar to restoring the first node, except that after each node is restored, you need to test the failover capabilities of the cluster before putting the cluster back into the production environment.
• Evicting a Node: You evict a node manually by using Cluster Administrator. As always, make sure you have a good backup of the server before you evict the node.

Troubleshooting Cluster Service
The key point of this section is to give students the tools and techniques that reduce the time it takes to find the root cause of common Cluster service problems.
• Troubleshooting Tools: The tools used to troubleshoot a problem with Cluster service are the same tools used to troubleshoot a server running Microsoft Windows® 2000.
• Examining the Cluster Log: Cluster service records every configuration change and every problem in the cluster log. Students should become familiar with the syntax of the log entries.
• Troubleshooting Network Communications: Students need to know that the troubleshooting path differs depending on whether the problem is in node-to-node or client-to-node communication.
• SCSI Configuration Problems: SCSI is less reliable than Fibre Channel. Problems can arise with the SCSI controller, SCSI termination, and SCSI cabling.
• Group and Resource Failures: Remind students to keep dependency trees vertical, so that when a resource fails, it is easier to identify which resource is the root cause of the group failure.
• Quorum Log Corruption: If Cluster service cannot write information to the quorum log, it will not start. You can attempt to reset the quorum log, or you can delete the quorum log and let Cluster service create a new one.
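To make the Restoring Cluster Disks topic concrete for students: on an MBR-partitioned disk, the disk signature is the 4-byte value stored at byte offset 440 (0x1B8) of the first sector. The following minimal sketch works only on in-memory 512-byte MBR images, never on real devices, and shows what "writing the old disk’s signature onto the new disk" means at the byte level. The helper names (`read_signature`, `copy_signature`, `make_mbr`) and the signature values are hypothetical; on a production cluster, use a supported tool rather than raw sector writes.

```python
import struct

SECTOR_SIZE = 512
SIGNATURE_OFFSET = 0x1B8  # 440: 4-byte little-endian disk signature in the MBR
BOOT_MARKER_OFFSET = 0x1FE  # 510: bytes 0x55 0xAA mark a valid MBR


def read_signature(mbr: bytes) -> int:
    """Return the disk signature stored in a 512-byte MBR image."""
    if len(mbr) != SECTOR_SIZE:
        raise ValueError("MBR image must be exactly 512 bytes")
    if mbr[BOOT_MARKER_OFFSET:BOOT_MARKER_OFFSET + 2] != b"\x55\xaa":
        raise ValueError("missing 0x55AA boot marker; not a valid MBR")
    (signature,) = struct.unpack_from("<I", mbr, SIGNATURE_OFFSET)
    return signature


def copy_signature(old_mbr: bytes, new_mbr: bytes) -> bytes:
    """Return a copy of new_mbr that carries old_mbr's disk signature."""
    signature = read_signature(old_mbr)
    read_signature(new_mbr)  # validate the replacement disk's MBR as well
    patched = bytearray(new_mbr)
    struct.pack_into("<I", patched, SIGNATURE_OFFSET, signature)
    return bytes(patched)


def make_mbr(signature: int) -> bytes:
    """Build a minimal, otherwise empty MBR image for demonstration only."""
    mbr = bytearray(SECTOR_SIZE)
    struct.pack_into("<I", mbr, SIGNATURE_OFFSET, signature)
    mbr[BOOT_MARKER_OFFSET:BOOT_MARKER_OFFSET + 2] = b"\x55\xaa"
    return bytes(mbr)


if __name__ == "__main__":
    old_disk = make_mbr(0x1A2B3C4D)  # hypothetical failed cluster disk
    new_disk = make_mbr(0x99887766)  # hypothetical replacement disk
    restored = copy_signature(old_disk, new_disk)
    print(hex(read_signature(restored)))  # 0x1a2b3c4d
```

Only the four signature bytes change; everything else in the replacement disk's first sector is left as it was, which is why Cluster service can then recognize the new disk as the old cluster disk.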
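To help students practice the cluster log syntax covered in Examining the Cluster Log, the following sketch splits one log line into its fields. The layout is an assumption based on Windows 2000 cluster logs: an 8-digit hexadecimal process ID and thread ID separated by a period, then two colons, a timestamp, an optional bracketed component abbreviation such as [FM], and the message text. Check the pattern against a real log before relying on it. The sample line is constructed for illustration, and `parse_line` and `LogEntry` are hypothetical names.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumed layout: PPPPPPPP.TTTTTTTT::YYYY/MM/DD-HH:MM:SS.mmm [COMP] message
LOG_LINE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?:\[(?P<component>[A-Za-z]+)\]\s+)?"  # component tag is optional
    r"(?P<message>.*)$"
)


@dataclass
class LogEntry:
    process_id: int
    thread_id: int
    timestamp: datetime
    component: Optional[str]
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one cluster log line into fields; return None if it doesn't match."""
    match = LOG_LINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    return LogEntry(
        process_id=int(match["pid"], 16),
        thread_id=int(match["tid"], 16),
        timestamp=datetime.strptime(match["ts"], "%Y/%m/%d-%H:%M:%S.%f"),
        component=match["component"],  # None when the line has no [TAG]
        message=match["message"],
    )


if __name__ == "__main__":
    # Hypothetical sample line constructed for illustration.
    sample = "00000544.000005a0::2001/03/27-21:07:23.703 [FM] Sample message text"
    entry = parse_line(sample)
    if entry is not None:
        print(entry.component, hex(entry.thread_id))  # FM 0x5a0
```

Grouping parsed entries by process and thread ID, or filtering them by component, mirrors how you would read the raw log by hand when you trace a single failure.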
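The advice under Group and Resource Failures can be illustrated with a small sketch. It assumes that a group is described as a map from each resource to the resources it depends on (the resource names below are hypothetical), and that a resource goes offline when a resource it depends on fails. Under those assumptions, a failed resource that has no failed dependencies is where the failure most likely started. If the failures lie along a single vertical chain, exactly one candidate remains. If independent branches fail at the same time, the function returns several, and that ambiguity is what vertical trees help avoid.

```python
from typing import Dict, List, Set


def root_cause_candidates(
    dependencies: Dict[str, List[str]], failed: Set[str]
) -> List[str]:
    """Return failed resources whose own dependencies are all healthy.

    dependencies maps each resource to the resources it depends on.
    A failed resource with no failed dependency is the most likely
    starting point; resources above it failed because it did.
    """
    candidates = []
    for resource in sorted(failed):
        deps = dependencies.get(resource, [])
        if not any(dep in failed for dep in deps):
            candidates.append(resource)
    return candidates


if __name__ == "__main__":
    # Hypothetical file-share group; each resource lists what it depends on.
    group = {
        "Physical Disk": [],
        "IP Address": [],
        "Network Name": ["IP Address"],
        "File Share": ["Network Name", "Physical Disk"],
    }
    failed = {"File Share", "Network Name", "IP Address"}
    print(root_cause_candidates(group, failed))  # ['IP Address']
```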
