Sunday, 1 July 2012

SSA Drive Replacement



Overview

This document outlines the generic procedure for finding and replacing a failed SSA drive. Most SSA setups at company are software mirrored, with two drives per volume group, and this document deals primarily with that configuration. RAID 5 is currently in use on only one host, but it will become more common in new installations as hardware that supports RAID 5 volume groups under HACMP arrives. While most of the steps are similar, this document does not cover RAID 5 at the time of this writing, and care should be taken when applying this procedure to RAID 5 disk replacements.

Prerequisites

  • A replacement SSA drive cartridge of the same type and size as the original.
  • sudo access on the affected server (and on the failover node, for HACMP clusters)

Procedure

  1. Identify the Failed Disk
    1. Issue sudo /usr/sbin/ssaidentify -l pdiskname -y, using the pdisk name reported in the errpt error, to flash the identify light on the suspect disk.
    2. Locate the SSA drawer(s) attached to the server issuing the error, and find the disk unit with the amber LED blinking. This is the failed disk.
    3. Sometimes there will be only a generic SSA-related error on the system in question, with no indication of which drive failed. A common symptom is hundreds of SSA OPEN LINK errors in the output of errpt | more .
    4. In this case, use sudo diag to run Diagnostic Routines -- System Verification on all SSA and pdisk devices; the diagnostics will report which device is not working properly. A short example session is shown below.
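    For reference, a minimal identification session might look like the following. The device name pdisk3 is hypothetical; substitute the pdisk reported by errpt or diag:
      sudo /usr/sbin/ssaidentify -l pdisk3 -y    # start flashing the identify LED on pdisk3
      errpt | grep -ci "OPEN LINK"               # count SSA OPEN LINK errors in the error log
      sudo /usr/sbin/ssaidentify -l pdisk3 -n    # stop the identify LED once the disk is located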
  2. Remove the Failed Disk From the LVM
    1. Both the pdisk device and the associated hdisk device need to be removed: the hdisk from the LVM, and both device definitions from the ODM. (A consolidated removal example appears at the end of this step.)
    2. Use smit devices to run SSA Disks -- SSA Physical Disks -- Show Physical to Logical SSA Disk Relationship to find the hdisk device that corresponds with the pdisk device.
    3. In cases where the drive has failed completely, the mapping of hdisk to pdisk may have been lost. In such cases, identify the "orphaned" hdisk using smit devices to run SSA Disks -- SSA Logical Disks -- Show Logical to Physical SSA Disk Relationship; the hdisk that no longer maps to any pdisk is the orphan.
    4. Identify the Volume Group that the hdisk device belongs to with: lspv | grep hdiskname.
    5. Unmirror the Volume Group using: sudo unmirrorvg VGname hdiskname
    6. Remove the failed disk from the Volume Group with: sudo reducevg VGname hdiskname
    7. Remove the pdisk device from the ODM: sudo rmdev -l pdiskname -d
    8. Remove the hdisk device from the ODM: sudo rmdev -l hdiskname -d
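    Putting this step together, and assuming the failed pair is pdisk3/hdisk4 in volume group datavg (all names hypothetical), the removal sequence would be:
      lspv | grep hdisk4                # confirm which volume group the disk belongs to
      sudo unmirrorvg datavg hdisk4     # drop the mirror copies held on the failed disk
      sudo reducevg datavg hdisk4       # remove the disk from the volume group
      sudo rmdev -l pdisk3 -d           # delete the pdisk definition from the ODM
      sudo rmdev -l hdisk4 -d           # delete the hdisk definition from the ODM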
  3. Replace the Failed Disk
    1. Physically remove the failed disk identified previously from the SSA drawer. In the common model 7133 drawers, there is a release button located underneath the red pull tab on the face of the cartridge. Push the release and pull the red pull tab to remove the drive from the system.
    2. Release the same handle on the replacement cartridge and insert into the system. Make sure the handle locks and the drive is firmly seated.
  4. Integrate the New Disk
    1. Reestablish the old hdisk and pdisk designations: sudo cfgmgr.
    2. Verify that the old hdisk and pdisk designations are being used: lsdev -C | egrep "(hdiskname|pdiskname)". Note that the pattern must be quoted, or the shell will misinterpret the parentheses.
    3. If the original hdisk and pdisk devices were not reclaimed, they were not correctly deleted in the first place. Delete all the hdisk and pdisk devices, both old and new, and run cfgmgr again.
    4. Add the new disk back into the Volume Group: sudo extendvg VGname hdiskname.
    5. Restart mirroring on the Volume Group: sudo mirrorvg -m VGname hdiskname.
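    Continuing the hypothetical pdisk3/hdisk4/datavg example, reintegration would look like:
      sudo cfgmgr                               # rediscover the new drive under the old device names
      lsdev -C | egrep "(hdisk4|pdisk3)"        # confirm both devices are back and Available
      sudo extendvg datavg hdisk4               # return the disk to the volume group
      sudo mirrorvg -m datavg hdisk4            # remirror the volume group onto the new disk
      lsvg -l datavg                            # verify the logical volumes show mirrored copies again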
  5. Integrate the New Disk on the HA Node. (Optional)
    1. If the SSA disk was installed in a shared drawer in an HACMP cluster, it is now necessary to reinstate it on the failover node to ensure proper failovers.
    2. Schedule an emergency maintenance outage.
    3. Transfer any users to backup servers if necessary, depending on the application or database that was affected by this disk failure. You may need to consult with Database Support and/or Applications Support.
    4. Shut down the application or database that is using the affected volume group. This ensures a quiesced system to minimize the chance of problems during the procedure.
    5. Remove the disk locks on the primary node: sudo varyonvg -b -u VGname
    6. Remove the volume group definition from the failover node: sudo exportvg VGname
    7. Delete the hdisk and pdisk devices for the failed SSA disk on the failover node:
      sudo rmdev -l hdiskname -d
      sudo rmdev -l pdiskname -d
    8. Detect the replacement SSA disk on the failover node: sudo cfgmgr -v
    9. Detect the new PVID for the replacement SSA disk on the failover node:
      sudo chdev -l hdiskname -a pv=yes
    10. Import the volume group on the failover node: sudo importvg -V MajorNumberFromPrimaryNode -y VGname hdiskname. (The volume group's major number can be found on the primary node with ls -l /dev/VGname.)
    11. Modify the volume group to remain dormant at boot on the failover node: sudo chvg -a n -Q n -x n VGname
    12. Vary off the volume group on the failover node: sudo varyoffvg VGname
    13. Reset the disk locks on the primary node: sudo varyonvg VGname
    14. Restart any applications or data servers and notify the users that they can resume using the application.
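    Again using the hypothetical names pdisk3, hdisk4, and datavg, and assuming datavg has major number 45 on the primary node, the failover-node sequence would be:
      # On the primary node:
      sudo varyonvg -b -u datavg               # break the disk reservations
      # On the failover node:
      sudo exportvg datavg                     # remove the stale volume group definition
      sudo rmdev -l hdisk4 -d                  # delete the old hdisk definition
      sudo rmdev -l pdisk3 -d                  # delete the old pdisk definition
      sudo cfgmgr -v                           # detect the replacement disk
      sudo chdev -l hdisk4 -a pv=yes           # read in the new PVID
      sudo importvg -V 45 -y datavg hdisk4     # import the VG with the primary's major number
      sudo chvg -a n -Q n -x n datavg          # keep the volume group dormant at boot
      sudo varyoffvg datavg                    # vary the volume group off again
      # Back on the primary node:
      sudo varyonvg datavg                     # restore the disk reservations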