Don’t Trust Your Engineer

On Tuesday night (12/02/2008), one hard disk on Toyota’s server went down. The disk was part of a mirror built with Veritas Volume Manager. Toyota is my office’s biggest and most demanding customer, so many people in my office got busy with this incident. Since Tuesday night I, as the support engineer, had prepared a manual procedure for replacing the failed disk. But guess what? They preferred to ask the Veritas engineer rather than follow my procedure. So why bother calling me to report the problem? Why didn’t they just ask the expert? I guess engineers like me are only used for physical work, aside from troubleshooting.

Maybe all the managers in my office took this lesson from it: “don’t trust your engineer, ask the expert only”.

Finally, we replaced the broken disk on Wednesday night. With all eyes on me, I executed the replacement procedure. The broken disk was the rootmirror disk in a SunFire V890. The procedure for replacing a rootmirror disk managed by Veritas Volume Manager (VxVM) goes like this:

  1. Check the status of all disks managed by Veritas Volume Manager:
    # vxdisk list
    Use this command to find out which disk has failed. See the following illustration (vxvm1).
  2. Check whether the failed disk can be removed from the operating system (in Toyota’s case, the OS is Solaris 9/04):
    # luxadm remove_device /dev/rdsk/c1t2d0s2
    If that command cannot be executed, you may simply pull the disk out of the server. After that, replace it with the new disk as soon as possible.
  3. Run devfsadm and cfgadm -c configure c2 from Solaris so the system recognizes the new disk. In Toyota’s case, the failed disk was located at c1t2d0. See the following illustrations from before and after the OS recognized the new disk (vxvm2, vxvm3).
  4. Check all the disks using the format command. If the new disk appears, label it.
  5. Run vxdctl enable, the command that makes VxVM rescan all devices connected to the server (including newly attached ones).
  6. Start vxdiskadm (the VxVM disk configuration menu) with this command:
    # vxdiskadm
    See the following illustration (vxvm4).
  7. Choose the fourth menu item in vxdiskadm, "Remove a disk for replacement", by typing “4” at the vxdiskadm menu. The program will show the failed disk that has been removed. Choose that disk for the removal process. In Toyota’s case, the failed disk was rootmirror01. See the following illustration (vxvm5).
  8. Next, choose the fifth menu item, "Replace a failed or removed disk", by typing “5” at the vxdiskadm menu. Through this option VxVM detects that the server has a new disk not yet assigned to Veritas Volume Manager. In Toyota’s case, VxVM detected that c1t2d0 was the new, unassigned disk. It will ask whether you want c1t2d0 to become rootmirror01. When VxVM asks whether to encapsulate the disk, decline. Instead, answer yes when VxVM offers to initialize the new disk. See the following illustration (vxvm6).
  9. Follow the rest of the vxdiskadm prompts; don’t worry, it’s easy. Then quit the menu. The new disk has now been added to the VxVM configuration, and VxVM starts synchronizing rootmirror01 with rootdisk01. Compare the list of all disks after the disk replacement:
    vxvm7
  10. The synchronization took about 2 hours for the 146 GB disks. We can check its progress with this command:
    # vxtask list
    See the following example (vxvm8).
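
Step 1 above relies on reading the vxdisk list output by eye. As a rough sketch, the same check could be scripted. This assumes the failed disk shows up as a row whose STATUS column reads "failed was:<device>", which is how VxVM commonly reports it; the function name find_failed_disks and the sample output are illustrative only and were not captured from the server in this post.

```shell
#!/bin/sh
# Sketch: pick out failed VxVM disks from `vxdisk list` output.
# Assumption: a failed disk is reported on a row such as
#   -  -  rootmirror01  rootdg  failed was:c1t2d0s2

find_failed_disks() {
    # Reads `vxdisk list` output on stdin.
    # Prints "<disk media name> <old device>" for every failed disk.
    awk '$5 == "failed" && $6 ~ /^was:/ {
        sub(/^was:/, "", $6)
        print $3, $6
    }'
}

# Hypothetical sample output, for illustration only.
sample_output='DEVICE       TYPE            DISK         GROUP        STATUS
c1t0d0s2     auto:sliced     rootdisk01   rootdg       online
-            -               rootmirror01 rootdg       failed was:c1t2d0s2'

printf '%s\n' "$sample_output" | find_failed_disks
```

On a real server the function would be fed the live output instead, for example with vxdisk list | find_failed_disks. If the output format on your VxVM version differs, the awk pattern needs adjusting.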

6 thoughts on “Don’t Trust Your Engineer”

  1. Tirta,

    Are you sure you can share stuff like this on this kind of public media?
    Wouldn’t it be dangerous somehow?

    Hehe. Btw, it’s the same in my company. But apparently not as bad as your story.

  2. @ sahat : yes, of course I can share it on public media. It’s just a disk configuration… I think it’s just an example explaining how to replace a failed disk in Veritas Volume Manager. It’s a rare case, so this kind of problem is hard to come by. Dangerous? Please explain what kind of danger there could be.

  3. I don’t know, maybe a rival of your company could take advantage of these lessons. With a simple click, they could just take down your configuration and cause outages.

    In the telecommunications area, this kind of knowledge sharing is definitely harmful, since anybody can connect to the whole network.

  4. @ sahat : thanks for your advice, but I think that applies to the telecommunications world. In data center management (like this case), you have to face the fact that someone who wants to bring the system down must know several important things:
    – Firewall configuration
    – IP address
    – root password

    But thanks anyway for your attention.
