On Tuesday night (12/02/2008), one hard disk on Toyota’s server went down. The disk was part of a mirror built with Veritas Volume Manager. Toyota is my office’s biggest and most demanding customer, so many people in my office got busy with this incident. Since Tuesday night I, as the support engineer, had prepared a manual procedure for replacing the failed disk. But guess what? They preferred to ask the Veritas engineer rather than follow my procedure. So why bother calling me to report the problem? Why didn’t they just ask the expert? I think engineers like me are only used for physical work, besides troubleshooting.
Maybe all the managers in my office learned this lesson: “don’t trust your engineer, ask the expert only”.
Finally, we replaced the broken disk on Wednesday night. With all eyes on me, I carried out the replacement procedure. The broken disk was the rootmirror disk in a SunFire V890. The procedure for replacing a rootmirror disk that is managed by Veritas Volume Manager (VxVM) goes like this:
- Check the status of all disks managed by Veritas Volume Manager using:
# vxdisk list
Use this command to locate the disk that failed. See the following illustration:
- Check whether the failed disk can be removed from the operating system (in Toyota’s case, the OS is Solaris 9/04):
# luxadm remove_device /dev/rdsk/c1t2d0s2
If that command cannot be executed, you may just pull the disk out of the server. After that, you must replace it with the new disk as soon as possible.
- Run devfsadm and cfgadm -c configure c2 from Solaris so that the system recognizes the new disk. In Toyota’s case, the failed disk was located at c1t2d0. See the following illustration, taken before and after the OS recognized the new disk:
- List all the disks using the format command. Once the new disk appears, you must label it.
- Run vxdctl (typically as vxdctl enable) so that VxVM rescans all the devices connected to the server, including the newly attached ones.
- Run vxdiskadm (the disk configuration menu of VxVM) using this command:
# vxdiskadm
See the following illustration:
- Choose the fourth menu item in vxdiskadm, "Remove a disk for replacement", by typing “4” at the vxdiskadm menu. The program lists the failed disk that has been removed. Choose that disk for the removal process. In Toyota’s case, the failed disk was rootmirror01. See the following illustration:
- After that, choose the fifth menu item, "Replace a failed or removed disk", by typing “5” at the vxdiskadm menu. Running this lets VxVM detect that the server has a new disk that has not yet been assigned to Veritas Volume Manager. In Toyota’s case, VxVM detected c1t2d0 as the new, unassigned disk. It will ask whether you want to assign c1t2d0 as rootmirror01. When VxVM asks whether to encapsulate the disk, you must decline. Instead of encapsulating the new disk, answer yes when VxVM offers to initialize it. See the following illustration:
- Follow the rest of the vxdiskadm prompts; don’t worry, it’s easy. After that, quit the menu. The new disk has now been added to the VxVM configuration, and VxVM starts the synchronization between rootmirror01 and rootdisk01. Compare the list of all disks after the disk replacement:
- The synchronization took about 2 hours for the 146 GB disks. You can check its progress using this command:
# vxtask list
See the following example:
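To put the 2 hours for 146 GB into perspective, here is a small sketch that turns those two numbers into an average resync rate and a rough estimate for another disk size. The function names sync_rate_mb_s and estimate_sync_hours are my own helpers for illustration, not VxVM commands. The 300 GB disk is a made-up example, and the estimate assumes the rate stays the same and that 1 GB = 1000 MB, which will not hold exactly on a busy server.

```shell
#!/bin/sh
# Average resync rate in MB/s for a disk of SIZE_GB that took HOURS to sync.
# Assumption: decimal units (1 GB = 1000 MB).
sync_rate_mb_s() {
    awk -v gb="$1" -v hours="$2" \
        'BEGIN { printf "%.1f\n", (gb * 1000) / (hours * 3600) }'
}

# Rough resync time in hours for a disk of SIZE_GB, scaled linearly from a
# reference disk (REF_GB synced in REF_HOURS). Assumption: constant rate.
estimate_sync_hours() {
    awk -v gb="$1" -v ref_gb="$2" -v ref_hours="$3" \
        'BEGIN { printf "%.1f\n", gb * ref_hours / ref_gb }'
}

sync_rate_mb_s 146 2           # figures from this replacement -> 20.3
estimate_sync_hours 300 146 2  # hypothetical 300 GB disk      -> 4.1
```

So the mirror was rebuilt at roughly 20 MB/s on average. Treat the second number only as a ballpark for planning a maintenance window; vxtask list remains the way to see the real progress.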
Tirta,
Are you sure you can share stuff like this in this kind of public media?
Wouldn’t it be dangerous somehow?
Hehe. Btw, it’s the same at my company, though apparently not as bad as in your story.
@ sahat : yes, of course I can share it in public media. It’s just a disk configuration…I think of it as just an example to explain how to replace a failed disk in Veritas Volume Manager. It’s a rare case, so it’s difficult to come across this kind of problem. Dangerous? Please explain what kind of danger there could be.
I don’t know, maybe your company’s rivals could take advantage of these lessons. With a simple click, they could take down your configuration and cause outages.
In the telecommunications area, this kind of knowledge sharing is definitely harmful, since anybody can connect to the whole network.
@ sahat : thanks for your advice, but I think it applies to the telecommunications world. In data center management (like this case) you must look at the facts: someone who wants to bring the system down must know several important things:
– Firewall configuration
– IP address
– root password
But thanks anyway for your attention.
long time no read.. stumbled upon this, it made my head spin wekekekek..
@ masliliks : hello…it’s been a long time since I saw a comment from you 😀 thanks for coming