SHIFT

--- Sjoerd Hooft's InFormation Technology ---


Replace a Disk on a NetApp Filer

If you need to replace a disk in one of your NetApp filers (for example because it is faulty), you can use the disk replace command:

disk replace start [-f] [-m] <disk_name> <spare_disk_name>
  • -f: skip confirmation
  • -m: allow mixing disks with different characteristics. The target disk may then have a rotational speed that does not match that of the majority of disks in the aggregate, and it may come from the opposite spare pool.

The disk replace command uses Rapid RAID Recovery to copy data from the specified file system disk to the specified spare disk. At the end of that process, the roles of the two disks are reversed: the spare disk replaces the file system disk in the RAID group, and the file system disk becomes a spare.
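If you script this step, the syntax above can be assembled from its parts. The following is a minimal Python sketch; the function name and its keyword arguments are my own (hypothetical) and it only builds the command string, it does not send anything to a filer.

```python
def build_disk_replace_start(disk_name, spare_disk_name, force=False, allow_mixing=False):
    """Return a 'disk replace start' command line as a string.

    force        -> adds -f (skip confirmation)
    allow_mixing -> adds -m (allow disks with different characteristics)
    """
    parts = ["disk", "replace", "start"]
    if force:
        parts.append("-f")
    if allow_mixing:
        parts.append("-m")
    parts += [disk_name, spare_disk_name]
    return " ".join(parts)


# Disk names taken from the example further down this page.
print(build_disk_replace_start("1a.73", "1a.60"))
```

Keeping -f off by default means the filer still asks for confirmation, which is probably what you want when replacing disks by hand.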

The process can be stopped with:

disk replace stop <disk_name>

Recognizing a Failed Disk

Sometimes a disk no longer functions well but has not reported a failure yet. In the NetApp OnCommand manager, it looks like this:

netappdiskreplace02.jpg

As you can see, one disk is running at 100% while the others are not. Since they are all in the same aggregate, and NetApp uses WAFL for its file layout, all disks should show roughly the same usage percentage (unless you have hot spots, but even then I would not expect a difference like this).
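The same reasoning can be written down as a simple check: flag a disk whose utilization is far above the median of the other disks in the aggregate. This is only a sketch in Python. The thresholds (ratio and floor) are arbitrary assumptions, and the sample percentages are made up for illustration; they are not measurements from the screenshot.

```python
from statistics import median


def find_hot_disks(utilization, ratio=2.0, floor=80.0):
    """Return disks that are at least `floor` percent busy AND at least
    `ratio` times busier than the median disk in the same aggregate."""
    if not utilization:
        return []
    mid = median(utilization.values())
    return sorted(
        disk for disk, pct in utilization.items()
        if pct >= floor and pct >= ratio * mid
    )


# Illustrative numbers only (disk name -> utilization in percent).
sample = {"1a.73": 100.0, "1d.42": 31.0, "1d.40": 28.0, "1a.56": 33.0, "1a.25": 30.0}
print(find_hot_disks(sample))
```

Using the median rather than the average keeps the one busy disk from pulling up the value it is compared against.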

Monitoring and Removing the Replaced Disk

Monitoring

You can monitor the progress with “sysconfig -r”. The output looks like this:

Aggregate aggr1 (online, raid_dp) (block checksums)
  Plex /aggr1/plex0 (online, normal, active, pool0)
    RAID group /aggr1/plex0/rg0 (normal, block checksums)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   1a.39   1a    2   7   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      parity    1a.27   1a    1   11  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.42   1d    2   10  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.40   1d    2   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.55   1d    3   7   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.56   1a    3   8   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.25   1a    1   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.75   1d    4   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.73   1a    4   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (replacing, copy in progress)
      -> copy   1a.60   1a    3   12  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (copy 0% completed)
      data      1d.72   1d    4   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.58   1d    3   10  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.23   1d    1   7   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.71   1a    4   7   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.43   1d    2   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304

You can also monitor the progress using aggr status -r aggr1. This will give you roughly the same output.
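If you capture this output in a script, the copy percentage can be pulled out of the "-> copy" line. The sketch below is in Python and assumes the line format is exactly as shown in the output above; other ONTAP versions may print it differently. The function name is hypothetical.

```python
import re

# Matches lines such as:
#   -> copy   1a.60   1a    3   12  FC:B ... (copy 0% completed)
COPY_LINE = re.compile(
    r"->\s+copy\s+(?P<device>\S+)\s.*\(copy\s+(?P<percent>\d+)%\s+completed\)"
)


def copy_progress(output):
    """Map each copy-target device to its completion percentage."""
    progress = {}
    for line in output.splitlines():
        match = COPY_LINE.search(line)
        if match:
            progress[match.group("device")] = int(match.group("percent"))
    return progress


# Three lines copied from the sysconfig -r output above.
sample = """\
      data      1a.73   1a    4   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (replacing, copy in progress)
      -> copy   1a.60   1a    3   12  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (copy 0% completed)
      data      1d.72   1d    4   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
"""
print(copy_progress(sample))  # {'1a.60': 0}
```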

Remove the Replaced Drive

Once the copy has completed, you will want to remove the drive. To help identify the drive, you can make its red LED blink or light up so that it is obvious to the person who will be pulling the drive:

priv set advanced
blink_on 0c.32

    or

led_on 0c.32
priv set admin

As you can see, blink_on and led_on are privileged commands. Also note that these commands only have an effect for a little while: after some time (I'm not sure exactly how long) the red LED goes off again.

Note: ONTAP 8.1 broke the led_on and blink_on commands.

For more information see: https://kb.netapp.com/support/index?page=content&id=1010831
Broken led_on and blink_on commands

Example

Use disk replace to replace a faulty disk:

filer01a> disk replace start 1a.73 1a.60
*** You are about to copy and replace the following file system disk ***
  Disk /aggr1/plex0/rg0/1a.73

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      data      1a.73   1a    4   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
***
Really replace disk 1a.73 with 1a.60? y
disk replace: Disk 1a.73 was marked for replacing.

Monitor progress:

filer01a> sysconfig -r
Aggregate aggr1 (online, raid_dp) (block checksums)
  Plex /aggr1/plex0 (online, normal, active, pool0)
    RAID group /aggr1/plex0/rg0 (normal, block checksums)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   1a.39   1a    2   7   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      parity    1a.27   1a    1   11  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.42   1d    2   10  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.40   1d    2   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.55   1d    3   7   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.56   1a    3   8   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.25   1a    1   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.75   1d    4   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.73   1a    4   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (replacing, copy in progress)
      -> copy   1a.60   1a    3   12  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (copy 0% completed)
      data      1d.72   1d    4   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.58   1d    3   10  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.23   1d    1   7   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.71   1a    4   7   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.43   1d    2   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304


filer01a> aggr status -r aggr1
Aggregate aggr1 (online, raid_dp) (block checksums)
  Plex /aggr1/plex0 (online, normal, active, pool0)
    RAID group /aggr1/plex0/rg0 (normal, block checksums)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   1a.39   1a    2   7   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      parity    1a.27   1a    1   11  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.42   1d    2   10  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.40   1d    2   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.55   1d    3   7   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.56   1a    3   8   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.25   1a    1   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.75   1d    4   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.73   1a    4   9   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (replacing, copy in progress)
      -> copy   1a.60   1a    3   12  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304 (copy 0% completed)
      data      1d.72   1d    4   8   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.58   1d    3   10  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.23   1d    1   7   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.71   1a    4   7   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.43   1d    2   11  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304

    RAID group /aggr1/plex0/rg1 (normal, block checksums)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   1a.59   1a    3   11  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      parity    1d.41   1d    2   9   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.57   1d    3   9   FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.24   1a    1   8   FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.74   1a    4   10  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.26   1a    1   10  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.44   1a    2   12  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.76   1a    4   12  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.45   1d    2   13  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.61   1d    3   13  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1d.29   1d    1   13  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
      data      1a.77   1a    4   13  FC:B   0   ATA  7200 635555/1301618176 635858/1302238304

Physically replace the disk and then assign the newly inserted disk to the filer:

filer01a> disk show -n
  DISK       OWNER                      POOL   SERIAL NUMBER         HOME
------------ -------------              -----  -------------         -------------
1a.73        Not Owned                  NONE   N034TX1L
filer01a> disk assign 1a.73 -o filer01a
filer01a> disk show -n
disk show: No disks match option -n.
filer01a> disk show -v
  DISK       OWNER                      POOL   SERIAL NUMBER         HOME
------------ -------------              -----  -------------         -------------
0d.52        filer01b(151000001)    Pool0  3SJ0WW3V00009035Q971  filer01b(151000001)
0c.51        filer01a(151000000)    Pool0  J0XP8DVN              filer01a(151000000)
0c.61        filer01a(151000000)    Pool0  6SJ4XYHJ0000B2021M8H  filer01a(151000000)
0d.54        filer01b(151000001)    Pool0  3SJ0WPA000009035HK94  filer01b(151000001)
...<cut>...
1a.61        filer01a(151000000)    Pool0  P8H78HMF              filer01a(151000000)
1a.73        filer01a(151000000)    Pool0  N034TX1L              filer01a(151000000)
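The assignment step can be sketched as a small parser that reads the "disk show -n" output and produces one "disk assign" command per unowned disk. This is a Python sketch under the assumption that unowned rows always read "Not Owned" in the OWNER column, as in the output above; the function names are hypothetical, and the commands are only generated, not executed.

```python
def unowned_disks(output):
    """Return the disk names of all rows whose OWNER column is 'Not Owned'."""
    disks = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows look like: "1a.73  Not Owned  NONE  N034TX1L"
        if len(fields) >= 3 and fields[1:3] == ["Not", "Owned"]:
            disks.append(fields[0])
    return disks


def assign_commands(output, owner):
    """Build 'disk assign <disk> -o <owner>' for every unowned disk."""
    return [f"disk assign {disk} -o {owner}" for disk in unowned_disks(output)]


# Output copied from the session above.
sample = """\
  DISK       OWNER                      POOL   SERIAL NUMBER         HOME
------------ -------------              -----  -------------         -------------
1a.73        Not Owned                  NONE   N034TX1L
"""
print(assign_commands(sample, "filer01a"))
```

When there are no unowned disks, the filer prints "disk show: No disks match option -n." and the sketch returns an empty list.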

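Disk Replace Message About Wrong Size

If the spare disk is larger than the disk it replaces, disk replace warns that only part of the spare's capacity will be used: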

filer01a> disk replace start 1d.29 1a.73
*** You are about to copy and replace the following file system disk ***
  Disk /aggr1/plex0/rg1/1d.29

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      data      1d.29   1d    1   13  FC:A   0   ATA  7200 635555/1301618176 635858/1302238304
***
Disk 1a.73 is bigger than disk 1d.29.
Only 636 GB will be used on disk 1a.73.
Really replace disk 1d.29 with 1a.73? y
disk replace: Disk 1d.29 was marked for replacing.
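The 636 GB in this message is consistent with the "Used" size of the disk being replaced (635555 MB), if you assume 1 GB = 1000 MB and rounding to the nearest whole GB. I have not confirmed that this is how ONTAP calculates the figure; the Python sketch below only shows that the numbers line up. The function name is hypothetical.

```python
def mb_to_whole_gb(size_mb):
    """Convert MB to GB, assuming 1 GB = 1000 MB, rounded to the nearest GB."""
    return round(size_mb / 1000)


used_mb = 635555  # "Used" size of disk 1d.29 from the output above
print(mb_to_whole_gb(used_mb))  # 636
```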

OnCommand Status

You can also see the pending status in OnCommand:
netappdiskreplace01.jpg

netappdiskreplace.txt · Last modified: 2013/10/08 21:44 by sjoerd