Starting with version 126.96.36.199.1, Exadata Cells use Pro-Active Disk Quarantine to override any setting of DISK_REPAIR_TIME. This and some other topics related to ASM mirroring on Exadata Storage Servers are explained in a recent posting by my dear colleague Joel Goodman. Even if you are familiar with ASM in non-Exadata environments, you may not have used ASM redundancy yet and may therefore benefit from his explanations.
Addendum: The headline may be a little misleading, as I just became aware. DISK_REPAIR_TIME set on an ASM Diskgroup that is built upon Exadata Storage Cells is still in use and valid. It just no longer applies at the disk level (Griddisk on Exadata) but at the Cell level instead.
In other words: If a physical disk inside a Cell gets damaged, the Griddisks built upon this damaged disk are dropped from the ASM Diskgroups immediately, without waiting for DISK_REPAIR_TIME, due to Pro-Active Disk Quarantine. But if a whole Cell goes offline (a reboot of that Storage Server, for example), the dependent ASM disks are not dropped from the respective Diskgroups for the duration of DISK_REPAIR_TIME.
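The distinction above can be written down as a small decision rule. The following Python sketch is only a toy model of the behavior described in this post, not anything ASM or the Cell software actually runs. The names `Failure` and `asm_reaction` are made up for illustration, and the 3.6 hour default for DISK_REPAIR_TIME is an assumption that you should check against your own Diskgroup attributes.

```python
from dataclasses import dataclass

# Assumed default of DISK_REPAIR_TIME in hours; verify on your own system.
DEFAULT_DISK_REPAIR_TIME_HOURS = 3.6


@dataclass(frozen=True)
class Failure:
    """Hypothetical description of an outage, for illustration only."""
    scope: str           # "physical_disk" or "cell"
    outage_hours: float  # how long the component stays unavailable


def asm_reaction(failure, disk_repair_time_hours=DEFAULT_DISK_REPAIR_TIME_HOURS):
    """Return what happens to the affected ASM disks in this toy model."""
    if failure.scope == "physical_disk":
        # Pro-Active Disk Quarantine: no waiting for DISK_REPAIR_TIME.
        return "dropped_immediately"
    if failure.scope == "cell":
        # Whole cell offline: disks are kept while DISK_REPAIR_TIME runs.
        if failure.outage_hours <= disk_repair_time_hours:
            return "kept"
        return "dropped_after_timeout"
    raise ValueError(f"unknown failure scope: {failure.scope!r}")


if __name__ == "__main__":
    print(asm_reaction(Failure("physical_disk", 0.1)))
    print(asm_reaction(Failure("cell", 0.5)))
    print(asm_reaction(Failure("cell", 5.0)))
```

In this model a cell reboot that takes half an hour leaves the ASM disks in place, while a failed physical disk leads to an immediate drop no matter how DISK_REPAIR_TIME is set.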
#1 by Albert on October 6, 2011 - 09:46
I do like this site from my Oracle Exadata and RAC interest perspective! Very good indeed!
#2 by Uwe Hesse on October 8, 2011 - 08:07
Thank you, Albert, for stopping by now & then and for your nice words!
#3 by cfakhry on October 31, 2013 - 12:57
It's good to know about this latest feature (I was wondering what benefit we get from choosing HIGH redundancy on ASM / Exadata).
#4 by Ammar Semle on February 28, 2014 - 18:48
Hi Uwe, this article and the linked article by Joel were very useful indeed. I always had this confusion w.r.t. DISK_REPAIR_TIME in Exadata. Keep up the good work.
#5 by Amos on October 20, 2014 - 18:29
Hi Uwe, thanks for the post. A question: due to Pro-Active Disk Quarantine, grid disks on a bad physical disk get dropped immediately, and rebalancing kicks in right away. Would it be better to just shut down the cell in order to make use of the disk_repair_time setting and avoid the expensive rebalancing? It may sound crazy, but in my experience, the rebalance on my X2 half rack usually took about 8 to 10 hours after the bad disk was replaced, because the power limit was set to 10 (not higher, due to the likely performance impact). It's a very heavily loaded system. Once, while the rebalance was running, another event happened (a bad CPU), which then caused the whole cluster to crash. So my point is: since the cluster is not protected under normal redundancy during the disk rebalancing, why not just shut down the cell with the bad disk, immediately replace the bad disk with one already staged onsite, and then restart the cell, with no rebalance needed? All of this could happen within 10 minutes, assuming there are spare disks staged onsite.