Beiträge getaggt mit High Availability

Dealing with Oracle Database Block Corruption

recover

Media errors don’t always destroy files completely. Sometimes, only small parts of the file are damaged respectively corrupted. It may even not be noticed by end users or admins for a while. This article shows how to detect block corruption and recover from it.  The demo is done on 11g but the shown techniques work in the same way for 12c also. I have corrupted blocks on my demo database affecting the emp table of the user scott:

SQL> select * from scott.emp;
select * from scott.emp
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 4, block # 131)
ORA-01110: data file 4: '/home/oracle/prima/users01.dbf'

This shows that not the whole tablespace is affected:

SQL> select * from scott.dept;

 DEPTNO DNAME          LOC
---------- -------------- -------------
 10 ACCOUNTING     NEW YORK
 20 RESEARCH       DALLAS
 30 SALES          CHICAGO
 40 OPERATIONS     BOSTON

SQL> select table_name,tablespace_name from dba_tables where owner='SCOTT';

TABLE_NAME                     TABLESPACE_NAME
------------------------------ ------------------------------
DEPT                           USERS
EMP                            USERS

Whenever we get these kind of error messages, we need to check all the blocks. Typically, error messages about block corruption come up during an RMAN backup, but I like to defer that a little to show an 11g New Feature before. Checking all blocks now:

RMAN> validate check logical database;

Starting validate at 16-NOV-10
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=107 device type=DISK
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00001 name=/home/oracle/prima/system01.dbf
input datafile file number=00002 name=/home/oracle/prima/sysaux01.dbf
input datafile file number=00003 name=/home/oracle/prima/undotbs01.dbf
input datafile file number=00004 name=/home/oracle/prima/users01.dbf
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
1    OK     0              17594        38400           277491
 File Name: /home/oracle/prima/system01.dbf
 Block Type Blocks Failing Blocks Processed
 ---------- -------------- ----------------
 Data       0              13854
 Index      0              4487
 Other      0              2465

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
2    OK     0              20381        25600           277631
 File Name: /home/oracle/prima/sysaux01.dbf
 Block Type Blocks Failing Blocks Processed
 ---------- -------------- ----------------
 Data       0              869
 Index      0              957
 Other      0              3393

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
3    OK     0              541          22784           277631
 File Name: /home/oracle/prima/undotbs01.dbf
 Block Type Blocks Failing Blocks Processed
 ---------- -------------- ----------------
 Data       0              0
 Index      0              0
 Other      0              22243

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
4    FAILED 0              1133         1280            271968
 File Name: /home/oracle/prima/users01.dbf
 Block Type Blocks Failing Blocks Processed
 ---------- -------------- ----------------
 Data       0              10
 Index      0              0
 Other      1              137

validate found one or more corrupt blocks
See trace file /home/oracle/prima/diag/rdbms/prima/prima/trace/prima_ora_18316.trc for details
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
including current control file for validation
including current SPFILE in backup set
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Control File and SPFILE
===============================
File Type    Status Blocks Failing Blocks Examined
------------ ------ -------------- ---------------
SPFILE       OK     0              2
Control File OK     0              612
Finished validate at 16-NOV-10

We have already a couple of 11g New Features here: The syntax has changed from backup validate (since 9i) to just validate (11g) – probably to make clear that this does not perform a backup but a check of corrupted blocks instead. Before 11g, the command did not show the verbose list of checked respectively corrupted blocks like we see above.

The addition check logical will also check for logical block corruption, which is not done by default.

Checking all the blocks here is more efficient than doing an immediate recovery of the one block mentioned in the error message above. There may be many more not spotted yet. Same is true for an ordinary backup that would interrupt at the first spotted corrupted block as we will see later on.

The validate command populated the view v$database_block_corruption, that is now internally read by RMAN in order to repair all the found corrupted blocks. The next 11g New Feature here is: It will take the block out of the Flashback Logs, if present there!

RMAN> blockrecover corruption list;

Starting recover at 16-NOV-10
using channel ORA_DISK_1
searching flashback logs for block images
finished flashback log search, restored 1 blocks

starting media recovery
media recovery complete, elapsed time: 00:00:01

Finished recover at 16-NOV-10

I was so bold that I did not even take a backup before – to make sure this new feature must be used:

RMAN> list backup;

specification does not match any backup in the repository

I’m going to take a backup now, but before that, I cause again block corruption. So we will see that RMAN stops at the first noticed corrupted block. No Third-Party-Tool would recognize the block corruption, BTW, so we have another reason to actually use RMAN here. If we say backup check logical database instead of just backup database, RMAN will also check for logical block corruption during the backup.

[oracle@uhesse-pc skripte]$ rman target sys/oracle@prima

Recovery Manager: Release 11.2.0.2.0 - Production on Tue Nov 16 15:22:14 2010

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

connected to target database: PRIMA (DBID=1967518488)

RMAN> backup database;

Starting backup at 16-NOV-10
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=110 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00001 name=/home/oracle/prima/system01.dbf
input datafile file number=00002 name=/home/oracle/prima/sysaux01.dbf
input datafile file number=00003 name=/home/oracle/prima/undotbs01.dbf
input datafile file number=00004 name=/home/oracle/prima/users01.dbf
channel ORA_DISK_1: starting piece 1 at 16-NOV-10
RMAN-03009: failure of backup command on ORA_DISK_1 channel at 11/16/2010 15:22:22
ORA-19566: exceeded limit of 0 corrupt blocks for file /home/oracle/prima/users01.dbf
continuing other job steps, job failed will not be re-run
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
including current control file in backup set
including current SPFILE in backup set
channel ORA_DISK_1: starting piece 1 at 16-NOV-10
channel ORA_DISK_1: finished piece 1 at 16-NOV-10
piece handle=/home/oracle/flashback/PRIMA/backupset/2010_11_16/o1_mf_ncsnf_TAG20101116T152221_6g54wzkb_.bkp tag=TAG20101116T152221 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================

RMAN-03009: failure of backup command on ORA_DISK_1 channel at 11/16/2010 15:22:22
ORA-19566: exceeded limit of 0 corrupt blocks for file /home/oracle/prima/users01.dbf

Again the same sequence as above validate check logical database & blockrecover corruption list will solve the problem. During the whole process, the users tablespace remains online and usable, except the emp table of scott.

Conclusion: We have a powerful tool with RMAN to spot and repair corrupted blocks by using intact versions of the corrupted blocks from backup (since 9i already) or even from Flashback Logs (since 11g) – which is probably faster – while keeping up the availability of the affected tablespace.

, , , ,

25 Kommentare

Advert: I will present at the DOAG Conference about 11g Data Guard New Features

The German Oracle User Group (DOAG) is staging their annual Conference. I will give a presentation on 19-NOV-2010 about 11g Data Guard New Features there. It will be in German Language and cover

  • Creation of Physical Standby Database without previous Backup
  • Embedding the Primary & Standby into Oracle Restart
  • Real-Time Query (Using Physical Standby for Reporting while Redo Apply is on)
  • Snapshot Standby (Using Physical Standby for Testing while still receiving Redo Protocol from Primary)
  • Fast-Start Failover in Maximum Performance Mode

Every Topic will be accompanied by Live Demonstrations on the Command Line. You may enroll online here.

, ,

Hinterlasse einen Kommentar

Voting Disk and OCR in 11gR2: Some changes

Having just delivered an Oracle Database 11gR2 RAC Admin course, I’d like to point out some remarkable changes in the way we handle now the important Clusterware components Voting Disk and Oracle Cluster Registry (OCR): Amazingly, we can now store the two inside of an Automatic Storage Management (ASM) Disk Group, which was not possible in 10g.

The OCR is striped and mirrored (if we have a redundancy other than external), similar as ordinary Database Files are. So we can now leverage the mirroring capabilities of ASM to mirror the OCR also, without having to use multiple RAW devices for that purpose only. The Voting Disk (or Voting File, as it is now also referred to) is not striped but put as a whole on ASM Disks – if we use a redundancy of normal on the Diskgroup, 3 Voting Files are placed, each on one ASM Disk into a different failgroup. Therefore, you need to have at least 3 failgroups for that diskgroup, like on this picture:

ASM Diskgroup that contains Voting Files and OCR

This is a concern, if our ASM Diskgroups consist of only 2 ASM Disks respectively only 2 failgroups like with Extended RAC! Therefore, the new quorum failgroup clause was introduced:

create diskgroup data normal redundancy
 failgroup fg1 disk 'ORCL:ASMDISK1'
 failgroup fg2 disk 'ORCL:ASMDISK2'
 quorum failgroup fg3 disk 'ORCL:ASMDISK3'
 attribute 'compatible.asm' = '11.2.0.0.0';

The failgroup fg3 above needs only one small Disk (300 MB should be on the safe side here, since the Voting File is only about 280 MB in size) to keep one Mirror of the Voting File. fg1 and fg2 will contain each one Voting File and all the other stripes of the Database Area as well, but fg3 will only get that one Voting File.

[root@uhesse1 ~]#  /u01/app/11.2.0/grid/bin/crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   511de6e64e354f9bbf4be318fc928c28 (ORCL:ASMDISK1) [DATA]
 2. ONLINE   2f1973ed4be84f50bffc2475949b428f (ORCL:ASMDISK2) [DATA]
 3. ONLINE   5ed44fb7e79c4f79bfaf09b402ba70df (ORCL:ASMDISK3) [DATA]

Another important change regarding the Voting File is that it is no longer supported to take a manual backup of it with dd. Instead, the Voting File gets backed up automatically into the OCR. As a New Feature, you can now do a manual backup of the OCR any time you like, without having to wait until that is done automatically – which is also still done:

[root@uhesse1 ~]# /u01/app/11.2.0/grid/bin/ocrconfig -showbackup

uhesse1     2010/10/06 09:37:30     /u01/app/11.2.0/grid/cdata/cluhesse/backup00.ocr
uhesse1     2010/10/06 05:37:29     /u01/app/11.2.0/grid/cdata/cluhesse/backup01.ocr
uhesse1     2010/10/06 01:37:27     /u01/app/11.2.0/grid/cdata/cluhesse/backup02.ocr
uhesse1     2010/10/05 01:37:21     /u01/app/11.2.0/grid/cdata/cluhesse/day.ocr
uhesse1     2010/10/04 13:37:19     /u01/app/11.2.0/grid/cdata/cluhesse/week.ocr

Above are the automatic backups of the OCR as in earlier versions. Now the manual backup:

[root@uhesse1 ~]# /u01/app/11.2.0/grid/bin/ocrconfig -manualbackup
uhesse1     2010/10/06 13:07:03     /u01/app/11.2.0/grid/cdata/cluhesse/backup_20101006_130703.ocr

I got a manual backup on the default location on my master node. We can define another backup location for the automatic backups as well as for the manual backups – preferrable on a Shared Device that is accessible by all the nodes (which is not the case with /home/oracle, unfortunately 🙂 ):

[root@uhesse1 ~]# /u01/app/11.2.0/grid/bin/ocrconfig -backuploc /home/oracle
[root@uhesse1 ~]# /u01/app/11.2.0/grid/bin/ocrconfig -manualbackup
uhesse1     2010/10/06 13:10:50     /home/oracle/backup_20101006_131050.ocr
uhesse1     2010/10/06 13:07:03     /u01/app/11.2.0/grid/cdata/cluhesse/backup_20101006_130703.ocr

[root@uhesse1 ~]# /u01/app/11.2.0/grid/bin/ocrconfig -showbackup
uhesse1     2010/10/06 09:37:30     /u01/app/11.2.0/grid/cdata/cluhesse/backup00.ocr
uhesse1     2010/10/06 05:37:29     /u01/app/11.2.0/grid/cdata/cluhesse/backup01.ocr
uhesse1     2010/10/06 01:37:27     /u01/app/11.2.0/grid/cdata/cluhesse/backup02.ocr
uhesse1     2010/10/05 01:37:21     /u01/app/11.2.0/grid/cdata/cluhesse/day.ocr
uhesse1     2010/10/04 13:37:19     /u01/app/11.2.0/grid/cdata/cluhesse/week.ocr
uhesse1     2010/10/06 13:10:50     /home/oracle/backup_20101006_131050.ocr
uhesse1     2010/10/06 13:07:03     /u01/app/11.2.0/grid/cdata/cluhesse/backup_20101006_130703.ocr

Conclusion: The way to handle Voting Disk and OCR has changed significantly – they can be kept inside of an ASM Diskgroup especially.

, ,

88 Kommentare