Oracle Database 11.2.0.4 RAC CRS diskgroup auto-dismount issue

Date: 2022-06-29 09:41:58  Editor: 袖梨  Source: 一聚教程网

I have been on a business trip to Gansu these past few days; a customer's cluster there had a small problem. CRS checks on one of the nodes reported anomalies, yet, oddly, the business workload was not affected at all. Here are my check results:


[grid@tysy1 bin]$ ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
 
[grid@tysy1 bin]$ crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   6fe50eed90fe4f44bf68c2da8459bfc1 (ORCL:OCR_VOTE01) [CRS]
 2. ONLINE   7afb21a9c8c94f27bf6aecb55d4e6e58 (ORCL:OCR_VOTE02) [CRS]
 3. ONLINE   77c7c00e799a4f47bf41316f27a1d95c (ORCL:OCR_VOTE03) [CRS]
Located 3 voting disk(s).

I found that crsctl status res -t -init ran fine, while crsctl status res -t returned errors. Since the customer uses ASMLib, I checked and found all the ASMLib disks to be normal, as shown below:


[grid@tysy1 bin]$  /etc/init.d/oracleasm  listdisks
ARCHVOL
DATAVOL01
DATAVOL02
DATAVOL03
DATAVOL04
FRAVOL
OCR_VOTE01
OCR_VOTE02
OCR_VOTE03
 
[root@tysy1 disks]# /etc/init.d/oracleasm querydisk -d OCR_VOTE01
Disk "OCR_VOTE01" is a valid ASM disk on device [8,21]
[root@tysy1 disks]# /etc/init.d/oracleasm querydisk -d OCR_VOTE02
Disk "OCR_VOTE02" is a valid ASM disk on device [8,22]
[root@tysy1 disks]# /etc/init.d/oracleasm querydisk -d OCR_VOTE03
Disk "OCR_VOTE03" is a valid ASM disk on device [8,23]
[root@tysy1 disks]# ls -ltr /dev/sdb*
brw-rw---- 1 root disk 8, 16 Sep  1 20:23 /dev/sdb
brw-rw---- 1 root disk 8, 17 Sep  1 20:23 /dev/sdb1
brw-rw---- 1 root disk 8, 21 Sep  1 20:23 /dev/sdb5
brw-rw---- 1 root disk 8, 22 Sep  1 20:23 /dev/sdb6
brw-rw---- 1 root disk 8, 23 Sep  1 20:23 /dev/sdb7
brw-rw---- 1 root disk 8, 24 Sep  1 20:23 /dev/sdb8
brw-rw---- 1 root disk 8, 25 Sep  1 20:23 /dev/sdb9
brw-rw---- 1 root disk 8, 26 Sep  1 20:23 /dev/sdb10
brw-rw---- 1 root disk 8, 27 Sep  1 20:23 /dev/sdb11
brw-rw---- 1 root disk 8, 28 Sep  1 20:23 /dev/sdb12
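The manual cross-check above, matching the [major,minor] pairs reported by oracleasm querydisk against the ls -ltr /dev/sdb* output, can be scripted. Below is a small sketch; the (8, 21) pair for OCR_VOTE01 comes from the transcript, while find_block_device is a hypothetical helper of my own, not part of ASMLib:

```python
import os
import stat

def find_block_device(major: int, minor: int, devdir: str = "/dev"):
    """Return the first block device node in devdir matching major/minor.

    Mirrors the manual check above: oracleasm querydisk reports
    OCR_VOTE01 as device [8,21], and ls -l shows /dev/sdb5 as "8, 21".
    """
    target = os.makedev(major, minor)
    for name in sorted(os.listdir(devdir)):
        path = os.path.join(devdir, name)
        try:
            st = os.stat(path)
        except OSError:
            continue  # node vanished or is unreadable; skip it
        if stat.S_ISBLK(st.st_mode) and st.st_rdev == target:
            return path
    return None
```

On the host above, find_block_device(8, 21) would resolve OCR_VOTE01 to /dev/sdb5.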

The ASMLib checks all came back OK. Meanwhile, querying the ASM disks on node 1 shows that the disks of the CRS disk group are not recognized:


[grid@tysy1 bin]$
 
NAME       PATH              STATE    GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU MODE_ST LABEL
---------- ----------------- -------- ------------ ----------- ------- ------------ ------- ------------
           ORCL:OCR_VOTE01   NORMAL              0           0 CLOSED  MEMBER       ONLINE  OCR_VOTE01
           ORCL:OCR_VOTE02   NORMAL              0           1 CLOSED  MEMBER       ONLINE  OCR_VOTE02
           ORCL:OCR_VOTE03   NORMAL              0           2 CLOSED  MEMBER       ONLINE  OCR_VOTE03
ARCHVOL    ORCL:ARCHVOL      NORMAL              1           0 CACHED  MEMBER       ONLINE  ARCHVOL
DATAVOL01  ORCL:DATAVOL01    NORMAL              5           0 CACHED  MEMBER       ONLINE  DATAVOL01
DATAVOL02  ORCL:DATAVOL02    NORMAL              5           1 CACHED  MEMBER       ONLINE  DATAVOL02
DATAVOL03  ORCL:DATAVOL03    NORMAL              6           0 CACHED  MEMBER       ONLINE  DATAVOL03
DATAVOL04  ORCL:DATAVOL04    NORMAL              6           1 CACHED  MEMBER       ONLINE  DATAVOL04
FRAVOL     ORCL:FRAVOL       NORMAL              7           0 CACHED  MEMBER       ONLINE  FRAVOL

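The listing above is the classic V$ASM_DISK picture of this failure: the three OCR_VOTE disks show GROUP_NUMBER 0 and MOUNT_S CLOSED while HEADER_STATU is still MEMBER, i.e. the disk headers are intact but no mounted group claims the disks. The original query was not captured in the transcript; a query of roughly this shape (an assumed reconstruction, not the customer's exact command) produces that output:

```sql
-- Assumed reconstruction; the actual command was not captured above.
col name for a10
col path for a17
select name, path, state, group_number, disk_number,
       mount_status, header_status, mode_status, label
  from v$asm_disk
 order by group_number, disk_number;
```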
The CRS logs likewise show that Oracle considers the CRS disk group inaccessible:


2015-09-02 14:06:19.302: [OCRCHECK][1069037312]ocrcheck starts...
2015-09-02 14:06:19.369: [  OCRASM][1069037312]proprasmo: kgfoCheckMount return [6]. Cannot proceed with dirty open.
2015-09-02 14:06:19.369: [  OCRASM][1069037312]proprasmo: Error in open/create file in dg [CRS]
[  OCRASM][1069037312]SLOS : SLOS: cat=6, opn=kgfo, dep=0, loc=kgfoCkMt03
 
2015-09-02 14:06:19.369: [  OCRASM][1069037312]ASM Error Stack :
2015-09-02 14:06:19.391: [  OCRASM][1069037312]proprasmo: kgfoCheckMount returned [6]
2015-09-02 14:06:19.391: [  OCRASM][1069037312]proprasmo: The ASM disk group CRS is not found or not mounted
2015-09-02 14:06:19.392: [  OCRRAW][1069037312]proprioo: Failed to open [+CRS]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2015-09-02 14:06:19.392: [  OCRRAW][1069037312]proprioo: No OCR/OLR devices are usable
2015-09-02 14:06:19.392: [  OCRASM][1069037312]proprasmcl: asmhandle is NULL
2015-09-02 14:06:19.392: [  OCRRAW][1069037312]proprinit: Could not open raw device
2015-09-02 14:06:19.392: [  OCRASM][1069037312]proprasmcl: asmhandle is NULL
2015-09-02 14:06:19.392: [ default][1069037312]a_init:7!: Backend init unsuccessful : [26]
2015-09-02 14:06:19.392: [OCRCHECK][1069037312]initreboot: Failed to initialize OCR in REBOOT level. Retval:[26] Error:[PROC-26: Error while accessing the physical storage
]
[OCRCHECK][1069037312]initreboot: Attempting to initialize OCR in DEFAULT level and update a key so that vote information is updated.
2015-09-02 14:06:19.393: [  OCRMSG][1069037312]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2015-09-02 14:06:19.393: [  OCRMSG][1069037312]GIPC error [29] msg [gipcretConnectionRefused]
2015-09-02 14:06:19.393: [  OCRMSG][1069037312]prom_connect: error while waiting for connection complete [24]
2015-09-02 14:06:19.394: [OCRCHECK][1069037312]initreboot: Failed to initialize OCR in DEFAULT level. Retval:[32] Error:[PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]]
2015-09-02 14:06:19.394: [OCRCHECK][1069037312]Failed to access OCR repository. Retval [26] Error [PROC-26: Error while accessing the physical storage
]
2015-09-02 14:06:19.394: [OCRCHECK][1069037312]Failed to initialize ocrchek2
2015-09-02 14:06:19.394: [OCRCHECK][1069037312]Exiting [status=failed]...

At this point I had no leads at all, so let's carefully analyze the ASM alert log, shown below:


Tue Sep 01 23:07:03 2015
NOTE: process _b000_+asm1 (19088) initiating offline of disk 0.3915955003 (OCR_VOTE01) with mask 0x7e in group 4
NOTE: process _b000_+asm1 (19088) initiating offline of disk 1.3915955004 (OCR_VOTE02) with mask 0x7e in group 4
NOTE: process _b000_+asm1 (19088) initiating offline of disk 2.3915955005 (OCR_VOTE03) with mask 0x7e in group 4
NOTE: checking PST: grp = 4
GMON checking disk modes for group 4 at 25 for pid 30, osid 19088
ERROR: no read quorum in group: required 2, found 0 disks
NOTE: checking PST for grp 4 done.
NOTE: initiating PST update: grp = 4, dsk = 0/0xe968bb3b, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 4, dsk = 1/0xe968bb3c, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 4, dsk = 2/0xe968bb3d, mask = 0x6a, op = clear
GMON updating disk modes for group 4 at 26 for pid 30, osid 19088
ERROR: no read quorum in group: required 2, found 0 disks
Tue Sep 01 23:07:03 2015
NOTE: cache dismounting (not clean) group 4/0xD7184BC4 (CRS)
WARNING: Offline for disk OCR_VOTE01 in mode 0x7f failed.
WARNING: Offline for disk OCR_VOTE02 in mode 0x7f failed.
WARNING: Offline for disk OCR_VOTE03 in mode 0x7f failed.
NOTE: messaging CKPT to quiesce pins Unix process pid: 19090, image: oracle@tysy1 (B001)
Tue Sep 01 23:07:03 2015
NOTE: halting all I/Os to diskgroup 4 (CRS)
Tue Sep 01 23:07:03 2015
NOTE: LGWR doing non-clean dismount of group 4 (CRS)
NOTE: LGWR sync ABA=15.189 last written ABA 15.189
Tue Sep 01 23:07:03 2015
kjbdomdet send to inst 2
detach from dom 4, sending detach message to inst 2
Tue Sep 01 23:07:03 2015
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 4)
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 4 invalid = TRUE
 0 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Tue Sep 01 23:07:03 2015
WARNING: dirty detached from domain 4
NOTE: cache dismounted group 4/0xD7184BC4 (CRS)
SQL> alter diskgroup CRS dismount force /* ASM SERVER:3608693700 */
Tue Sep 01 23:07:03 2015
NOTE: cache deleting context for group CRS 4/0xd7184bc4
GMON dismounting group 4 at 27 for pid 35, osid 19090
NOTE: Disk OCR_VOTE01 in mode 0x7f marked for de-assignment
NOTE: Disk OCR_VOTE02 in mode 0x7f marked for de-assignment
NOTE: Disk OCR_VOTE03 in mode 0x7f marked for de-assignment
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 4
Tue Sep 01 23:07:18 2015
ASM Health Checker found 1 new failures
Tue Sep 01 23:07:33 2015
SUCCESS: diskgroup CRS was dismounted
SUCCESS: alter diskgroup CRS dismount force /* ASM SERVER:3608693700 */
Tue Sep 01 23:07:33 2015
NOTE: diskgroup resource ora.CRS.dg is offline
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group CRS
Tue Sep 01 23:07:33 2015
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_4208.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
......
......
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_4208.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
Wed Sep 02 00:40:16 2015
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_4208.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
WARNING: requested mirror side 1 of virtual extent 4 logical extent 0 offset 294912 is not allocated; I/O request failed
WARNING: requested mirror side 2 of virtual extent 4 logical extent 1 offset 294912 is not allocated; I/O request failed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_4208.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-15078: ASM diskgroup was forcibly dismounted
Wed Sep 02 00:40:16 2015
SQL> alter diskgroup CRS check /* proxy */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "CRS" does not exist or is not mounted
ERROR: alter diskgroup CRS check /* proxy */
Wed Sep 02 00:41:33 2015
NOTE: client exited [4197]
Wed Sep 02 00:41:34 2015
NOTE: [crsd.bin@tysy1 (TNS V1-V3) 13821] opening OCR file
Wed Sep 02 00:41:36 2015
NOTE: [crsd.bin@tysy1 (TNS V1-V3) 13834] opening OCR file
......
......
Wed Sep 02 14:06:08 2015
NOTE: [ocrcheck.bin@tysy1 (TNS V1-V3) 16749] opening OCR file
Wed Sep 02 14:06:19 2015
NOTE: [ocrcheck.bin@tysy1 (TNS V1-V3) 16815] opening OCR file
Wed Sep 02 14:13:58 2015
NOTE: [ocrcheck.bin@tysy1 (TNS V1-V3) 18667] opening OCR file
Wed Sep 02 14:17:52 2015
NOTE: [ocrcheck.bin@tysy1 (TNS V1-V3) 19633] opening OCR file

We can see that ASM ran a health check, found it failed (ASM Health Checker found 1 new failures), and then forcibly dismounted the CRS disk group. Let's dig further into the contents of the trace file:


[grid@tysy1 client]$ cat /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_4208.trc
Trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_4208.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
ORACLE_HOME = /u01/app/11.2.0/grid
System name:    Linux
Node name:      tysy1
Release:        2.6.32-358.el6.x86_64
Version:        #1 SMP Tue Jan 29 11:47:41 EST 2013
Machine:        x86_64
VM name:        VMWare Version: 6
Instance name: +ASM1
Redo thread mounted by this instance: 0
Oracle process number: 26
Unix process pid: 4208, image: oracle@tysy1 (TNS V1-V3)
 
*** 2015-09-01 23:07:33.936
*** SESSION ID:(289.3) 2015-09-01 23:07:33.936
*** CLIENT ID:() 2015-09-01 23:07:33.936
*** SERVICE NAME:() 2015-09-01 23:07:33.936
*** MODULE NAME:(crsd.bin@tysy1 (TNS V1-V3)) 2015-09-01 23:07:33.936
*** ACTION NAME:() 2015-09-01 23:07:33.936
 
WARNING:failed xlate 1
ORA-15078: ASM diskgroup was forcibly dismounted
......
......
WARNING:failed xlate 1
ORA-15078: ASM diskgroup was forcibly dismounted
 
*** 2015-09-02 00:40:16.217
WARNING:failed xlate 1
ORA-15078: ASM diskgroup was forcibly dismounted
ksfdrfms:Mirror Read file=+CRS.255.4294967295 fob=0x9f403518 bufp=0x7f48c1381a00 blkno=1096 nbytes=4096
WARNING:failed xlate 1
WARNING: requested mirror side 1 of virtual extent 4 logical extent 0 offset 294912 is not allocated; I/O request failed
ksfdrfms:Read failed from mirror side=1 logical extent number=0 dskno=65535
WARNING:failed xlate 1
WARNING: requested mirror side 2 of virtual extent 4 logical extent 1 offset 294912 is not allocated; I/O request failed
ksfdrfms:Read failed from mirror side=2 logical extent number=1 dskno=65535
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-15078: ASM diskgroup was forcibly dismounted

Going further, I analyzed the GMON process trace file and found call stack contents in it, as follows:


---+ASM1_gmon_4091.trc
*** 2015-09-01 22:55:11.686
ERROR: GMON could not validate any PST Hbeat (grp 6) for Validate
2015-09-01 22:55:11.686182 : kfdpHbeatblk_transfer: Found pending Hbeat I/O in grp=6 @
        HBeatCblk=0x7f2bf4f337f8 iop=0x7f2bf4f33808 iostate=3.
kfdpHbeatCB_dump: at 0x7f2bf4f337f8 with ts=09/01/2015 22:55:11 iop=0x7f2bf4f33808, grp=6, disk=0/3915955008, isWrite=0 Hbeat #0 state=3 iostate=3
----- Abridged Call Stack Trace -----
ksedsts()+465<-kfdpHbeatblk_transfer()+953<-kfdpHbeat_doCheck()+662<-kfdpHbeat_timeout()+854<-kfdp_timeoutBg()+1642<-ksbcti()+513<-ksbabs()+1735<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
 
----- End of Abridged Call Stack Trace -----
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
 
*** 2015-09-01 23:07:03.571
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 4.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 4.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 4.
NOTE: Set to be offline flag for disk OCR_VOTE01 only locally: flag 0x3211
NOTE: Set to be offline flag for disk OCR_VOTE02 only locally: flag 0x3211
NOTE: Set to be offline flag for disk OCR_VOTE03 only locally: flag 0x3211
WARNING: Waited 15 secs for write IO to PST disk 0 in group 4.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 4.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 4.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 5.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 5.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 7.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 7.
----- Abridged Call Stack Trace -----
ksedsts()+465<-kfdpGc_doTobeoflnAsync()+37<-kfdpGc_checkTobeofln()+623<-kfdpGc_timeout()+17<-kfdp_timeoutBg()+1657<-ksbcti()+513<-ksbabs()+1735<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
......
......
--------------- HBEAT ------------------
kfdpHbeat_dump: state=1, inst=1, ts=33023031.450354176,
        rnd=1036706912.2853146286.3238582348.2977351792.
kfk io-queue:    (nil)
kfdpHbeatCB_dump: at 0x7f2bf4d28248 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d28258, grp=4, disk=0/3915955003, isWrite=1 Hbeat #17 state=4 iostate=3
kfdpHbeatCB_dump: at 0x7f2bf4d28080 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d28090, grp=4, disk=1/3915955004, isWrite=1 Hbeat #17 state=4 iostate=3
kfdpHbeatCB_dump: at 0x7f2bf4d27eb8 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d27ec8, grp=4, disk=2/3915955005, isWrite=1 Hbeat #17 state=4 iostate=3
NOTE: kfdp_updateInt: forceDismount grp 4
NOTE: GMON: failed to update modes: triggering force dismount of group 4
2015-09-01 23:07:03.625783 : kfdpHbeatblk_transfer: Found pending Hbeat I/O in grp=4 @
        HBeatCblk=0x7f2bf4d28248 iop=0x7f2bf4d28258 iostate=3.
kfdpHbeatCB_dump: at 0x7f2bf4d28248 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d28258, grp=4, disk=0/3915955003, isWrite=1 Hbeat #17 state=4 iostate=3
2015-09-01 23:07:03.625823 : kfdpHbeatblk_transfer: Found pending Hbeat I/O in grp=4 @
        HBeatCblk=0x7f2bf4d28080 iop=0x7f2bf4d28090 iostate=3.
kfdpHbeatCB_dump: at 0x7f2bf4d28080 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d28090, grp=4, disk=1/3915955004, isWrite=1 Hbeat #17 state=4 iostate=3
2015-09-01 23:07:03.625844 : kfdpHbeatblk_transfer: Found pending Hbeat I/O in grp=4 @
        HBeatCblk=0x7f2bf4d27eb8 iop=0x7f2bf4d27ec8 iostate=3.
kfdpHbeatCB_dump: at 0x7f2bf4d27eb8 with ts=09/01/2015 23:06:47 iop=0x7f2bf4d27ec8, grp=4, disk=2/3915955005, isWrite=1 Hbeat #17 state=4 iostate=3
----- Abridged Call Stack Trace -----
ksedsts()+465<-kfdpHbeatblk_transfer()+953<-kfdp_forceDismountAsyncGmon()+619<-kfdp_updateInt()+664<-kfdp_updateDskBg()+814<-ksbabs()+771<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
 
----- End of Abridged Call Stack Trace -----
 
*** 2015-09-01 23:07:03.906
GMON dismounting group 4 at 27 for pid 35, osid 19090
NOTE: kfdp_doDismount: dismount grp 4
 
*** 2015-09-01 23:07:06.609
NOTE: kfdpUtil_freeSlMsg: ksvtst for grp: 4 KSV status 0x10
 
*** 2015-09-01 23:07:36.732
NOTE: kfdpUtil_freeSlMsg: ksvtst for grp: 4 KSV status 0x10
ERROR: GMON could not validate any PST Hbeat (grp 1) for Validate
 
*** 2015-09-01 23:11:03.981
2015-09-01 23:11:03.981362 : kfdpHbeatblk_transfer: Found pending Hbeat I/O in grp=1 @
        HBeatCblk=0x7f2bf4f32f10 iop=0x7f2bf4f32f20 iostate=3.
kfdpHbeatCB_dump: at 0x7f2bf4f32f10 with ts=09/01/2015 23:11:03 iop=0x7f2bf4f32f20, grp=1, disk=0/3915955002, isWrite=0 Hbeat #0 state=3 iostate=3
----- Abridged Call Stack Trace -----
ksedsts()+465<-kfdpHbeatblk_transfer()+953<-kfdpHbeat_doCheck()+662<-kfdpHbeat_timeout()+854<-kfdp_timeoutBg()+1642<-ksbcti()+513<-ksbabs()+1735<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
 
----- End of Abridged Call Stack Trace -----

Based on the above, the analysis matches Oracle Bug 18694414 exactly. Here is the call stack described for that bug:


ksedsts()+465<-kfdpHbeatblk_transfer()+953<-kfdpHbeat_doCheck()+662<-kfdpHbeat
@ _timeout()+854<-kfdp_timeoutBg()+1642<-ksbcti()+513<-ksbabs()+1735<-ksbrdp()+1
@ 045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+
@ 265<-main()+201<-__libc_start_main()+253
According to the Oracle MOS note, it is recommended to raise the _asm_hbeatiowait parameter to 185: ASM's default wait for PST heartbeat I/O is only 15 seconds (hence the repeated "Waited 15 secs for write IO to PST disk" warnings above), while the default disk timeout on Linux is 180 seconds, as shown below:


[root@tysy1 ~]# cat /sys/block/sdb/device/timeout
180
[root@tysy1 ~]#
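Because the 15-second heartbeat wait is far below this 180-second SCSI timeout, a transient I/O stall can outlive the heartbeat and trigger the forced dismount. A sketch of the change follows: run it as SYSASM in the ASM instance, and note that as a hidden underscore parameter it should only be set under Oracle Support guidance, with a rolling restart of the ASM instances for it to take effect.

```sql
-- Raise the PST heartbeat I/O wait above the 180s Linux SCSI timeout,
-- per the workaround for Bug 18694414. Requires an ASM restart.
alter system set "_asm_hbeatiowait"=185 scope=spfile sid='*';
```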