« 为何而心跳-Oracle Heartbeat研究 |
Blog首页
| 2006年 我们有更多的时间能浪费 »
Oracle10gR2 ASM 值得信赖么?
Oracle10g引入ASM以来,关于ASM的担心就从来没有停止过.因为ASM引入了一个新的ASM Instance,新的实例的稳定性一度成为了关注的焦点.
我们看一下这个新的ASM实例引入的后台进程:
[oracle@danaly bdump]$ ps -ef|grep ASM|grep -v grep oracle 3720 1 0 14:38 ? 00:00:00 asm_pmon_+ASM oracle 3722 1 0 14:38 ? 00:00:00 asm_psp0_+ASM oracle 3724 1 0 14:38 ? 00:00:00 asm_mman_+ASM oracle 3726 1 0 14:38 ? 00:00:00 asm_dbw0_+ASM oracle 3728 1 0 14:38 ? 00:00:00 asm_lgwr_+ASM oracle 3730 1 0 14:38 ? 00:00:00 asm_ckpt_+ASM oracle 3732 1 0 14:38 ? 00:00:00 asm_smon_+ASM oracle 3734 1 0 14:38 ? 00:00:00 asm_rbal_+ASM oracle 3736 1 0 14:38 ? 00:00:00 asm_gmon_+ASM oracle 3748 1 0 14:38 ? 00:00:00 asm_o000_+ASM oracle 3781 1 0 14:38 ? 00:00:00 oracle+ASM (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
|
这些进程和数据库的后台进程同样重要(甚至可以说更为重要),如果ASM实例Crash,数据库将会立刻中止.
今天这个10g的数据库就遇到了这样的问题.第一次,Oracle10gR2的ASM挂了。
检查数据库的alert文件,获得如下信息:
Thu Jan 19 13:34:11 2006 WARNING: inbound connection timed out (ORA-3136) Thu Jan 19 13:52:47 2006 Errors in file /opt/oracle/admin/danaly/bdump/danaly_asmb_4154.trc: ORA-15064: communication failure with ASM instance ORA-03113: end-of-file on communication channel Thu Jan 19 13:52:47 2006 ASMB: terminating instance due to error 15064 Instance terminated by ASMB, pid = 4154
|
看来是ASM实例出现问题,导致数据库down掉。
trace文件中记录了如下错误:
*** 2006-01-19 13:52:47.922 *** SERVICE NAME:(SYS$BACKGROUND) 2006-01-19 13:52:47.917 *** SESSION ID:(157.1) 2006-01-19 13:52:47.917 error 15064 detected in background process ORA-15064: communication failure with ASM instance ORA-03113: end-of-file on communication channel
|
毫无疑问,都是ASM惹的祸。
检查ASM的日志文件记录了如下错误,显示磁盘组故障导致ASM终止:
Thu Jan 19 13:52:47 2006 Errors in file /opt/oracle/admin/+ASM/bdump/+asm_gmon_3948.trc: ORA-29702: error occurred in Cluster Group Service operation ORA-29702: error occurred in Cluster Group Service operation Thu Jan 19 13:52:47 2006 GMON: terminating instance due to error 29702 Instance terminated by GMON, pid = 3948
|
trace文件里记录了如下信息:
*** 2006-01-19 13:52:47.912 2006-01-19 13:52:47.912: [ CSSCLNT]clsssRecvMsg: comm error received, comrc 11, con (0xcc9fb48), msg (0xbfffe5d8), msgl 144 2006-01-19 13:52:47.912: [ CSSCLNT]clssgsGroupGetStatus: clsssRecvMsg failed 3 -1073748296) 2006-01-19 13:52:47.912: [ CSSCLNT]clssgsGroupGetStatus: returning 8 kgxgnpstat: received ABORT event from CLSS Group services Error [NM abort event ] @ 28019:719 error 29702 detected in background process ORA-29702: error occurred in Cluster Group Service operation ORA-29702: error occurred in Cluster Group Service operation
|
启动数据库,经过恢复,成功加载ASM磁盘组,随之数据库得以正常启动:
Thu Jan 19 14:38:16 2006 SQL> ALTER DISKGROUP ALL MOUNT Thu Jan 19 14:38:16 2006 NOTE: cache registered group ORADG number=1 incarn=0x728b66fe Thu Jan 19 14:38:16 2006 Loaded ASM Library - Generic Linux, version 2.0.0 (KABI_V2) library for asmlib interface Thu Jan 19 14:38:16 2006 NOTE: Hbeat: instance first (grp 1) Thu Jan 19 14:38:20 2006 NOTE: start heartbeating (grp 1) NOTE: cache opening disk 0 of grp 1: VOL1 label:VOL1 Thu Jan 19 14:38:20 2006 NOTE: F1X0 found on disk 0 fcn 0.0 NOTE: cache opening disk 1 of grp 1: VOL2 label:VOL2 NOTE: cache opening disk 2 of grp 1: VOL3 label:VOL3 NOTE: cache opening disk 3 of grp 1: VOL4 label:VOL4 NOTE: cache mounting (first) group 1/0x728B66FE (ORADG) NOTE: starting recovery of thread=1 ckpt=17.8535 NOTE: advancing ckpt for thread=1 ckpt=17.8535 NOTE: cache recovered group 1 to fcn 0.401368 Thu Jan 19 14:38:21 2006 NOTE: opening chunk 1 at fcn 0.401368 ABA NOTE: seq=18 blk=8536 Thu Jan 19 14:38:21 2006 NOTE: cache mounting group 1/0x728B66FE (ORADG) succeeded SUCCESS: diskgroup ORADG was mounted Thu Jan 19 14:38:22 2006 NOTE: recovering COD for group 1/0x728b66fe (ORADG) SUCCESS: completed COD recovery for group 1/0x728b66fe (ORADG) Thu Jan 19 14:38:38 2006 Starting background process ASMB ASMB started with pid=12, OS id=3744
|
我们注意到,日志中几次出现heartbeat字样,原来heartbeat无处不在.
顺便把数据库的恢复信息也记录一下:
Thu Jan 19 14:38:43 2006 Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Picked latch-free SCN scheme 2 Using LOG_ARCHIVE_DEST_10 parameter default value as USE_DB_RECOVERY_FILE_DEST WARNING: db_recovery_file_dest is same as db_create_file_dest Autotune of undo retention is turned on. IMODE=BR ILAT =18 LICENSE_MAX_USERS = 0 SYS auditing is disabled ksdpec: called for event 13740 prior to event group initialization Starting up ORACLE RDBMS Version: 10.2.0.1.0. System parameters with non-default values: processes = 150 __shared_pool_size = 75497472 __large_pool_size = 4194304 __java_pool_size = 4194304 __streams_pool_size = 0 spfile = +ORADG/danaly/spfiledanaly.ora sga_target = 943718400 control_files = +ORADG/danaly/controlfile/current.256.600173845, +ORADG/danaly/controlfile/current.257.600173845 db_block_checksum = FULL db_block_size = 8192 __db_cache_size = 851443712 compatible = 10.2.0.1.0 db_file_multiblock_read_count= 16 db_create_file_dest = +ORADG db_recovery_file_dest = +ORADG db_recovery_file_dest_size= 2147483648 undo_management = AUTO undo_tablespace = UNDOTBS2 recyclebin = ON remote_login_passwordfile= EXCLUSIVE db_domain = dispatchers = (PROTOCOL=TCP) (SERVICE=danalyXDB) job_queue_processes = 10 background_dump_dest = /opt/oracle/admin/danaly/bdump user_dump_dest = /opt/oracle/admin/danaly/udump core_dump_dest = /opt/oracle/admin/danaly/cdump audit_file_dest = /opt/oracle/admin/danaly/adump db_name = danaly open_cursors = 300 pga_aggregate_target = 314572800 PMON started with pid=2, OS id=3752 PSP0 started with pid=3, OS id=3754 MMAN started with pid=4, OS id=3756 DBW0 started with pid=5, OS id=3758 LGWR started with pid=6, OS id=3760 CKPT started with pid=7, OS id=3762 SMON started with pid=8, OS id=3764 RECO started with pid=9, OS id=3766 CJQ0 started with pid=10, OS id=3768 MMON started with pid=11, OS id=3770 Thu Jan 19 14:38:44 2006 starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'... MMNL started with pid=12, OS id=3772 Thu Jan 19 14:38:44 2006 starting up 1 shared server(s) ... Thu Jan 19 14:38:44 2006 ALTER DATABASE MOUNT Thu Jan 19 14:38:44 2006 Starting background process ASMB ASMB started with pid=16, OS id=3779 Starting background process RBAL RBAL started with pid=17, OS id=3783 Loaded ASM Library - Generic Linux, version 2.0.0 (KABI_V2) library for asmlib interface Thu Jan 19 14:38:52 2006 SUCCESS: diskgroup ORADG was mounted Thu Jan 19 14:38:57 2006 Setting recovery target incarnation to 1 Thu Jan 19 14:38:57 2006 Successful mount of redo thread 1, with mount id 3945133236 Thu Jan 19 14:38:57 2006 Database mounted in Exclusive Mode Completed: ALTER DATABASE MOUNT Thu Jan 19 14:38:57 2006 ALTER DATABASE OPEN Thu Jan 19 14:38:57 2006 Beginning crash recovery of 1 threads parallel recovery started with 3 processes Thu Jan 19 14:38:58 2006 Started redo scan Thu Jan 19 14:38:58 2006 Completed redo scan 59 redo blocks read, 14 data blocks need recovery Thu Jan 19 14:38:58 2006 Started redo application at Thread 1: logseq 5834, block 72749 Thu Jan 19 14:38:58 2006 Recovery of Online Redo Log: Thread 1 Group 2 Seq 5834 Reading mem 0 Mem# 0 errs 0: +ORADG/danaly/onlinelog/group_2.260.600173851 Mem# 1 errs 0: +ORADG/danaly/onlinelog/group_2.261.600173853 Thu Jan 19 14:38:58 2006 Completed redo application Thu Jan 19 14:38:59 2006 Completed crash recovery at Thread 1: logseq 5834, block 72808, scn 16042818137 14 data blocks read, 14 data blocks written, 59 redo blocks read Thu Jan 19 14:39:00 2006 Thread 1 advanced to log sequence 5835 Thread 1 opened at log sequence 5835 Current log# 3 seq# 5835 mem# 0: +ORADG/danaly/onlinelog/group_3.262.600173855 Current log# 3 seq# 5835 mem# 1: +ORADG/danaly/onlinelog/group_3.263.600173857 Successful open of redo thread 1 Thu Jan 19 14:39:00 2006 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set Thu Jan 19 14:39:00 2006 SMON: enabling cache recovery Thu Jan 19 14:39:01 2006 Successfully onlined Undo Tablespace 15. Thu Jan 19 14:39:01 2006 SMON: enabling tx recovery Thu Jan 19 14:39:01 2006 Database Characterset is ZHS16GBK replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC QMNC started with pid=22, OS id=3796 Thu Jan 19 14:39:09 2006 db_recovery_file_dest_size of 2048 MB is 16.75% used. This is a user-specified limit on the amount of space that will be used by this database for recovery-related files, and does not reflect the amount of space available in the underlying filesystem or ASM diskgroup. Thu Jan 19 14:39:09 2006 Completed: ALTER DATABASE OPEN
|
这样的错误在Metalink上都被当做Bug处理,很难找到真正的原因,如果不能再现,就只能做为一次异常处理了.
历史上的今天...
>>
2012-01-19文章:
>>
2009-01-19文章:
>>
2008-01-19文章:
>>
2007-01-19文章:
>>
2005-01-19文章:
By eygle on 2006-01-19 18:07 |
Comments (0) |
Oracle12c/11g | 645 |
ASMOracle