Sunday, April 13, 2014

Top 5 issues for Instance Eviction {Doc ID 1374110.1}




Issue #1  The alert.log shows ORA-29740 as a reason for instance crash/eviction

Symptoms:

An instance crashes and the alert.log shows "ORA-29740: evicted by member ..." error.

Possible causes:

An ORA-29740 error occurs when one instance evicts another instance in a RAC database.  The evicted instance reports the ORA-29740 error in its alert.log.
Possible reasons include a communications error in the cluster and a failure to issue a heartbeat to the control file, among others.

Checking the lmon trace files of all instances is very important for determining the reason code.  Look for the line containing "kjxgrrcfgchk: Initiating reconfig".
This line gives a reason code, for example "kjxgrrcfgchk: Initiating reconfig, reason 3".  Most ORA-29740 evictions are due to reason 3, which means "Communications Failure".  Common causes of such a failure are:

a) Network Problems.
b) Resource Starvation (CPU, I/O, etc..)
c) Severe Contention in Database.
d) An Oracle bug.
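
The lmon trace check described above can be scripted. The following is a minimal sketch in Python, assuming the trace lines have already been read into a list of strings. The function name find_reconfig_reasons is hypothetical, and only reason 3 is mapped to a description because it is the only code this note explains.

```python
import re

# Matches lines such as "kjxgrrcfgchk: Initiating reconfig, reason 3".
RECONFIG_RE = re.compile(r"kjxgrrcfgchk: Initiating reconfig, reason (\d+)")

# Only reason 3 is described in this note; other codes are left unmapped.
REASON_TEXT = {3: "Communications Failure"}


def find_reconfig_reasons(lines):
    """Return (line_number, reason_code, description) for each reconfig line."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = RECONFIG_RE.search(line)
        if match:
            code = int(match.group(1))
            text = REASON_TEXT.get(code, "not covered in this note")
            hits.append((lineno, code, text))
    return hits


if __name__ == "__main__":
    # Hypothetical trace excerpt; real lmon traces contain many more lines.
    sample = [
        "some unrelated trace line",
        "kjxgrrcfgchk: Initiating reconfig, reason 3",
    ]
    for lineno, code, text in find_reconfig_reasons(sample):
        print(f"line {lineno}: reason {code} ({text})")
```

Running the same function over the lmon trace of every instance makes it easier to compare which instance initiated the reconfiguration and with which reason code.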

Solutions:

1) Check the network and make sure there are no network errors such as UDP errors, IP packet loss, or other failures.
2) Check the network configuration to make sure that it is set up correctly on all nodes.
   For example, the MTU size must be the same on all nodes, and the switch must support an MTU size of 9000 if jumbo frames are used.
3) Check whether the server had a CPU load problem or a free memory shortage.
4) Check whether the database was hanging or had a severe performance problem prior to the instance eviction.
5) Check the CHM (Cluster Health Monitor) output to see whether the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes.
6) OSWatcher output is helpful when CHM output is not available.
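
The MTU check in step 2 can be sketched as follows. This is only an illustration: the node names are hypothetical, and the MTU values are assumed to have been collected beforehand for the private interconnect interface of each node (on Linux, for example, from /sys/class/net/<interface>/mtu).

```python
def find_mtu_mismatches(mtu_by_node):
    """Group nodes by MTU; return {} when every node uses the same value."""
    groups = {}
    for node, mtu in mtu_by_node.items():
        groups.setdefault(mtu, []).append(node)
    return groups if len(groups) > 1 else {}


if __name__ == "__main__":
    # Hypothetical values: node2 was left at the default MTU of 1500.
    collected = {"node1": 9000, "node2": 1500, "node3": 9000}
    mismatches = find_mtu_mismatches(collected)
    if mismatches:
        for mtu, nodes in mismatches.items():
            print(f"MTU {mtu}: {', '.join(nodes)}")
    else:
        print("MTU is consistent on all nodes")
```

The sketch only compares the nodes with each other. Whether the switch supports the configured MTU still has to be verified separately.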

Issue #2  The alert.log shows "IPC send timeout" error before the instance crashes or is evicted

Symptoms:

An instance is evicted with alert.log showing many "IPC send timeout" errors.  This message normally accompanies a database performance problem.

Possible causes:

In RAC, processes such as lmon, lmd, and lms constantly talk to processes in other instances.  The lmd0 process is responsible for managing enqueues, while the lms processes are responsible for managing data block resources and transferring data blocks to support cache fusion.  When one or more of these processes are stuck, spinning, or extremely busy with the load, they can cause the "IPC send timeout" error.

Another cause of the "IPC send timeout" error reported by the lmon, lms, and lmd processes is a network problem or a server resource (CPU and memory) issue.  Those processes may not get scheduled to run on a CPU, or the network packets they send may get lost.

A communication problem involving the lmon, lmd, and lms processes causes an instance eviction.  The alert.log of the evicting instance shows messages similar to the following:

IPC Send timeout detected.Sender: ospid 1519
Receiver: inst 8 binc 997466802 ospid 23309

If an instance is evicted, the "IPC Send timeout detected" message in the alert.log is normally followed by other messages such as ORA-29740 and "Waiting for clusterware split-brain resolution".
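
When the alert.log contains many of these messages, it can help to extract the sender and receiver details. The following is a minimal sketch, assuming the messages follow the two-line layout shown above. The function name parse_ipc_timeouts is hypothetical.

```python
import re

# Patterns follow the two sample alert.log lines quoted in this note.
SENDER_RE = re.compile(r"IPC Send timeout detected\.\s*Sender: ospid (\d+)")
RECEIVER_RE = re.compile(r"Receiver: inst (\d+) binc (\d+) ospid (\d+)")


def parse_ipc_timeouts(lines):
    """Pair each sender line with the receiver line that follows it."""
    events = []
    sender_ospid = None
    for line in lines:
        sender = SENDER_RE.search(line)
        if sender:
            sender_ospid = int(sender.group(1))
            continue
        receiver = RECEIVER_RE.search(line)
        if receiver and sender_ospid is not None:
            events.append({
                "sender_ospid": sender_ospid,
                "receiver_inst": int(receiver.group(1)),
                "receiver_ospid": int(receiver.group(3)),
            })
            sender_ospid = None
    return events


if __name__ == "__main__":
    sample = [
        "IPC Send timeout detected.Sender: ospid 1519",
        "Receiver: inst 8 binc 997466802 ospid 23309",
    ]
    for event in parse_ipc_timeouts(sample):
        print(event)
```

Counting the events per receiver instance gives a quick view of which instance the senders could not reach, which is a reasonable starting point for the checks listed under Solutions.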

Solutions:

1) Check the network and make sure there are no network errors such as UDP errors, IP packet loss, or other failures.
2) Check the network configuration to make sure that it is set up correctly on all nodes.
   For example, the MTU size must be the same on all nodes, and the switch must support an MTU size of 9000 if jumbo frames are used.
3) Check whether the server had a CPU load problem or a free memory shortage.
4) Check whether the database was hanging or had a severe performance problem prior to the instance eviction.
5) Check the CHM (Cluster Health Monitor) output to see whether the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes.
6) OSWatcher output is helpful when CHM output is not available.
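
One way to approach the UDP error check in step 1 is to compare two snapshots of the operating system's UDP counters taken some time apart. The sketch below assumes Linux, where /proc/net/snmp contains a header line and a value line that both start with "Udp:". The function names and the sample snapshots are hypothetical, and other platforms expose these counters differently.

```python
def parse_udp_counters(snmp_text):
    """Turn the header/value pair of 'Udp:' lines into a name -> int dict."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        return {}
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def udp_error_growth(before_text, after_text):
    """Return the UDP '*Errors' counters that increased between two snapshots."""
    before = parse_udp_counters(before_text)
    after = parse_udp_counters(after_text)
    return {
        name: after[name] - before[name]
        for name in after
        if name.endswith("Errors") and name in before
        and after[name] > before[name]
    }


# Hypothetical snapshots, shortened to a few columns for illustration.
HEADER = "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
BEFORE = HEADER + "Udp: 1000 2 0 900 0 0\n"
AFTER = HEADER + "Udp: 1500 2 7 1400 7 0\n"

if __name__ == "__main__":
    print(udp_error_growth(BEFORE, AFTER))
```

Counters that keep growing on one node but not on the others point toward that node or its network path, although the counters alone do not identify the root cause.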

Issue #3  The problem instance was hanging before it crashed or was evicted

Symptoms:

The instance or database was hanging before the instance crashed or was evicted.  It is also possible that the node hung.

Possible causes:

Different processes such as lmon, lmd, and lms communicate with corresponding processes on other instances, so when the instance and database hang, those processes may be waiting for a resource such as a latch, an enqueue, or a data block.  While they are waiting, those processes cannot respond to the network ping or send any communication over the network to the remote instances.  As a result, the other instances evict the problem instance.

You may see a message similar to the following in the alert.log of the instance that is evicting another instance:
Remote instance kill is issued [112:1]: 8
or
Evicting instance 2 from cluster
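
To see quickly which instances were targeted, the two message forms above can be searched for in the alert.log of each surviving instance. The following is a minimal sketch, assuming the messages have the layout shown above and that the trailing number in the "Remote instance kill" message is the instance number. The function name find_evicted_instances is hypothetical.

```python
import re

# "Remote instance kill is issued [112:1]: 8" -> instance 8
KILL_RE = re.compile(r"Remote instance kill is issued \[[^\]]*\]: (\d+)")
# "Evicting instance 2 from cluster" -> instance 2
EVICT_RE = re.compile(r"Evicting instance (\d+) from cluster")


def find_evicted_instances(lines):
    """Return the sorted instance numbers named in kill or eviction messages."""
    evicted = set()
    for line in lines:
        for pattern in (KILL_RE, EVICT_RE):
            match = pattern.search(line)
            if match:
                evicted.add(int(match.group(1)))
    return sorted(evicted)


if __name__ == "__main__":
    sample = [
        "Remote instance kill is issued [112:1]: 8",
        "Evicting instance 2 from cluster",
    ]
    print(find_evicted_instances(sample))
```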

Solutions:

1) Find out the reason for the database or instance hang. A global system state dump and global hang analyze output are critical when troubleshooting a database or instance hang. If the global system state dump cannot be obtained, get local system state dumps from all instances at around the same time.
2) Check CHM (Cluster Health Monitor) output to see if the server had a CPU or memory load problem, network problem, or spinning lmd or lms processes.
3) Having OSWatcher output is helpful when CHM output is not available.

Issue #4  The alert.log shows "Waiting for clusterware split-brain resolution" before one or more instances crash or are evicted

Symptoms:

Before one or more instances crash, the alert.log shows "Waiting for clusterware split-brain resolution".  This is often followed by "Evicting instance n from cluster", where n is the number of the instance that is being evicted.

Possible causes:

The lmon process sends a network ping to the remote instances, and if the lmon processes on the remote instances do not respond, a split brain at the instance level has occurred.  Therefore, finding out why the lmon processes cannot communicate with each other is important in resolving this issue.

The common causes are:
1) The instance-level split brain is frequently caused by a network problem, so checking the network settings and connectivity is important.  However, because the clusterware (CRS) would have failed if the network were down, the network is likely not down as long as both CRS and the database use the same network.
2) The server is very busy and/or the amount of free memory is low; heavy swapping and scanning of memory will prevent the lmon processes from getting scheduled.
3) The database or instance is hanging and the lmon process is stuck.
4) An Oracle bug.

The above causes are similar to the causes for issue #1 (The alert.log shows ORA-29740 as a reason for instance crash/eviction).

Solutions:

The solutions here are similar to those for issue #1.

1) Check the network and make sure there are no network errors such as UDP errors, IP packet loss, or other failures.
2) Check the network configuration to make sure that it is set up correctly on all nodes.
   For example, the MTU size must be the same on all nodes, and the switch must support an MTU size of 9000 if jumbo frames are used.
3) Check whether the server had a CPU load problem or a free memory shortage.
4) Check whether the database was hanging or had a severe performance problem prior to the instance eviction.
5) Check the CHM (Cluster Health Monitor) output to see whether the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes.
6) OSWatcher output is helpful when CHM output is not available.

Issue #5  The problem instance is killed by CRS because another instance tried to evict the problem instance and could not evict it.

Symptoms:

When an instance evicts another instance, all instances wait until the problem instance shuts itself down, but if the problem instance does not terminate for any reason,
the instance that initiated the eviction issues a member kill request.  The member kill request asks CRS to kill the problem instance.  This feature is available in 11.1 and higher.

Possible causes:

The alert.log of the instance that is asking CRS to kill the problem instance shows
Remote instance kill is issued [112:1]: 8

For example, the above message means that a member kill request to kill instance 8 was sent to CRS.

The problem instance is hanging for some reason and is not responsive.  This could be because the node has a CPU or memory problem and the processes of the problem instance are not getting scheduled to run on a CPU.

The second common cause is severe contention in the database that prevents the problem instance from realizing that the remote instances have evicted it.

Another cause could be one or more processes surviving the "shutdown abort" when the instance tries to abort itself.  Unless all processes of the instance are killed, CRS does not consider the instance terminated and will not inform the other instances that the problem instance aborted.  One common reason for this is that one or more processes become defunct and do not terminate.
This leads to CRS being recycled, either through a node reboot or through a rebootless restart of CRS (the node is not rebooted, but CRS is restarted).
In this case, the alert.log of the problem instance shows
Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 23305
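
When the termination failure above is suspected, the process list of the node can be checked for defunct processes. The following is a minimal sketch, assuming Linux and assuming the text was captured with a command such as "ps -eo pid,ppid,stat,args", where a STAT value starting with "Z" marks a defunct process. The sample rows, the instance name in them, and the function name find_defunct are hypothetical.

```python
def find_defunct(ps_output):
    """Return (pid, ppid, command) for rows whose STAT column starts with 'Z'."""
    defunct = []
    for row in ps_output.splitlines()[1:]:  # skip the header row
        fields = row.split(None, 3)  # pid, ppid, stat, rest of the line
        if len(fields) == 4 and fields[2].startswith("Z"):
            defunct.append((int(fields[0]), int(fields[1]), fields[3]))
    return defunct


# Hypothetical capture; only the layout matters for this sketch.
SAMPLE = """\
  PID  PPID STAT COMMAND
23305     1 Ss   ora_lmon_orcl8
23410     1 Z    [oracle] <defunct>
"""

if __name__ == "__main__":
    for pid, ppid, command in find_defunct(SAMPLE):
        print(f"defunct pid {pid} (parent {ppid}): {command}")
```

A defunct process cannot be killed directly because it has already exited; it disappears only when its parent collects its exit status, which is consistent with the note's observation that CRS ends up being recycled in this situation.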

Solutions:

1) Find out the reason for the database or instance hang. A global system state dump and global hang analyze output are critical when troubleshooting a database or instance hang. If the global system state dump cannot be obtained, get local system state dumps from all instances at around the same time.
2) Check CHM (Cluster Health Monitor) output to see if the server had a CPU or memory load problem, network problem, or spinning lmd or lms processes.
3) Having OSWatcher output is helpful when CHM output is not available.