Monday, September 22, 2014

Troubleshooting OEM12c

This is a fairly common error when configuring the OMS server, or during maintenance of the server.
The behavior looks something like this:
[oracle@oem12c bin]$./emctl status oms
WebTier is Up
Oracle Management Server is Down

Even a couple of start/stop attempts may not be enough to bring the OMS up:

$./emctl start oms
Starting WebTier...
WebTier Successfully Started
Starting Oracle Management Server...
Oracle Management Server is Down



In emctl.log we can see errors like the following:
- Failed to connect to em login page
- Encountered error while hitting page. Status codes are 404 for console and 500 for PBS
- Connection exception when trying to hit OMS page. OMS is down


So all you have to do is follow this sequence to bring it up (a consolidated sketch follows the list):
- Stop the OMS forcibly:
       emctl stop oms -force
- Bring up the repository database listener first.
- Check the listener services:
       lsnrctl services
- Next, bring the repository database up.
- Check connectivity for the SYSMAN user from SQL*Plus; resolve any password or account-lock problems first.
- Then start the OMS:
       emctl start oms
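
A minimal end-to-end sketch of that sequence from the command line, assuming a simple single-listener setup (the ORACLE_HOME, OMS home, SID, and paths below are assumptions; adjust them to your environment):

# --- On the repository database host (paths and SID are assumptions) ---
export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
export ORACLE_SID=emrep
$ORACLE_HOME/bin/lsnrctl start                  # bring the repository listener up
$ORACLE_HOME/bin/lsnrctl services               # confirm the repository service is registered

$ORACLE_HOME/bin/sqlplus -S / as sysdba <<'EOF'
startup
-- confirm SYSMAN is usable; unlock/reset it only if needed
select username, account_status from dba_users where username = 'SYSMAN';
EOF

# --- On the OMS host (the OMS home path is an assumption) ---
export OMS_HOME=/u01/app/oracle/Middleware/oms
$OMS_HOME/bin/emctl stop oms -force             # force-stop any half-started OMS
$OMS_HOME/bin/emctl start oms
$OMS_HOME/bin/emctl status oms                  # should now report the OMS as Up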



Now it starts successfully, and you should be able to access the login page.

Thanks for reading the post!







Tuesday, July 8, 2014

Oracle 11gR2 RAC “oc4j” and “gsd” resources offline



Oracle RAC GSD process

GSD (Global Services Daemon) is a clusterware process that was used in Oracle RAC 9i.
It runs on each node and coordinates with the cluster manager to receive requests from clients such as the DBCA, EM and the SRVCTL utility to execute tasks such as instance startup or shutdown.

With Oracle Clusterware 10g and 11g, the GSD resource has no function and is disabled by default.
It only needs to be enabled in Oracle 10g or 11g if any Oracle 9i RAC components are present.
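
Before changing anything, it helps to look at just these two resources. A quick check, sketched for an 11.2 Grid Infrastructure environment (run as the Grid/Oracle software owner):

crsctl status resource ora.gsd        # TARGET/STATE of the gsd resource
crsctl status resource ora.oc4j       # TARGET/STATE of the oc4j resource
srvctl status oc4j                    # srvctl view of the oc4j resource
srvctl status nodeapps                # includes per-node GSD status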


[oracle@rac2 ~]$ crs_stat -t
Name           Type           Target    State     Host       
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    rac1        
ora....ER.lsnr ora....er.type ONLINE    ONLINE    rac1       
ora....N1.lsnr ora....er.type ONLINE    ONLINE    rac2       
ora.asm        ora.asm.type   ONLINE    ONLINE    rac1       
ora.eons       ora.eons.type  ONLINE    ONLINE    rac1       
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE              
ora....network ora....rk.type ONLINE    ONLINE    rac1       
ora.oc4j       ora.oc4j.type  OFFLINE   OFFLINE              
ora.ons        ora.ons.type   ONLINE    ONLINE    rac1       
ora.orcl.db    ora....se.type ONLINE    ONLINE    rac1       
ora....SM1.asm application    ONLINE    ONLINE    rac1       
ora....C1.lsnr application    ONLINE    ONLINE    rac1       
ora.rac1.gsd   application    OFFLINE   OFFLINE              
ora.rac1.ons   application    ONLINE    ONLINE    rac1       
ora.rac1.vip   ora....t1.type ONLINE    ONLINE    rac1       
ora....SM2.asm application    ONLINE    ONLINE    rac2       
ora....C2.lsnr application    ONLINE    OFFLINE              
ora.rac2.gsd   application    OFFLINE   OFFLINE              
ora.rac2.ons   application    ONLINE    ONLINE    rac2       
ora.rac2.vip   ora....t1.type ONLINE    ONLINE    rac2       
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    rac2   

OC4J Resource Troubleshooting
[oracle@rac1 ~]$ srvctl start oc4j
OC4J could not be started
PRCR-1079 : Failed to start resource ora.oc4j
CRS-2501: Resource 'ora.oc4j' is disabled
We need to enable and start the resource:
[oracle@rac1 ~]$ srvctl enable oc4j
[oracle@rac1 ~]$ srvctl start oc4j
Check the resource status
[oracle@rac1 ~]$ crs_stat -t|grep oc4j
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    rac1    

Now take a look at our GSD resource
[oracle@rac1 ~]$ srvctl status nodeapps
VIP rac1-vip is enabled
VIP rac1-vip is running on node: rac1
VIP rac2-vip is enabled
VIP rac2-vip is running on node: rac2
Network is enabled
Network is running on node: rac1
Network is running on node: rac2
GSD is disabled
GSD is not running on node: rac1
GSD is not running on node: rac2
ONS is enabled
ONS daemon is running on node: rac1
ONS daemon is running on node: rac2
eONS is enabled
eONS daemon is running on node: rac1
eONS daemon is running on node: rac2  
From the above output it is evident that GSD is disabled, so enable it and start it:
[oracle@rac1 ~]$ srvctl enable nodeapps
PRKO-2414 : GSD is already enabled on node(s): rac1,rac2
PRKO-2415 : VIP is already enabled on node(s): rac1,rac2
PRKO-2416 : Network resource is already enabled.
PRKO-2417 : ONS is already enabled on node(s): rac1,rac2
PRKO-2418 : eONS is already enabled on node(s): rac1,rac2
[oracle@rac1 archive]$ srvctl start nodeapps
PRKO-2421 : Network resource is already started on node(s): rac1,rac2
PRKO-2420 : VIP is already started on node(s): rac1,rac2
PRKO-2420 : VIP is already started on node(s): rac1,rac2
PRKO-2422 : ONS is already started on node(s): rac1,rac2
PRKO-2423 : eONS is already started on node(s): rac1,rac2


Finally, check the resource status:
[oracle@rac1 ~]$ crs_stat -t
Name           Type           Target    State     Host       
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    rac1       
ora....ER.lsnr ora....er.type ONLINE    ONLINE    rac1       
ora....N1.lsnr ora....er.type ONLINE    ONLINE    rac2       
ora.asm        ora.asm.type   ONLINE    ONLINE    rac1       
ora.eons       ora.eons.type  ONLINE    ONLINE    rac1       
ora.gsd        ora.gsd.type   ONLINE    ONLINE    rac1       
ora....network ora....rk.type ONLINE    ONLINE    rac1       
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    rac1       
ora.ons        ora.ons.type   ONLINE    ONLINE    rac1       
ora.orcl.db    ora....se.type ONLINE    ONLINE    rac1       
ora....SM1.asm application    ONLINE    ONLINE    rac1       
ora....C1.lsnr application    ONLINE    ONLINE    rac1       
ora.rac1.gsd   application    ONLINE    ONLINE    rac1       
ora.rac1.ons   application    ONLINE    ONLINE    rac1       
ora.rac1.vip   ora....t1.type ONLINE    ONLINE    rac1       
ora....SM2.asm application    ONLINE    ONLINE    rac2       
ora....C2.lsnr application    ONLINE    ONLINE    rac2       
ora.rac2.gsd   application    ONLINE    ONLINE    rac2       
ora.rac2.ons   application    ONLINE    ONLINE    rac2       
ora.rac2.vip   ora....t1.type ONLINE    ONLINE    rac2       
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    rac2       

Saturday, June 7, 2014

Creating SQL_BASELINE on Problematic Queries.

Scenario: I have a SQL query that runs fine in the QA environment but does not perform well in the PROD environment.
Or a query that has worked fine for years has started giving trouble in the last 2-3 days.

Well, as a DBA everybody will face a scenario like this!
For all these kinds of scenarios, the solution is creating a SQL_BASELINE on those problematic SQLs.

But How...?
Here is the Answer :

Find out the snap intervals in which your query executed fine earlier, from AWR, using:

select s.begin_interval_time, s.end_interval_time , q.snap_id, q.dbid, q.sql_id, q.plan_hash_value, q.optimizer_cost, q.optimizer_mode, q.CPU_TIME_TOTAL,q.BUFFER_GETS_TOTAL, q.DISK_READS_TOTAL
from dba_hist_sqlstat q, dba_hist_snapshot s
where q.sql_id in ('fjadcdbwun2q9')
and q.snap_id = s.snap_id
and s.begin_interval_time between sysdate-14 and sysdate
order by s.snap_id desc;

O/P :- (columns: begin_interval_time, end_interval_time, snap_id, dbid, sql_id, plan_hash_value, optimizer_cost, optimizer_mode, cpu_time_total, buffer_gets_total, disk_reads_total)
6/4/2014 3:00:53.428 PM        6/4/2014 4:00:56.263 PM        8487    4129595089    fjadcdbwun2q9    1051222664    200    ALL_ROWS    3810000        16581    954
6/4/2014 3:00:53.428 PM        6/4/2014 4:00:56.263 PM        8487    4129595089    fjadcdbwun2q9    345059072    200    ALL_ROWS    5306540000    32366    3897
6/4/2014 2:00:50.381 PM        6/4/2014 3:00:53.428 PM        8486    4129595089    fjadcdbwun2q9    345059072    200    ALL_ROWS    5061430000    29239    2367
6/4/2014 2:00:50.381 PM        6/4/2014 3:00:53.428 PM        8486    4129595089    fjadcdbwun2q9    1051222664    200    ALL_ROWS    3810000        16581    954
6/4/2014 1:00:47.588 PM        6/4/2014 2:00:50.381 PM        8485    4129595089    fjadcdbwun2q9    345059072    200    ALL_ROWS    2907840000    28955    2288
6/4/2014 1:00:47.588 PM        6/4/2014 2:00:50.381 PM        8485    4129595089    fjadcdbwun2q9    1051222664    200    ALL_ROWS    3810000        16581    954
6/4/2014 11:00:41.248 AM    6/4/2014 12:00:45.552 PM    8483    4129595089    fjadcdbwun2q9    1051222664    200    ALL_ROWS    3810000        16581    954
6/4/2014 11:00:41.248 AM    6/4/2014 12:00:45.552 PM    8483    4129595089    fjadcdbwun2q9    345059072    200    ALL_ROWS    1364880000    13303    1508
6/4/2014 10:00:37.885 AM    6/4/2014 11:00:41.248 AM    8482    4129595089    fjadcdbwun2q9    345059072    200    ALL_ROWS    235030000    2347    0
6/4/2014 10:00:37.885 AM    6/4/2014 11:00:41.248 AM    8482    4129595089    fjadcdbwun2q9    1051222664    200    ALL_ROWS    3810000        16581    954

Ensure your intervals:

Begin_Snap_ID   : 8486
End_Snap_ID     : 8487
SQL_ID          : fjadcdbwun2q9
PLAN_HASH_VALUE : 1051222664 (this is the value we are going to force, because within the same interval we have two plan hash values and this one performs best)

Step 1: First create an STS (SQL Tuning Set) for loading the plan into.

SQL> BEGIN
DBMS_SQLTUNE.CREATE_SQLSET(
sqlset_name => 'MySTS01',
description => 'SQL Tuning Set for loading plan into SQL Plan Baseline');
END;
/

PL/SQL procedure successfully completed.

Step 2: Populate our info into the STS using the block below.

SQL> DECLARE
  2  cur sys_refcursor;
  3  BEGIN
  4  OPEN cur FOR
  5  SELECT VALUE(P)
  6  FROM TABLE(
  7  dbms_sqltune.select_workload_repository(begin_snap=>'8486', end_snap=>'8487',basic_filter=>'sql_id = ''fjadcdbwun2q9''',attribute_list=>'ALL')) p;
  8  DBMS_SQLTUNE.LOAD_SQLSET( sqlset_name=> 'MySTS01', populate_cursor=>cur);
  9  CLOSE cur;
 10  END;
 11  /

PL/SQL procedure successfully completed.


##### Cross-verify that the SQL Tuning Set has been populated:
SELECT first_load_time,
       executions AS execs,
       parsing_schema_name,
       elapsed_time / 1000000 AS elapsed_time_secs,
       cpu_time / 1000000 AS cpu_time_secs,
       buffer_gets,
       disk_reads,
       direct_writes,
       rows_processed,
       fetches,
       optimizer_cost,
       sql_plan,
       plan_hash_value,
       sql_id,
       sql_text
FROM   TABLE(DBMS_SQLTUNE.SELECT_SQLSET(sqlset_name => 'MySTS01'));


O/P :- (columns as in the SELECT list above)

    4    EROCKS    762.186098    245.11    3127    1530    1778850    0    4    200    (DATASET)    345059072    fjadcdbwun2q9    (HUGECLOB)
    0    EROCKS    0            0        0        0        0        0    0    200    (DATASET)    1051222664    fjadcdbwun2q9    (HUGECLOB)
#########


Step 3: From the above we have two different plan hash values in those AWR snap intervals; we are going to force the better one using:
SQL> DECLARE
my_plans pls_integer;
BEGIN
my_plans := DBMS_SPM.LOAD_PLANS_FROM_SQLSET(
sqlset_name => 'MySTS01',
basic_filter=>'plan_hash_value = ''1051222664'''
);
END;
/
PL/SQL procedure successfully completed.

Again, we need to cross-verify using these two queries:

SELECT * FROM dba_sql_plan_baselines ;
SELECT enabled,accepted FROM dba_sql_plan_baselines ;

O/P :-

YES        YES

############################################
Cross-verify that the problematic query is now using our desired plan hash value, using explain plan (a sketch follows below).
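
One hedged way to do that final cross-check (the sql_handle below is a hypothetical placeholder; take the real handle from dba_sql_plan_baselines):

SQL> SELECT sql_handle, plan_name, enabled, accepted
     FROM   dba_sql_plan_baselines
     WHERE  created > SYSDATE - 1;

SQL> SELECT * FROM TABLE(
         DBMS_XPLAN.DISPLAY_SQL_PLAN_BASELINE(
             sql_handle => 'SQL_1234abcd5678ef90',   -- hypothetical handle from the query above
             format     => 'TYPICAL'));

SQL> EXPLAIN PLAN FOR
     -- paste the original problematic statement (sql_id fjadcdbwun2q9) here
     SELECT ... ;

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

If the baseline is in effect, the explain plan output should show plan hash value 1051222664 and a Note section saying that a SQL plan baseline was used for the statement.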







Sunday, April 13, 2014

Top 5 issues for Instance Eviction {Doc ID 1374110.1}




Issue #1  The alert.log shows ora-29740 as a reason for instance crash/eviction

Symptoms:

An instance crashes and the alert.log shows "ORA-29740: evicted by member ..." error.

Possible causes:

An ORA-29740 error occurs when an instance evicts another instance in a RAC database.  The instance that gets evicted reports an ORA-29740 error in its alert.log.
Reasons for this include a communications error in the cluster, failure to issue a heartbeat to the control file, and others.

Checking the lmon trace files of all instances is very important to determine the reason code.  Look for the line with "kjxgrrcfgchk: Initiating reconfig" (a grep sketch follows the list below).
This will give a reason code such as "kjxgrrcfgchk: Initiating reconfig, reason 3".  Most ORA-29740 evictions are due to reason 3, which means "Communications Failure", and that in turn is commonly caused by:

a) Network problems.
b) Resource starvation (CPU, I/O, etc.).
c) Severe contention in the database.
d) An Oracle bug.
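
A quick way to pull that reconfiguration reason out of the LMON trace files (the diagnostic destination and names below are assumptions; substitute your own db_unique_name and instance name):

# scan the LMON trace of each instance for the reconfiguration reason code
grep -i "kjxgrrcfgchk: Initiating reconfig" \
  $ORACLE_BASE/diag/rdbms/<db_unique_name>/<instance_name>/trace/*_lmon_*.trc

# show a little context around each hit (which member, which reason)
grep -B2 -A10 "Initiating reconfig" \
  $ORACLE_BASE/diag/rdbms/<db_unique_name>/<instance_name>/trace/*_lmon_*.trc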

Solutions:

1) Check the network and make sure there are no network errors such as UDP errors or IP packet loss/failure errors (a sketch of the corresponding OS-level checks follows this list).
2) Check the network configuration to make sure that all network settings are correct on all nodes.
   For example, the MTU size must be the same on all nodes, and the switch must support an MTU of 9000 if jumbo frames are used.
3) Check whether the server had a CPU load problem or a shortage of free memory.
4) Check whether the database was hanging or having a severe performance problem prior to the instance eviction.
5) Check CHM (Cluster Health Monitor) output to see if the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes.
6) Having OSWatcher output is helpful when CHM output is not available.
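
A rough sketch of those OS-level checks on Linux (the interface name eth1 and the CHM time window are assumptions; run them on every node):

# 1) network errors: UDP drops/errors and interface-level errors
netstat -su                          # look for "packet receive errors" under the Udp section
ip -s link show eth1                 # RX/TX errors, drops, and the MTU of the private interconnect

# 2) MTU must match on all nodes (and on the switch, if jumbo frames are used)
ip link show eth1 | grep -o 'mtu [0-9]*'

# 3) CPU load and free memory
vmstat 5 5
free -m
top -b -n 1 | head -20

# 5) CHM output around the eviction time (11.2 Grid Infrastructure)
oclumon dumpnodeview -allnodes -last "00:15:00"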

Issue #2  The alert.log shows "ipc send timeout" error before the instance crashes or is evicted

Symptoms:

An instance is evicted with alert.log showing many "IPC send timeout" errors.  This message normally accompanies a database performance problem.

Possible causes:

In RAC, processes like lmon, lmd, and lms constantly talk to their counterparts in other instances.  The lmd0 process is responsible for managing enqueues, while the lms processes are responsible for managing data block resources and transferring data blocks to support cache fusion.  When one or more of these processes is stuck, spinning, or extremely busy with the load, they can cause the "IPC send timeout" error.

Another cause of the "IPC send timeout" error reported by the lmon, lms, and lmd processes is a network problem or a server resource (CPU and memory) issue.  Those processes may not get scheduled to run on the CPU, or the network packets sent by those processes may get lost.

The communication problem involving lmon, lmd, and lms processes causes an instance eviction.  The alert.log of the evicting instance shows messages similar to

IPC Send timeout detected.Sender: ospid 1519
Receiver: inst 8 binc 997466802 ospid 23309

If an instance is evicted, the "IPC Send timeout detected" message in the alert.log is normally followed by other issues like ORA-29740 and "Waiting for clusterware split-brain resolution".

Solutions:

1) Check the network and make sure there are no network errors such as UDP errors or IP packet loss/failure errors (the OS-level check sketch under Issue #1 applies here as well).
2) Check the network configuration to make sure that all network settings are correct on all nodes.
   For example, the MTU size must be the same on all nodes, and the switch must support an MTU of 9000 if jumbo frames are used.
3) Check whether the server had a CPU load problem or a shortage of free memory.
4) Check whether the database was hanging or having a severe performance problem prior to the instance eviction.
5) Check CHM (Cluster Health Monitor) output to see if the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes.
6) Having OSWatcher output is helpful when CHM output is not available.

Issue #3  The problem instance was hanging before the instance crashes or is evicted

Symptoms:

The instance or database was hanging before the instance crashed or was evicted.  It could also be that the node was hanging.

Possible causes:

Different processes such as lmon, lmd, and lms communicate with corresponding processes on other instances, so when the instance or database hangs, those processes may be waiting for a resource such as a latch, an enqueue, or a data block.  Processes that are waiting cannot respond to network pings or send any communication over the network to the remote instances.  As a result, the other instances evict the problem instance.

You may see a message similar to the following in the alert.log of the instance that is evicting another instance:
Remote instance kill is issued [112:1]: 8
or
Evicting instance 2 from cluster

Solutions:

1) Find out the reason for the database or instance hang. Getting a global system state dump and global hang analyze output is critical when troubleshooting a database or instance hang (an oradebug sketch follows this list). If a global system state dump cannot be obtained, get a local system state dump from all instances at around the same time.
2) Check CHM (Cluster Health Monitor) output to see if the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes.
3) Having OSWatcher output is helpful when CHM output is not available.
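
A minimal sketch of collecting those dumps with oradebug (run in SQL*Plus as SYSDBA on one instance; the levels shown are commonly used values, so treat them as an assumption and follow Oracle Support guidance for your case):

SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3          -- global hang analysis across all RAC instances
SQL> oradebug -g all dump systemstate 258   -- global system state dump (level 258 adds short stacks)
-- wait a minute or two and repeat the two dumps so the change over time is visible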

Issue #4  The alert.log shows "Waiting for clusterware split-brain resolution" before one or more instances crash or are evicted

Symptoms:

Before one or more instances crash, the alert.log shows "Waiting for clusterware split-brain resolution".  This is often followed by "Evicting instance n from cluster", where n is the number of the instance that is getting evicted.

Possible causes:

The lmon process sends a network ping to remote instances, and if the lmon processes on the remote instances do not respond, a split brain at the instance level has occurred.  Therefore, finding out why the lmon processes cannot communicate with each other is important in resolving this issue.

The common causes are:
1) The instance-level split brain is frequently caused by a network problem, so checking the network settings and connectivity is important.  However, since the clusterware (CRS) would have failed if the network were down, the network is likely not down as long as both CRS and the database use the same network.
2) The server is very busy and/or the amount of free memory is low -- heavy swapping and scanning of memory will prevent lmon processes from getting scheduled.
3) The database or instance is hanging and the lmon process is stuck.
4) An Oracle bug.

The above causes are similar to the causes for issue #1 (the alert.log shows ORA-29740 as a reason for instance crash/eviction).

Solutions:

The solutions here are similar to those for issue #1.

1) Check the network and make sure there are no network errors such as UDP errors or IP packet loss/failure errors (the OS-level check sketch under Issue #1 applies here as well).
2) Check the network configuration to make sure that all network settings are correct on all nodes.
   For example, the MTU size must be the same on all nodes, and the switch must support an MTU of 9000 if jumbo frames are used.
3) Check whether the server had a CPU load problem or a shortage of free memory.
4) Check whether the database was hanging or having a severe performance problem prior to the instance eviction.
5) Check CHM (Cluster Health Monitor) output to see if the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes.
6) Having OSWatcher output is helpful when CHM output is not available.

Issue #5  The problem instance is killed by CRS because another instance tried to evict the problem instance and could not evict it.

Symptoms:

When an instance evicts another instance, all instances wait until the problem instance shuts itself down, but if the problem instance does not terminate for any reason,
the instance that initiated the eviction issues a member kill request.  The member kill request asks CRS to kill the problem instance.  This feature is available in 11.1 and higher.

Possible causes:

The alert.log of the instance that is asking CRS to kill the problem instance shows
Remote instance kill is issued [112:1]: 8

For example, the above message means that a member kill request for instance 8 has been sent to CRS.

The problem instance is hanging for some reason and is not responsive.  This could be due to the node having a CPU or memory problem, so that the processes of the problem instance are not getting scheduled to run on the CPU.

The second common cause is severe contention in the database, which prevents the problem instance from realizing that the remote instances have evicted it.

Another cause could be one or more processes surviving the "shutdown abort" when the instance tries to abort itself.  Unless all of the instance's processes are killed, CRS does not consider the instance terminated and will not inform the other instances that the problem instance has aborted.  One common cause of this is that one or more processes become defunct and do not terminate (a quick check for such leftover processes is sketched after the log excerpt below).
This leads to a recycle of CRS, either through a node reboot or a rebootless restart of CRS (the node does not get rebooted, but CRS gets restarted).
In this case, the alert.log of the problem instance shows:
Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 23305
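
A quick, hedged check for leftover or defunct background processes after a failed instance termination (the instance name ORCL2 is an assumption):

# background processes of the supposedly terminated instance still running?
ps -ef | grep '[o]ra_.*ORCL2'

# any defunct (zombie) processes left behind?
ps -ef | grep '[d]efunct'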

Solutions:

1) Find out the reason for the database or instance hang. Getting a global system state dump and global hang analyze output is critical when troubleshooting a database or instance hang (the oradebug sketch under Issue #3 applies here as well). If a global system state dump cannot be obtained, get a local system state dump from all instances at around the same time.
2) Check CHM (Cluster Health Monitor) output to see if the server had a CPU or memory load problem, a network problem, or spinning lmd or lms processes.
3) Having OSWatcher output is helpful when CHM output is not available.