Blog dedicated to Oracle Applications (E-Business Suite) Technology; covers Apps Architecture, Administration and third party bolt-ons to Apps

Tuesday, December 22, 2015

Calling all Apps DBAs doing 11i to R12.x upgrades

At this time of year, during the holidays, the Apps DBA community is busy doing upgrades because longer downtimes are possible. If you are facing any issues, please feel free to write to me at oracleappstechnology@gmail.com. I will be glad to hear from you and help you.

Wednesday, December 16, 2015

11i pre-upgrade data fix script ap_wrg_11i_chrg_alloc_fix.sql runs very slow

We are currently upgrading one of our ERP instances from 11.5.10.2 to R12.2.5. One of the pre-upgrade steps is to execute the data fix script ap_wrg_11i_chrg_alloc_fix.sql. However, this script had been running very slowly. After 4 weeks of monitoring, logging SRs with Oracle and escalating, we started a group chat today with our internal experts: Ali, Germaine, Aditya, Mukhtiar, Martha Gomez and Zoltan. I also invited our top-notch EBS techstack expert, John Felix. After running an explain plan on the SQL and looking at the rate of updates being done by the query, I predicted that it would take 65 days to complete.

John pointed out that the query was using the index AP_INVOICE_DISTRIBUTIONS_N4, which had a very high cost. We used a SQL profile that replaced AP_INVOICE_DISTRIBUTIONS_N4 with AP_INVOICE_DISTRIBUTIONS_U1. The query started running faster, and my new prediction was that it would complete in 5.45 days.

John mentioned that now another select statement was using the same index AP_INVOICE_DISTRIBUTIONS_N4 that had a very high cost.

After discussing among ourselves, we decided to drop the index, run the script and re-create the index. Aditya saved the definition of the index and dropped it.

DBMS_METADATA.GET_DDL('INDEX','AP_INVOICE_DISTRIBUTIONS_N4','AP')
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  CREATE INDEX "AP"."AP_INVOICE_DISTRIBUTIONS_N4" ON "AP"."AP_INVOICE_DISTRIBUTIONS_ALL" ("ACCOUNTING_DATE")
  PCTFREE 10 INITRANS 11 MAXTRANS 255 COMPUTE STATISTICS
  STORAGE(INITIAL 131072 NEXT 131072 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "APPS_TS_TX_IDX"

1 row selected.

SQL> drop index AP.AP_INVOICE_DISTRIBUTIONS_N4;

Index dropped.

The updates started happening blazing fast.  The whole thing got done in 39 minutes and we saw the much awaited:

SQL> set time on
16:34:16 SQL> @ap_wrg_11i_chrg_alloc_fix.sql
Enter value for resp_name: Payables Manager
Enter value for usr_name: 123456
-------------------------------------------------------------------------------
/erp11i/applcsf/temp/9570496-fix-16:34:40.html is the log file created
-------------------------------------------------------------------------------

PL/SQL procedure successfully completed.

17:13:36 SQL>

From 65 days to 5.45 days to 39 minutes.  Remarkable.  Thank you John for your correct diagnosis and solution.
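
The predictions above were extrapolations from how fast the script was updating rows. Here is a minimal sketch of that arithmetic in Python. The function name and the row counts are made up for illustration; they are not the figures from our instance.

```python
def estimate_days_remaining(rows_total, rows_done, elapsed_hours):
    """Extrapolate the remaining run time, in days, from progress so far."""
    if rows_done <= 0 or elapsed_hours <= 0:
        raise ValueError("need some completed work to extrapolate from")
    rows_per_hour = rows_done / elapsed_hours
    remaining_hours = (rows_total - rows_done) / rows_per_hour
    return remaining_hours / 24.0


# Hypothetical figures: 10 million rows to fix, 64,000 fixed in the first 10 hours.
print(round(estimate_days_remaining(10_000_000, 64_000, 10), 1))
```

Re-running the same calculation after each tuning change is a quick way to judge whether the change was worth it before waiting for the script to finish.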

Monday, November 16, 2015

sqlplus core dumps with segmentation fault error in OEL 6.6 when you connect to DB

We used an OEL 6.6 image in our latest build. When we cloned an EBS R12.2 instance from OEL 5.7 to this new OEL 6.6 server, adcfgclone.pl kept failing. On further checks, we discovered that sqlplus crashed with a segmentation fault whenever we tried to connect to the database:

sqlplus /nolog
conn apps/apps
Segmentation Fault

So I suggested that the DBAs run strace sqlplus apps/apps. The strace output revealed many missing libraries.

We had another working OEL 6.4 instance where we checked for these libraries, and all of them were present.
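
Picking the missing libraries out of a long strace output by eye is tedious. A rough Python sketch that lists the shared libraries strace reports as not found could look like this. The sample lines are hypothetical, written in the usual strace format, because the original trace was not kept; the function name is also my own.

```python
import re

# Matches failed open()/openat() calls that strace reports as ENOENT.
_ENOENT = re.compile(r'open(?:at)?\((?:[^"]*,\s*)?"([^"]+)".*=\s*-1\s+ENOENT')


def missing_libraries(strace_output):
    """Return shared-library paths that strace shows failing with ENOENT."""
    missing = []
    for line in strace_output.splitlines():
        match = _ENOENT.search(line)
        if match and ".so" in match.group(1) and match.group(1) not in missing:
            missing.append(match.group(1))
    return missing


# Hypothetical strace lines, for illustration only.
sample = (
    'open("/lib/libnss_sss.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)\n'
    'open("/lib/libc.so.6", O_RDONLY) = 3\n'
    'openat(AT_FDCWD, "/lib/libnsl.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)\n'
)
print(missing_libraries(sample))
```

The trace itself can be written to a file with strace -o and then read into this function.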

The locate command was used to locate the full directory paths of the missing libraries

locate libnss_sss.so.2
/lib/libnss_sss.so.2

/lib/libnss_sss.so.2
/lib/libnss_files.so.2
/lib/libociei.so
/lib/libc.so.6
/lib/libgcc_s.so.1
/lib/libnsl.so.1
/lib/libpthread.so.0

Then rpm -qf command was used to find out the rpm that would have the library:

$ rpm -qf /lib/libnss_sss.so.2
sssd-client-1.11.6-30.el6_6.3.i686
$ rpm -qf /lib/libnss_files.so.2
glibc-2.12-1.149.el6_6.9.i686
$ rpm -qf /lib/libociei.so
error: file /lib/libociei.so: No such file or directory
$ rpm -qf /lib/libc.so.6
glibc-2.12-1.149.el6_6.9.i686
$ rpm -qf /lib/libgcc_s.so.1
libgcc-4.4.7-3.el6.i686
$ rpm -qf /lib/libnsl.so.1
glibc-2.12-1.149.el6_6.9.i686
$ rpm -qf /lib/libpthread.so.0
glibc-2.12-1.149.el6_6.9.i686
$ rpm -qf /lib/libm.so.6
glibc-2.12-1.149.el6_6.9.i686
$ rpm -qf /lib/libdl.so.2
glibc-2.12-1.149.el6_6.9.i686

Since 10.1.2 home is 32-bit in EBS R12.1 and 12.2, all the libraries needed to be 32-bit.
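
One way to confirm whether a given library file is 32-bit or 64-bit is to read its ELF header: the fifth byte (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit objects. The sketch below does only that check; the function name is my own.

```python
def elf_class(path):
    """Return 32 or 64 for an ELF file, or None if the file is not ELF."""
    with open(path, "rb") as handle:
        header = handle.read(5)
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    # Byte 4 (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit objects.
    return {1: 32, 2: 64}.get(header[4])
```

For example, elf_class("/lib/libnss_sss.so.2") can be run against each path returned by locate to see which copies are usable by the 32-bit 10.1.2 home.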

Except for sssd-client, the other rpms were present. The 64-bit version of sssd-client was present, and whenever we tried to install the 32-bit rpm, yum would give this error because the 32-bit version it wanted to install did not match the 64-bit package already installed:

# yum install sssd-client.i686
Loaded plugins: security
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package sssd-client.i686 0:1.12.4-47.el6 will be installed
--> Finished Dependency Resolution
Error:  Multilib version problems found. This often means that the root
       cause is something else and multilib version checking is just
       pointing out that there is a problem. Eg.:

         1. You have an upgrade for sssd-client which is missing some
            dependency that another package requires. Yum is trying to
            solve this by installing an older version of sssd-client of the
            different architecture. If you exclude the bad architecture
            yum will tell you what the root cause is (which package
            requires what). You can try redoing the upgrade with
            --exclude sssd-client.otherarch ... this should give you an error
            message showing the root cause of the problem.

         2. You have multiple architectures of sssd-client installed, but
            yum can only see an upgrade for one of those arcitectures.
            If you don't want/need both architectures anymore then you
            can remove the one with the missing update and everything
            will work.

         3. You have duplicate versions of sssd-client installed already.
            You can use "yum check" to get yum show these errors.

       ...you can also use --setopt=protected_multilib=false to remove
       this checking, however this is almost never the correct thing to
       do as something else is very likely to go wrong (often causing
       much more problems).

       Protected multilib versions: sssd-client-1.12.4-47.el6.i686 != sssd-client-1.11.6-30.el6_6.4.x86_64


# rpm -qa | grep sssd-client
sssd-client-1.11.6-30.el6_6.4.x86_64

Eventually we installed it with the force option:

# rpm -Uvh --force /tmp/sssd-client-1.11.6-30.el6_6.3.i686.rpm

# rpm -qa | grep sssd-client
sssd-client-1.11.6-30.el6_6.3.i686
sssd-client-1.11.6-30.el6_6.4.x86_64
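
Note that after the forced install the two architectures are at different releases (el6_6.3 for i686 and el6_6.4 for x86_64), which is exactly what yum's multilib protection complains about. A small Python sketch to spot such mismatches in rpm -qa output follows. It assumes the usual name-version-release.arch naming; the function names are my own.

```python
def split_rpm_name(package):
    """Split 'name-version-release.arch' into its four parts."""
    rest, arch = package.rsplit(".", 1)
    name, version, release = rest.rsplit("-", 2)
    return name, version, release, arch


def version_mismatches(packages):
    """Return names of packages installed with more than one version-release."""
    builds = {}
    for package in packages:
        name, version, release, _arch = split_rpm_name(package)
        builds.setdefault(name, set()).add((version, release))
    return sorted(name for name, found in builds.items() if len(found) > 1)


installed = [
    "sssd-client-1.11.6-30.el6_6.3.i686",
    "sssd-client-1.11.6-30.el6_6.4.x86_64",
]
print(version_mismatches(installed))
```

Packages that legitimately keep several versions installed side by side would also be reported, so treat the result as a list of things to look at rather than a list of errors.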

pam-ldap was one of the other rpms that was installed for other missing libraries.  Surprisingly, sssd-client and pam-ldap rpms are not mentioned as pre-requisites in support.oracle.com article:
Oracle E-Business Suite Installation and Upgrade Notes Release 12 (12.2) for Linux x86-64 (Doc ID 1330701.1) 

vnc xterm window appears without header

The program twm is "the window manager" that is responsible for the window header that appears on top of terminal windows of xterm.  In newer OEL 6 builds, when xterm is launched, the header doesn't appear as twm fails to launch.  If you try launching twm, it gives this error and exits to unix prompt:

twm: unable to open fontset "-adobe-helvetica-bold-r-normal--*-120-*-*-*-*-*-*"

I found a solution on http://ubuntuforums.org/archive/index.php/t-1596636.html :

It was reported here for fedora: https://bugzilla.redhat.com/show_bug.cgi?id=509639. The workaround is to execute it with a specific shell variable:

$ DISPLAY=:<vnc display number>
$ export DISPLAY
$ xhost +
$ LANG=C
$ export LANG
$ twm &

twm launches fine after this.

Saturday, November 14, 2015

Oracle SSO Failure - Unable to process request Either the requested URL was not specified in terms of a fully-qualified host name or OHS single sign-on is incorrectly configured

Today, during a cutover in which we were moving one of our ERP instances from Cisco UCS VMware VMs to Exalogic and Exadata, I got a call from Bimal. The extranet iSupplier URL had been configured, but whenever any user logged in, they saw the following error instead of the iSupplier OAF Home page:

Oracle SSO Failure - Unable to process request Either the requested URL was not specified in terms of a fully-qualified host name or OHS single sign-on is incorrectly configured

A search on support.oracle.com showed many hits.  I went through a few of them and ruled out the solutions given. This article sounded promising: Oracle SSO Failure - Unable to process request Either the requested URL was not specified in terms of a fully-qualified host name or OHS single sign-on is incorrectly configured (Doc ID 1474474.1).

The solution suggested:

There is a hardware load-balancer for a multi-tier environment in place, as well as an SSL accelerator.

     For R12, there is a context variable, s_enable_sslterminator, that was set to "#".

     This should be null for e-Business R12 using the specific hardware mentioned before.


1. Set  context variable, s_enable_sslterminator to null,

2. Re-ran autoconfig,

3. Re-test Single sign-ons via IE and Firefox now works as expected.

I asked the DBAs to check the value of s_enable_sslterminator:

grep s_enable_sslterminator $CONTEXT_FILE

and sure enough the value was #

As per article Enabling SSL or TLS in Oracle E-Business Suite Release 12 (Doc ID 376700.1), the value of s_enable_sslterminator should be made null if you are using an SSL accelerator.  In our case we use SSL certificate on the Load Balancer and never on Web servers.

The DBAs removed the #
Ran autoconfig
Deregistered SSO
Registered SSO

The user was able to log in after that.
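
The same check can be scripted. The sketch below looks up a context variable by its oa_var attribute in the context file XML. The sample fragment is hypothetical (a real context file is far larger, and the element names shown here are assumed), and the function name is my own.

```python
import xml.etree.ElementTree as ET


def context_value(context_xml, variable):
    """Return the text of the element tagged oa_var=variable, or None if absent."""
    root = ET.fromstring(context_xml)
    for element in root.iter():
        if element.get("oa_var") == variable:
            return (element.text or "").strip()
    return None


# Hypothetical fragment; element names are assumed for illustration.
sample = """<oa_context>
  <oa_web_server>
    <sslterminator oa_var="s_enable_sslterminator">#</sslterminator>
  </oa_web_server>
</oa_context>"""

print(repr(context_value(sample, "s_enable_sslterminator")))
```

An empty string returned here corresponds to the null value that the note asks for when an SSL accelerator is in use.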



Wednesday, October 21, 2015

How To Install Latest Verisign G5 Root Certificates

Dhananjay pinged me today and told me that for their Paypal integration, they had to upgrade to Verisign G5 root certificate.  This was the message from Paypal:

Global security threats are constantly changing, and the security of our merchants continues to be our highest priority. To guard against current and future threats, we are encouraging our merchants to make the following upgrades to their integrations:
  1. Update your integration to support certificates using the SHA-256 algorithm. PayPal is upgrading SSL certificates on all Live and Sandbox endpoints from SHA-1 to the stronger and more robust SHA-256 algorithm.
  2. Discontinue use of the VeriSign G2 Root Certificate. In accordance with industry standards, PayPal will no longer honor secure connections that require the VeriSign G2 Root Certificate for trust validation. Only secure connection requests that are expecting our certificate/trust chain to be signed by the G5 Root Certificate will result in successful secure connections.
For detailed information on these changes, please reference the Merchant Security System Upgrade Guide. For a basic introduction to internet security, we also recommend these short videos on SSL Certificates and Public Key Cryptography.

There is a support.oracle.com article published on October 16, 2015 which has detailed steps for 11i and R12.1:

How To Install Latest Verisign Root Certificates For Use With Paypal SDK 4.3.X (Doc ID 874433.1)

The Verisign G5 root certificate can be downloaded from:

Paypal Microsite about this change: https://www.paypal-knowledge.com/infocenter/index?page=content&id=FAQ1766&expand=true&locale=en_US


Friday, October 9, 2015

sftp failure due to newline character difference between windows and unix.

Recently I spent almost a full day trying to figure out why an sftp connection would not work without a password after setting up ssh equivalence. The keys were correct, the permissions on the directories were correct, and the authorized_keys file looked OK. I then copied the authorized_keys file from another account that was working fine. When I replaced the authorized_keys file with it, after taking a backup of the original, sftp started working. So I proceeded to compare the contents of the two files in a hex editor.


On the left side you have the authorized_keys file created in Windows.
On the right side you have the same authorized_keys file created in Unix.

If you look at the ends of the lines, the Windows file shows CR LF, whereas the Unix file shows LF.

This difference is well described in the wikipedia article on newline character.

The one mistake I had made this time was to create the authorized_keys file in Windows Notepad, as I was teaching a developer how to create an authorized_keys file. Once I used vi on Unix to create the authorized_keys file and pasted the same ssh key, sftp started working without prompting for a password. I knew that Windows/DOS and Unix have different newline characters. However, I was not able to apply that knowledge until I compared the files in a hex editor.
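
A hex editor is not strictly needed to catch this. Here is a minimal Python sketch that detects Windows line endings in the bytes of a file and converts them to Unix line endings. The key text in the example is a placeholder, not a real key, and the function names are my own.

```python
def has_crlf(data):
    """True if the bytes contain any Windows-style CR LF line ending."""
    return b"\r\n" in data


def to_unix_newlines(data):
    """Replace every CR LF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")


# Placeholder content standing in for an authorized_keys file saved from Notepad.
windows_copy = b"ssh-rsa AAAAexamplekey user@host\r\n"
print(has_crlf(windows_copy))
print(to_unix_newlines(windows_copy))
```

To use it on a real file, read the file in binary mode ("rb"), check it, and write the converted bytes back in binary mode ("wb") after taking a backup.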

Whenever a techie gets to the root cause of a problem, there is a deep sense of satisfaction. I am glad I got the opportunity to troubleshoot and fix this issue by getting to its root cause.

Tuesday, September 15, 2015

Copycat blog

While doing a Google search today I noticed that another blog has copied all the content from my blog, posted it as its own, and even kept a similar-sounding name: http://oracleapps-technology.blogspot.com . I have made a DMCA complaint to Google about this. The Google team asked me to provide a list of URLs. I had to go through the copycat's whole blog and create a spreadsheet with two columns: one with the URL of my original post and the other with the URL of the copycat's post. There were 498 entries. I patiently did it, sent the spreadsheet to the Google team, and got a reply within 2 hours:


Hello,
In accordance with the Digital Millennium Copyright Act, we have completed processing your infringement notice. We are in the process of disabling access to the content in question at the following URL(s):

http://oracleapps-technology.blogspot.com/

The content will be removed shortly.

Regards,
The Google Team 

Monday, June 22, 2015

Server refused public-key signature despite accepting key!

A new SFTP connection was not working, even though everything looked fine:

1. Permissions were correct on directories:
chmod go-w $HOME/
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys
chmod 600 $HOME/.ssh/id_rsa
chmod 644 $HOME/.ssh/id_rsa.pub
chmod 644 $HOME/.ssh/known_hosts

2. Keys were correctly placed

However, it still asked for a password whenever an SFTP connection was made:

Using username "sftpuser".
Authenticating with public key "rsa-key-20150214"
Server refused public-key signature despite accepting key!
Using keyboard-interactive authentication.
Password:

I tried various things, but none worked, so I eventually went back to my notes for SFTP troubleshooting:

1. Correct Permissions
chmod go-w $HOME/
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys
chmod 600 $HOME/.ssh/id_rsa
chmod 644 $HOME/.ssh/id_rsa.pub
chmod 644 $HOME/.ssh/known_hosts

2. Make sure the owner:group on the directories and files is correct:

ls -ld  $HOME/
ls -ld  $HOME/.ssh
ls -ltr $HOME/.ssh

3. Login as root

chown user:group $HOME 
chown user:group $HOME/.ssh
chown user:group $HOME/.ssh/authorized_keys
chown user:group $HOME/.ssh/id_rsa
chown user:group $HOME/.ssh/id_rsa.pub
chown user:group $HOME/.ssh/known_hosts

4. Check for user entries in /etc/passwd and /etc/shadow

5. grep user /etc/shadow

When I did the 5th step, I found that /etc/shadow entry for the user didn't exist.  So I did these steps:

chmod 600 /etc/shadow
vi /etc/shadow
Insert this new line at the end
sftpuser:UP:::::::
Save File
chmod 400 /etc/shadow

It started working after that.
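
The checklist above can be partly automated. The Python sketch below checks the three modes that matter on the server side (home directory not writable by group or others, .ssh and authorized_keys not accessible by group or others) and whether a user has a line in the contents of /etc/shadow. The function names are my own, and this covers only the checks listed in this post, not everything sshd verifies.

```python
import os
import stat


def permission_problems(home):
    """Return complaints about modes that are looser than the checklist allows."""
    checks = [
        (home, stat.S_IWGRP | stat.S_IWOTH),
        (os.path.join(home, ".ssh"), stat.S_IRWXG | stat.S_IRWXO),
        (os.path.join(home, ".ssh", "authorized_keys"), stat.S_IRWXG | stat.S_IRWXO),
    ]
    problems = []
    for path, forbidden in checks:
        if not os.path.exists(path):
            problems.append(f"{path}: missing")
            continue
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode & forbidden:
            problems.append(f"{path}: mode {mode:o} is too open")
    return problems


def has_shadow_entry(shadow_text, user):
    """True if the given /etc/shadow contents include a line for user."""
    return any(line.split(":", 1)[0] == user for line in shadow_text.splitlines())
```

Reading /etc/shadow requires root, so has_shadow_entry takes the file contents as a string rather than opening the file itself.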

Wednesday, May 13, 2015

java.sql.SQLException: Invalid number format for port number

Jim pinged me with this error today:

on ./adgendbc.sh i get
Creating the DBC file...
java.sql.SQLRecoverableException: No more data to read from socket raised validating GUEST_USER_PWD
java.sql.SQLRecoverableException: No more data to read from socket
Updating Server Security Authentication
java.sql.SQLException: Invalid number format for port number
Database connection to jdbc:oracle:thin:@host_name:port_number:database failed
to this point, this is what i've tried.
clean, autoconfid on db tier, autoconfig on cm same results
bounced db and listener.. same thing.. nothing i've done has made a difference

I noticed that when this error was coming the DB alert log was showing:

Wed May 13 18:50:51 2015
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x8] [PC:0x10A2FFB
C8, joet_create_root_thread_group()+136] [flags: 0x0, count: 1]
Errors in file /r12.1/admin/diag/rdbms/erp/erp/trace/erp_ora_14528.trc  (incident=1002115):
ORA-07445: exception encountered: core dump [joet_create_root_thread_group()+136
] [SIGSEGV] [ADDR:0x8] [PC:0x10A2FFBC8] [Address not mapped to object] []
Incident details in: /r12.1/admin/diag/rdbms/erp/erp/incident/incdir_1002115/erp_ora_14528_i1002115.trc

Metalink search revealed this article:

Java Stored Procedure Fails With ORA-03113 And ORA-07445[JOET_CREATE_ROOT_THREAD_GROUP()+145] (Doc ID 1995261.1)

It seems that the post-patch steps for an OJVM PSU patch had not been done. We completed the steps given in the above note, and adgendbc.sh completed successfully after that.


1. Set the following init parameters so that JIT and job process do not start.

If spfile is used:

SQL> alter system set java_jit_enabled = FALSE;
SQL> alter system set "_system_trig_enabled"=FALSE;
SQL> alter system set JOB_QUEUE_PROCESSES=0;

2. Startup instance in restricted mode and run postinstallation step.

SQL> startup restrict

3. Run the postinstallation steps of OJVM PSU (Step 3.3.2 from readme)

Postinstallation
The following steps load modified SQL files into the database. For an Oracle RAC environment, perform these steps on only one node.
  1. Install the SQL portion of the patch by running the following command. For an Oracle RAC environment, reload the packages on one of the nodes.
     cd $ORACLE_HOME/sqlpatch/19282015
     sqlplus /nolog
     SQL> CONNECT / AS SYSDBA
     SQL> @postinstall.sql
  2. After installing the SQL portion of the patch, some packages could become INVALID. This will get recompiled upon access or you can run utlrp.sql to get them back into a VALID state.
     cd $ORACLE_HOME/rdbms/admin
     sqlplus /nolog
     SQL> CONNECT / AS SYSDBA
     SQL> @utlrp.sql


4. Reset modified init parameters

SQL> alter system set java_jit_enabled = true;
SQL> alter system set "_system_trig_enabled"=TRUE;
SQL> alter system set JOB_QUEUE_PROCESSES=10;
        -- or original JOB_QUEUE_PROCESSES value

5. Restart instance as normal
6. Now execute the Java stored procedure.


Ran adgendbc.sh and it worked fine.

Wednesday, April 29, 2015

R12.2 Single file system

With the release of AD and TXK Delta 6, Oracle has provided the feature of single file system on development instances for R12.2. Here's what they have mentioned in support.oracle.com article: Oracle E-Business Suite Applications DBA and Technology Stack Release Notes for R12.AD.C.Delta.6 and R12.TXK.C.Delta.6 (Doc ID 1983782.1)
Enhancements in AD and TXK Delta 6

4. New and Changed Features

Oracle E-Business Suite Technology Stack and Oracle E-Business Suite Applications DBA contain the following new or changed features in R12.AD.C.Delta.6 and R12.TXK.C.Delta.6.

4.1 Support for single file system development environments

  • A normal Release 12.2 online patching environment requires one application tier file system for the run edition, and another for the patch edition. This dual file system architecture is fundamental to the patching of Oracle E-Business Suite Release 12.2 and is necessary for production environments and test environments that are meant to be representative of production. This enhancement makes it possible to have a development environment with a single file system, where custom code can be built and tested. A limited set of adop phases and modes are available to support downtime patching of such a development environment. Code should then be tested in standard dual file system test environments before being applied to production.
More details are provided in Oracle E-Business Suite Maintenance Guide (Chapter: Patching Procedures):
http://docs.oracle.com/cd/E26401_01/doc.122/e22954/T202991T531065.htm#6169002 

Support for Single File System Development Environments
A normal Release 12.2 online patching environment requires two application tier file systems, one for the run edition and another for the patch edition. This dual file system architecture is fundamental to patching of Oracle E-Business Suite Release 12.2, and is necessary both for production environments and test environments that are intended to be representative of production. This feature makes it possible to create a development environment with a single file system, where custom code can be built and tested. The code should then always be tested in a standard dual file system test environment before being applied to production.
You can set up a single file system development environment by installing Oracle E-Business Suite Release 12.2 in the normal way, and then deleting the $PATCH_BASE directory with the command:
$ rm -rf $PATCH_BASE
A limited set of adop phases and modes are available to support patching of a single file system development environment. These are:
  • apply phase in downtime mode
  • cleanup phase
Specification of any other phase or mode will cause adop to exit with an error.
The following restrictions apply to using a single file system environment:
  • You can only use a single file system environment for development purposes.
  • You cannot use online patching on a single file system environment.
  • You can only convert an existing dual file system environment to a single file system: you cannot directly create a single file system environment via Rapid Install or cloning.
  • There is no way to convert a single file system environment back into a dual file system.
  • You cannot clone from a single file system environment.

Wednesday, April 15, 2015

You Are Trying To Access a Page That Is No Longer Active.The Referring Page May Have Come From a Previous Session. Please Select Home To Proceed

Shahed pinged me about this error, which appeared after logging in. This R12.1.3 instance had just been migrated from an old server to a new one. Once you logged in, this error would be displayed:

You Are Trying To Access a Page That Is No Longer Active.The Referring Page May Have Come From a Previous Session. Please Select Home To Proceed

The hits on support.oracle.com were not helpful, but they gave a clue that it might have something to do with the session cookie. So I used Firefox to check the HTTP headers. If you press Ctrl+Shift+K, you get a panel at the bottom of the browser. Click on the Network tab, click on AppsLocalLogin.jsp, and on the right side of the pane you will see a Cookies tab.

The domain appearing in the cookie tab was from the old server.  So I checked:

select session_cookie_domain from icx_parameters;
olddomain.justanexample.com

So I nullified it:

update icx_parameters set session_cookie_domain=null;

commit;

Restarted Apache:

cd $ADMIN_SCRIPTS_HOME
adapcctl.sh stop
adapcctl.sh start

No more error.  I was able to log in and so was Shahed.
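
My understanding of why the stale value hurt: a browser only sends a cookie back to a host that matches the cookie's domain, so a session cookie scoped to the old domain never comes back from a user on the new host. Here is a rough Python sketch of that matching rule. It is a simplification of what browsers do, the new host name is invented for illustration, and the function name is my own.

```python
def cookie_domain_matches(host, cookie_domain):
    """Simplified domain match: host equals the domain or is a subdomain of it."""
    host = host.lower().rstrip(".")
    domain = cookie_domain.lower().strip(".")
    return host == domain or host.endswith("." + domain)


# Hypothetical new host name; the old domain is the example value from the query above.
print(cookie_domain_matches("erp.newdomain.justanexample.com", "olddomain.justanexample.com"))
print(cookie_domain_matches("erp.olddomain.justanexample.com", "olddomain.justanexample.com"))
```

The first call models our situation after the migration and the second models the old server, where the same value worked.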

Chrome and E-Business Suite

Dhananjay came to me today.  He said that his users were complaining about forms not launching after upgrading to the latest version of Chrome. On launching forms they got this error:

/dev60cgi/oracle forms engine Main was not found on this server

I recalled that the Google Chrome team had announced that they would not support Java going forward. Googling with the keywords chrome java brought up this page:

https://java.com/en/download/faq/chrome.xml#npapichrome

It states that:

NPAPI support by Chrome

The Java plug-in for web browsers relies on the cross platform plugin architecture NPAPI, which has long been, and currently is, supported by all major web browsers. Google announced in September 2013 plans to remove NPAPI support from Chrome by "the end of 2014", thus effectively dropping support for Silverlight, Java, Facebook Video and other similar NPAPI based plugins. Recently, Google has revised their plans and now state that they plan to completely remove NPAPI by late 2015. As it is unclear if these dates will be further extended or not, we strongly recommend Java users consider alternatives to Chrome as soon as possible. Instead, we recommend Firefox, Internet Explorer and Safari as longer-term options. As of April 2015, starting with Chrome Version 42, Google has added an additional step to configuring NPAPI based plugins like Java to run — see the section Enabling NPAPI in Chrome Version 42 and later below.

Enabling NPAPI in Chrome Version 42 and later

As of Chrome Version 42, an additional configuration step is required to continue using NPAPI plugins.
  1. In your URL bar, enter:
    chrome://flags/#enable-npapi 
  2. Click the Enable link for the Enable NPAPI configuration option.
  3. Click the Relaunch button that now appears at the bottom of the configuration page.
Developers and System administrators looking for alternative ways to support users of Chrome should see this blog, in particular "Running Web Start applications outside of a browser" and "Additional Deployment Options" section.
Once Dhananjay did the above steps, Chrome started launching forms again. He quickly gave these steps to all his users who had upgraded to the latest version of Chrome (version 42) and it started working for them too.
Oracle doesn't certify E-Business Suite forms on Chrome.  Only self service pages of E-Business Suite are certified on Google Chrome.

Saturday, April 11, 2015

opatch hangs on /sbin/fuser oracle

Pipu pinged me today about opatch hanging. The opatch log showed this:

[Apr 11, 2015 5:24:13 PM]    Start fuser command /sbin/fuser $ORACLE_HOME/bin/oracle at Sat Apr 11 17:24:13 EDT 2015

I had faced this issue once before, but could not recall what the solution was. So I started afresh.

As oracle user:

/sbin/fuser $ORACLE_HOME/bin/oracle hung

As root user

/sbin/fuser $ORACLE_HOME/bin/oracle hung

As root user

lsof hung.

Google searches about it brought up a lot of hits about NFS issues.  So I did df -h.

df -h also hung.

So I checked /var/log/messages and found many messages like these:

Apr 11 19:44:42 erpserver kernel: nfs: server share.justanexample.com not responding, still trying

That server has a mount called /R12.2stage that has the installation files for R12.2.

So I tried unmounting it:

umount /R12.2stage
Device Busy

umount -f /R12.2stage
Device Busy

umount -l /R12.2stage

df -h didn't hang any more.

Next I did strace /sbin/fuser $ORACLE_HOME/bin/oracle and it stopped here:

open("/proc/12854/fdinfo/3", O_RDONLY)  = 7
fstat(7, {st_mode=S_IFREG|0400, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b99de014000
read(7, "pos:\t0\nflags:\t04002\n", 1024) = 20
close(7)                                = 0
munmap(0x2b99de014000, 4096)            = 0
getdents(4, /* 0 entries */, 32768)     = 0
close(4)                                = 0
stat("/proc/12857/", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/12857/stat", O_RDONLY)      = 4
read(4, "12857 (bash) S 12853 12857 12857"..., 4096) = 243
close(4)                                = 0
readlink("/proc/12857/cwd", "11.2.0.4/examples (deleted)"..., 4096) = 27
rt_sigaction(SIGALRM, {0x411020, [ALRM], SA_RESTORER|SA_RESTART, 0x327bc30030}, {SIG_DFL, [ALRM], SA_RESTORER|SA_RESTART, 0x327bc30030}, 8) = 0
alarm(15)                               = 0
write(5, "@\20A\0\0\0\0\0", 8)          = 8
write(5, "\20\0\0\0", 4)                = 4
write(5, "/proc/12857/cwd\0", 16)       = 16
write(5, "\220\0\0\0", 4)               = 4
read(6,  

It stopped here. So I did Ctrl+C
# ps -ef |grep 12857
oracle   12857 12853  0 Apr10 pts/2    00:00:00 -bash
root     21688  2797  0 19:42 pts/8    00:00:00 grep 12857

Killed this process

# kill -9 12857

Again I did strace /sbin/fuser $ORACLE_HOME/bin/oracle and it stopped at a different process this time that was another bash process.  I killed that process also.

I executed it for 3rd time: strace /sbin/fuser $ORACLE_HOME/bin/oracle

This time it completed.

Ran it without strace

/sbin/fuser $ORACLE_HOME/bin/oracle

It came out in 1 second.

Then I did the same process for lsof

strace lsof

and killed those processes where it was getting stuck.  Eventually lsof also worked.

Pipu retried opatch and it worked fine.

Stale NFS mount was the root cause of this issue.  It was stale because the source server was down for Unix security patching during weekend.
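
In hindsight, checking the system log first would have saved time. A small Python sketch that counts the "not responding" kernel messages per NFS server in the log text follows. It relies only on the message format shown above; the function name is my own.

```python
import re

_NFS_STALL = re.compile(r"nfs: server (\S+) not responding")


def unresponsive_nfs_servers(log_text):
    """Count 'not responding' kernel messages per NFS server."""
    counts = {}
    for match in _NFS_STALL.finditer(log_text):
        server = match.group(1)
        counts[server] = counts.get(server, 0) + 1
    return counts


sample = (
    "Apr 11 19:44:42 erpserver kernel: nfs: server share.justanexample.com "
    "not responding, still trying\n"
)
print(unresponsive_nfs_servers(sample))
```

Any server that shows up here is a candidate for the lazy unmount (umount -l) used above.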
 

Friday, April 3, 2015

adoafmctl.sh hangs

Rajesh and Shahed called me about this error: after a reboot of the servers, adoafmctl.sh wouldn't start. It gave errors like these:

You are running adoafmctl.sh version 120.6.12000000.3 
Starting OPMN managed OAFM OC4J instance ... 
adoafmctl.sh: exiting with status 152 
adoafmctl.sh: check the logfile 
$INST_TOP/logs/appl/admin/log/adoafmctl.txt for more information

adoafmctl.txt showing:
ias-component/process-type/process-set:
default_group/oafm/default_group/
Error
--> Process (index=1,uid=349189076,pid=15039)
time out while waiting for a managed process to start
Log:
$INST_TOP/logs/ora/10.1.3/opmn/default_group~oafm~default_group~1
07/31/09-09:14:28 :: adoafmctl.sh: exiting with status 152
================================================================================
07/31/09-09:14:40 :: adoafmctl.sh version 120.6.12000000.3
07/31/09-09:14:40 :: adoafmctl.sh: Checking the status of OPMN managed OAFM OC4J instance
Processes in Instance: SID_machine.machine.domain
-------------------+--------------------+---------+---------
ias-component | process-type | pid | status
-------------------+--------------------+---------+---------
default_group | oafm | N/A | Down

Solution:

1. Shutdown all Middle tier services and ensure no defunct processes exist running the following from the operating system:
# ps -ef | grep

If one finds any, kill these processes.
2. Navigate to $INST_TOP/ora/10.1.3/opmn/logs/states directory. It contains hidden file .opmndat:
# ls -lrt .opmndat
3. Delete this file .opmndat after making a backup of it:
# rm .opmndat
4. Restart the services.

5. Re-test the issue.

This resolved the issue.
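
Step 3 says to back up the file before deleting it, but the rm command alone does not do that. A minimal Python sketch that does both in one go is shown below. The backup naming scheme is my own choice, not something from the Oracle note.

```python
import os
import shutil
import time


def backup_and_remove(path):
    """Copy path to a timestamped backup next to it, then delete the original."""
    backup = f"{path}.bak.{time.strftime('%Y%m%d%H%M%S')}"
    shutil.copy2(path, backup)
    os.remove(path)
    return backup
```

It would be called with the full path of .opmndat under $INST_TOP/ora/10.1.3/opmn/logs/states, after all middle tier services are down.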

Monday, March 23, 2015

R12.2 Documentation link in html format

This link has the R12.2 documentation in HTML format:

https://docs.oracle.com/cd/E26401_01/index.htm 

Sunday, March 1, 2015

The EBS Technology Codelevel Checker (available as Patch 17537119) needs to be run on the following nodes

I got this error while upgrading an R12.1.3 instance to R12.2.4, when I completed AD.C.Delta 5 patches with November 2014 bundle patches for AD.C and was in the process of applying TXK.C.Delta5 with November 2014 bundle patches for TXK.C :

Validation successful. All expected nodes are listed in ADOP_VALID_NODES table.
[START 2015/03/01 04:53:16] Check if services are down
        [INFO] Run admin server is not down
     [WARNING]  Hotpatch mode should only be used when directed by the patch readme.
  [EVENT]     [START 2015/03/01 04:53:17] Performing database sanity checks
    [ERROR]     The EBS Technology Codelevel Checker (available as Patch 17537119) needs to be run on the following nodes: .
    Log file: /erppgzb1/erpapp/fs_ne/EBSapps/log/adop/adop_20150301_045249.log


[STATEMENT] Please run adopscanlog utility, using the command

"adopscanlog -latest=yes"

to get the list of the log files along with snippet of the error message corresponding to each log file.


adop exiting with status = 1 (Fail)

I was really surprised as I had already run EBS technology codelevel checker (patch 17537119) script checkDBpatch.sh on racnode1.

To investigate, I looked inside checkDBpatch.sh and found that it creates a table called TXK_TCC_RESULTS.

SQL> desc txk_tcc_results
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 TCC_VERSION                               NOT NULL VARCHAR2(20)
 BUGFIX_XML_VERSION                        NOT NULL VARCHAR2(20)
 NODE_NAME                                 NOT NULL VARCHAR2(100)
 DATABASE_NAME                             NOT NULL VARCHAR2(64)
 COMPONENT_NAME                            NOT NULL VARCHAR2(10)
 COMPONENT_VERSION                         NOT NULL VARCHAR2(20)
 COMPONENT_HOME                                     VARCHAR2(600)
 CHECK_DATE                                         DATE
 CHECK_RESULT                              NOT NULL VARCHAR2(10)
 CHECK_MESSAGE                                      VARCHAR2(4000)

SQL> select node_name from txk_tcc_results;

NODE_NAME
--------------------------------------------------------------------------------
RACNODE1

I ran checkDBpatch.sh again, but the patch failed again with the same error:

   [ERROR]     The EBS Technology Codelevel Checker (available as Patch 17537119) needs to be run on the following nodes: .

It was already 5 AM on Saturday and I had been working through the night, so I decided to sleep and tackle this on Sunday.  On Sunday morning, after a late breakfast, I looked at the problem again.  This time I realized that the error was complaining about racnode1 (in lower case) while the txk_tcc_results table had RACNODE1 (in upper case).  To test my hunch, I immediately updated the value:

update txk_tcc_results
set node_name='racnode1' where node_name='RACNODE1';

commit;

I restarted the patch, and it went through.  Patch was indeed failing because it was trying to look for a lower case value.  I will probably log an SR with Oracle, so that they change their code to make the node_name check case insensitive.

Further, I was curious why node_name was stored in all caps in fnd_nodes and txk_tcc_results.  The file /etc/hosts had it in lowercase.  I tried the hostname command at the Linux prompt:

$ hostname
RACNODE1

That was something unusual, as in our environment, hostname always returns the value in lowercase.  So I further investigated.
[root@RACNODE1 ~]# sysctl kernel.hostname
kernel.hostname = RACNODE1

So I changed it

[root@RACNODE1 ~]# sysctl kernel.hostname=racnode1
kernel.hostname = racnode1
[root@RACNODE1 ~]# sysctl kernel.hostname
kernel.hostname = racnode1
[root@RACNODE1 ~]#
[root@RACNODE1 ~]# hostname
racnode1

Logged in again to see if root prompt changed:

[root@racnode1 ~]#

I also checked
[root@tsgld5811 ~]# cat /etc/sysconfig/network
NETWORKING=yes
NETWORKING_IPV6=no
NOZEROCONF=yes
HOSTNAME=RACNODE1

Changed it here also:
[root@tsgld5811 ~]# cat /etc/sysconfig/network
NETWORKING=yes
NETWORKING_IPV6=no
NOZEROCONF=yes
HOSTNAME=racnode1

I also changed it on racnode2.
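
Since the whole problem came down to an upper-case hostname, a quick check before running checkDBpatch.sh would have saved me a night. Below is a minimal sketch of such a check; the function name is_lowercase_name is mine and is not part of the EBS tooling.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given name has no upper-case letters.
is_lowercase_name() {
    # Compare the name with its lower-cased form.
    [ "$1" = "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" ]
}

# Warn if this server reports a hostname containing upper-case letters.
if ! is_lowercase_name "$(hostname)"; then
    echo "WARNING: hostname contains upper-case letters" >&2
fi
```

Running this on each database node before the codelevel checker would flag a server like RACNODE1 up front.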

Tuesday, February 24, 2015

cannot set user id: Resource temporarily unavailable or Fork: Retry: Resource Temporarily Unavailable

Amjad reported this error while trying to login to the server:

cannot set user id: Resource temporarily unavailable

In the past he had reported this error:

Fork: Retry: Resource Temporarily Unavailable

This happens because the user has reached the limit on the number of processes (nproc) it is allowed to run.  In OEL 6.x, the default nproc soft limit is not set in /etc/security/limits.conf but in the file:

/etc/security/limits.d/90-nproc.conf

The default content in the file is:

cat /etc/security/limits.d/90-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

*          soft    nproc     1024
root       soft    nproc     unlimited

I changed this to:
$ cat /etc/security/limits.d/90-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

*          soft    nproc     16384
root       soft    nproc     unlimited
$

As soon as this change was made, Amjad was able to login.
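
To see which default is in effect on a server without opening the file, the value can be pulled out with awk. This is a sketch only; the function name soft_nproc_default is made up for this post, and it reads just the wildcard (*) soft nproc entry.

```shell
#!/bin/sh
# Hypothetical helper: print the soft nproc value configured for "*"
# in a limits file such as /etc/security/limits.d/90-nproc.conf.
soft_nproc_default() {
    # Fields are: <domain> <type> <item> <value>; comment lines never match.
    awk '$1 == "*" && $2 == "soft" && $3 == "nproc" { print $4 }' "$1"
}

# Example call:
#   soft_nproc_default /etc/security/limits.d/90-nproc.conf
# The limit actually applied to the current bash session can be
# compared with:
#   ulimit -u
```

With the default file shown above this prints 1024, and after the change it prints 16384.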

Wednesday, January 28, 2015

ERROR - CLONE-20372 Server port validation failed

Alok and Shoaib pinged me about this error. It is reported in the logs when adcfgclone.pl is run for an R12.2.4 appsTier where the source and target instances are on the same physical server.

SEVERE : Jan 27, 2015 3:40:09 PM - ERROR - CLONE-20372   Server port validation failed.
SEVERE : Jan 27, 2015 3:40:09 PM - CAUSE - CLONE-20372   Ports of following servers - oacore_server2(7256),forms_server2(7456),oafm_server2(7656),forms-c4ws_server2(7856),oaea_server1(6856) - are not available.
SEVERE : Jan 27, 2015 3:40:09 PM - ACTION - CLONE-20372   Provide valid free ports.
oracle.as.t2p.exceptions.FMWT2PPasteConfigException: PasteConfig failed. Make sure that the move plan and the values specified in moveplan are correct

The ports reported are those in the source instance.  Searching the bug database on support.oracle.com, I found three bugs:

EBS 12.2.2.4 RAPID CLONE FAILS WITH ERROR - CLONE-20372 SERVER PORT VALIDATION(Bug ID 20147454)

12.2: N->1 CLONING TO SAME APPS TIER FAILING DUE TO PORT CONFLICT(Bug ID 20389864)

FS_CLONE IS NOT ABLE TO COMPLETE FOR MULTI-NODE SETUP(Bug ID 18460148)

The situation described in the first two bugs is the same.  The bugs reference each other but don't provide any solution.

Logically thinking, adcfgclone.pl is picking this up from the source configuration in the $COMMON_TOP/clone directory.  So we ran grep on the subdirectories of $COMMON_TOP/clone:

cd $COMMON_TOP/clone
find . -type f -print | xargs grep 7256

7256 is one of the ports that failed validation.

It is present in

CTXORIG.xml and
FMW/ohs/moveplan.xml
FMW/wls/moveplan.xml
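
A plain grep for 7256 also matches longer numbers that merely contain those digits. The search above can be tightened to whole-word matches that list only file names. This is a sketch; the function name files_with_port is my own, and it relies on the -r option of GNU grep.

```shell
#!/bin/sh
# Hypothetical helper: list files under a directory that contain the
# given port number as a whole word (7256 matches, 17256 does not).
#   $1 = directory to search, $2 = port number
files_with_port() {
    grep -rlw -- "$2" "$1" | sort
}

# Example call:
#   files_with_port "$COMMON_TOP/clone" 7256
```

Repeating the call for each port in the CLONE-20372 message shows every file that still carries a source port.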

We first tried changing the port numbers only in CTXORIG.xml and re-ran adcfgclone.pl, but it failed again.

So we changed the port numbers of the ports that failed validation in

$COMMON_TOP/clone/FMW/ohs/moveplan.xml and
$COMMON_TOP/clone/FMW/wls/moveplan.xml

cd $FMW_HOME
find . -name detachHome.sh |grep -v Template

The above command returns the detachHome.sh scripts for all the ORACLE_HOMEs inside FMW_HOME.  We executed each of these scripts to detach all of the homes.
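
Running the scripts one by one gets tedious when there are several homes, so the find command can feed a loop. This is a rough sketch under my own assumptions: the function name run_detach_scripts is made up, and it assumes each detachHome.sh can be started with sh from any directory, which you should confirm on your own instance first.

```shell
#!/bin/sh
# Hypothetical helper: run every detachHome.sh found under a directory,
# skipping the copies that live under Template paths.
#   $1 = top-level directory to search (for example $FMW_HOME)
run_detach_scripts() {
    find "$1" -name detachHome.sh | grep -v Template |
    while IFS= read -r script; do
        echo "Running $script"
        # Report a failure but carry on with the remaining homes.
        sh "$script" || echo "FAILED: $script" >&2
    done
}

# Example call:
#   run_detach_scripts "$FMW_HOME"
```

Any script reported as FAILED should be looked at before the FMW_HOME directory is removed.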

We then removed the FMW_HOME directory.

Finally, we re-executed:
adcfgclone.pl appsTier

It succeeded this time.  Till we get a patch for this bug, we will continue to use this workaround to complete clones.


Friday, January 16, 2015

ERROR: The following required ports are in use: 6801 : WLS OAEA Application Port

Anil pinged me today when his adop phase=fs_clone failed with this error message:

-----------------------------
ERROR: The following required ports are in use:
-----------------------------
6801 : WLS OAEA Application Port
Corrective Action: Free the listed ports and retry the adop operation.

Completed execution : ADOPValidations.java

====================================
Inside _validateETCHosts()...
====================================

This is a bug mentioned in the appendix of article: 
Bug 19817016
The following errors are encountered when running fs_clone after completing AccessGate and OAM integration and after completing a patch cycle:

Checking  WLS OAEA Application Port on aolesc11:  Port Value = 6801
RC-50204: Error: - WLS OAEA Application Port in use: Port Value = 6801

-----------------------------
ERROR: The following required ports are in use:
-----------------------------
6801 : WLS OAEA Application Port
Corrective Action: Free the listed ports and retry the adop operation.

Workaround:
Stop the oaea managed server on the run file system before performing the fs_clone operation, immediately after the accessgate deployment.

Solution:
This issue will be addressed through Bug 19817016.

If you read the bug:

Bug Attributes

Type: B - Defect
Severity: 2 - Severe Loss of Service
Status: 11 - Code/Hardware Bug (Response/Resolution)
Created: 14-Oct-2014
Updated: 02-Dec-2014
Database Version: 11.2.0.3
Product Source: Oracle
Product Version: 12.2.4
Platform: 226 - Linux x86-64
Base Bug: N/A
Affects Platforms: Generic

Related Products

Line: Oracle E-Business Suite
Family: Applications Technology
Area: Technology Components
Product: 1745 - Oracle Applications Technology Stack
Hdr: 19817016 11.2.0.3 FSOP 12.2.4 PRODID-1745 PORTID-226
Abstract: RUNNING ADOP FS_CLONE FAILS DUE TO PORT CONFLICT BETWEEN RUN AND PATCH EDITION
 
*** 10/14/14 11:58 am ***
Service Request (SR) Number:
----------------------------
 
 
Problem Statement:
------------------
Running fs_clone after completing EBS and OAM integration and after 
completing a patch cycle results in fs_clone failing with the following 
errors:
 
Checking  WLS OAEA Application Port on aolesc11:  Port Value = 6801
RC-50204: Error: - WLS OAEA Application Port in use: Port Value = 6801
 
-----------------------------
ERROR: The following required ports are in use:
-----------------------------
6801 : WLS OAEA Application Port
Corrective Action: Free the listed ports and retry the adop operation.
 
Detailed Results of Problem Analysis:
-------------------------------------
The problem is due to the newly added managed server port being the same for 
both the run and patch edition.  Going back to the sequence of steps and 
tracking the port assignment, it showed the following:
 
- deploy accessgate on patch
Creates managed server - oaea_server1:6801
This is the default port and doing this to the patch edition...
 
fs2 - run -> 6801 port
fs1 - patch -> 6801 port
 
- complete OAM registration
- close patching cycle
- cutover
- after cutover, SSO is working
 
fs1 - run -> 6801 port
fs2 - patch -> 6801 port
 
- fs_clone -> fails due to both run(fs1) and patch(fs2) referencing the same 
port 6801
 
Configuration and Version Details:
----------------------------------
OAM - 11.1.2.2.0
WG - 11.1.2.2.0
EAG - 1.2.3
WT - 11.1.1.6.0
 
EBS 12.2.4 w/ AD/TXK delta 5
 
Steps To Reproduce:
-------------------
As part of the EBS integration w/ OAM, we add a managed server for use as the 
EBS AccessGate (EAG) to the existing WLS in EBS.  There is an option to do 
this to both run edition, as well as the patch edition during an active patch 
cycle.  In this case the latter was done.  Here is a summary of the steps 
used:
 
1. Start patch cycle
2. Integrated OID and EBS
3. Cutover
4. Confirmed OID provisioning is working
5. Start patch cycle
6. Apply pre-req EBS patches for OAM
7. Proceed w/ OAM integration on patch file system
8. Cutover
9. Confirmed SSO/OAM is working
10. Run fs_clone -> this is where the issue appears
 
 
Additional Information:
-----------------------
The workaround here is to stop the oaea_server1 managed server operating in 
the run edition on port 6801, and then re-running fs_clone.  Once this is 
done, fs_clone completes and the patch edition now operates on port 6802 for 
the same managed server.
 
For A Severity 1 Bug: Justification and 24x7 Contact Details:
-------------------------------------------------------------
 
 

Until a patch is made available, you need to shut down the oaea managed server and restart fs_clone. So much for keeping all services online and the promise of no outage during fs_clone.
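
When the list of conflicting ports is longer than one line, it helps to pull just the port numbers out of the adop output. The sketch below is not an adop utility; the function name ports_in_use is mine, and it assumes the "port : description" line format shown in the error above.

```shell
#!/bin/sh
# Hypothetical helper: print the port numbers that adop reported as in use.
#   $@ = one or more log files containing lines like
#        "6801 : WLS OAEA Application Port"
ports_in_use() {
    # Match lines that start with digits followed by " : ".
    awk '/^[0-9]+ : / { print $1 }' "$@" | sort -u
}

# Example call (replace with the path of the failed session's log):
#   ports_in_use /path/to/adop.log
```

For the error above this prints 6801, the port held by the oaea managed server on the run file system.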