Tuesday, April 24, 2012

On Weblogic session replication messages

This post discusses two common error messages you may encounter when managing Weblogic clusters with in memory replication: BEA-100089 and BEA-000117 The goal is to try to explain what to do (if anything has to be done) when you encounter them.

BEA-100089 The session id: sessionID has been accessed from currentServer, a server that is neither the primary (primaryServer) nor the secondary (secondaryServer). The request URL was: requestUrl

As the message says, a request for a given sessionId has been routed to a node of the cluster which was not hosting neither the primary copy of the session nor the secondary one (if using in memory replication, each HTTP session is created and served on a node of the cluster, and at the end of each request it is replicated on a secondary node).

The first question is: if I see this message should I think that the load balancer is routing requests to the wrong node? 
Not necessarily.
Despite in case of bad routing this message is likely to appear, there are a lot of cases where it does not point out any specific issue.
Let's say the primary node for a session is shutdown. At the following request, two behavior can be seen. If the load balancer sends the requet to the node that was the former secondary, no message will be shown. This node already has a copy of the session, thus it will serve it and will just elect a new secondary at the end of the transaciton.
This is not always the case, in fact the load balancer (at least this is true for wlproxy plugin) upon failure of the primary node, intentionally chooses randomly the new node instead of choosing the former secondary. This happens to avoid to keep the amount of sessions well balanced redistributing the sessions previously held by the failing node.

How can the node know who was the primary and who was the secondary?
This is feasible since the session id contains the JVM id of both primary and secondary in this form:
<SESSIONID>!<PRIMARY_JVM>!<SECONDARY_JVM>!<SESSION_CREATION_TIMESTAMP>
Thus each node receiving a request with a session id is able to know if it is supposed to know the session and who are primary and secondary.

Here is a real life example of this message to understand what we can extract from it:
<Apr 23, 2012 11:13:34 PM CEST>
<Warning>
<HTTP Session>
<BEA-100089>
<The session ID:
CJ1nPVGDlQCk2prqZ3hrChTQGZwXvpQG3FmnxLxC6TYZDlKJmVF7
has been accessed from
810226440766491326S::myCluster:serv1,
a server that is neither the primary
(2703743310001936216S:192.168.1.13:[7004,7004,-1,-1,-1,-1,-1]:myCluster:serv2)
nor the secondary
(8742509417973544740S:192.168.1.13:[7005,7005,-1,-1,-1,-1,-1]:myCluster:serv3).
The request URL was: http://192.168.1.13:7003/clusteredWebapp/UselessServlet>
  • 810226440766491326S::myCluster:serv1 is the identifier of the server that received the request: id::cluster name::server name
  • Primary and secondary are expressed in this form: 2703743310001936216S:192.168.1.13:[7004,7004,-1,-1,-1,-1,-1]:myCluster:serv2. Including server id, ip, port, replication port, cluster and server name.

BEA-000117 Received a stale replication request for object id.

This message literally means that a node (the one where the message is logged) received a copy of an object for replication from another node but the node receiveing the copy already had a copy of the same object of the same session.

Most of the time this happens in case there is a real issue on the node sending the replica, so that the primary take an insanely long time to send the replica to the secondary. Meanwhile a new request is served and the load balancer ignores the former primary. In this case the replica of the second transaciton can get to the secondary node before the replica of the first.

There is anyway a quite common case where this message can appear (together with BEA-000089) and such that no real issue is present. Let's imagine we have a cluster with three servers: serv1, serv2 and serv3.
We have our session being served by serv1 and such that serv2 is the secondary node. Now, for any reason, the load balancer decides to make the session failover on serv3 and serv3 elects serv2 as secondary.
After a while for some reason the session is failed over again to serv1 (which still had an old version of the session). Since serv1 already has a copy of the sesison it won't try to get the up to date content of the session from the former secondary. It will anyway try to elect a secondary node and send the stale session there during replication.
If the elected secondary already had a copy of the session (serv2 for example), this will trigger an occurrence of this message even if there is no real infrastructure issue.
This message is not only a warning. If it appears it means that replication failed and that the secondary has not been elected. This can be verified through weblogic console  on monitoring section in cluster.