More GlassFish load-balancing tips for Connector/J

Almost two weeks ago, I encouraged GlassFish users who need load-balanced JDBC connections to MySQL Cluster (or master-master replicated MySQL Server) to set the loadBalanceValidateConnectionOnSwapServer property to true, to help ensure that the connection chosen at re-balance is still usable.  That advice led to the discovery of a bug (14563127) that can cause the following Exception message:

No operations allowed after connection closed. Connection closed after inability to pick valid new connection during fail-over.

If you set the loadBalanceValidateConnectionOnSwapServer property and are seeing the above error message, updating your driver to the newly-released 5.1.22 build will likely solve the problem.  Here’s a quick look at the internals of Connector/J re-balance operations, some additional configuration suggestions, and details on the bug fixed in 5.1.22:

When loadBalanceValidateConnectionOnSwapServer is enabled, Connector/J attempts a ping using the newly-selected physical connection.  This happens in a try/catch block, inside a loop, which executes at most one more time than the number of configured hosts (meaning, if you have two configured load-balanced hosts, the loop will execute a maximum of three times).  If the ping fails in the try block, the catch block adds the host to the global blacklist (if configured, and it should be), and the loop tries again, selecting and testing a new connection.  The only way to get the Exception message shown above is for Connector/J to try three times (on a two-host load-balanced configuration) to find a valid connection and fail every time.  Leaving the bug aside, the following three scenarios are the most likely causes of the above Exception message:
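The loop described above can be modeled roughly as follows. This is a simplified sketch, not Connector/J source code: the class and method names are hypothetical, the host picker is a deterministic stand-in for the real load-balancing strategy, time is frozen at a single instant, and ignoring the blacklist when every host is on it is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Simplified model of the validate-on-swap loop. NOT Connector/J source code;
// all names here are hypothetical.
class RebalanceLoopSketch {

    // Returns the first host that answers the ping, or null if every attempt fails.
    // The blacklist maps a host to the time (ms) at which its entry expires.
    static String pickValidHost(List<String> hosts, Predicate<String> ping,
                                Map<String, Long> blacklist,
                                long blacklistTimeoutMs, long nowMs) {
        int maxAttempts = hosts.size() + 1; // one more than the number of configured hosts
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String candidate = selectHost(hosts, blacklist, nowMs);
            if (ping.test(candidate)) {
                return candidate; // the ping succeeded, so this host is usable
            }
            if (blacklistTimeoutMs > 0) {
                // A timeout of zero models "global blacklist not configured".
                blacklist.put(candidate, nowMs + blacklistTimeoutMs);
            }
        }
        return null; // corresponds to the "inability to pick valid new connection" case
    }

    // Stand-in strategy: first host without an unexpired blacklist entry.
    // Assumption of this sketch: if every host is blacklisted, ignore the blacklist.
    static String selectHost(List<String> hosts, Map<String, Long> blacklist, long nowMs) {
        for (String host : hosts) {
            Long expiry = blacklist.get(host);
            if (expiry == null || expiry <= nowMs) {
                return host;
            }
        }
        return hosts.get(0);
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("hostA", "hostB");
        Predicate<String> onlyBIsUp = host -> host.equals("hostB");
        // Blacklist enabled: hostA fails once, is blacklisted, and hostB is chosen.
        System.out.println(pickValidHost(hosts, onlyBIsUp, new HashMap<>(), 5000, 0));
        // Blacklist disabled: hostA is picked on all three attempts and the loop gives up.
        System.out.println(pickValidHost(hosts, onlyBIsUp, new HashMap<>(), 0, 0));
    }
}
```

With the blacklist disabled, the stand-in strategy keeps choosing the same down host, which is the failure mode the second scenario in the list below describes.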

  1. The servers (or intermediate network) went down, and Connector/J can’t reach any configured host.
  2. The global blacklist is not configured, allowing the load balancing strategy to pick the same down host three times consecutively.
  3. The global blacklist timeout is shorter than the duration that Connector/J waits for the ping operation to complete, allowing the same invalid host to be tried multiple times within the same loop.

There’s not much to be done about the first situation – if the hosts are all unreachable, there’s nothing Connector/J can do at this point.

The second scenario should be avoided by setting loadBalanceBlacklistTimeout to a non-zero value.  This is how long a blacklisted host stays in the global blacklist before it may be re-tested.  The right value depends on your deployment environment.  If you have an unreliable network, you might want a shorter timeout, so that Connector/J re-tests the host more frequently, than in situations where you regularly bring hosts down for maintenance.  Testing connection validity takes time during re-balance (which is why it’s not enabled by default), and that means slower responses to operations like commit() or rollback().  Your goal in configuring loadBalanceBlacklistTimeout is to pick a duration that balances the need to avoid performing expensive connection validation on an unavailable host against the need to recognize when the host is brought back online.  You don’t have to worry that every host will be added to the global blacklist, and your application will be stuck – when the global blacklist is full and

The third scenario can be controlled by setting the loadBalancePingTimeout property to a value that is significantly smaller than loadBalanceBlacklistTimeout.  This is the time Connector/J will wait for the ping command to complete.  It should be fairly low; 100 ms should be sufficient for most deployments.  In a worst-case scenario, you may need to ping each host in a re-balance cycle, and you’ll want that to complete before loadBalanceBlacklistTimeout expires.  Make sure that loadBalancePingTimeout does not exceed loadBalanceBlacklistTimeout / number of configured hosts.
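As a quick sanity check of that rule, the arithmetic can be written out as below. This is only a sketch: the class and method names are hypothetical, and the 5000 ms blacklist timeout in the example is an illustrative value, not a recommendation from this post.

```java
// Sketch of the rule: loadBalancePingTimeout <= loadBalanceBlacklistTimeout / hosts.
// Class and method names are hypothetical.
class PingBudgetSketch {

    // Largest ping timeout (ms) that still satisfies the rule.
    static long maxPingTimeoutMs(long blacklistTimeoutMs, int hostCount) {
        if (blacklistTimeoutMs <= 0 || hostCount <= 0) {
            throw new IllegalArgumentException("timeout and host count must be positive");
        }
        return blacklistTimeoutMs / hostCount; // integer division rounds down
    }

    // True when the ping timeout is non-zero and within the budget.
    static boolean isWithinBudget(long pingTimeoutMs, long blacklistTimeoutMs, int hostCount) {
        return pingTimeoutMs > 0
                && pingTimeoutMs <= maxPingTimeoutMs(blacklistTimeoutMs, hostCount);
    }

    public static void main(String[] args) {
        // Two hosts and a (hypothetical) 5000 ms blacklist timeout.
        System.out.println(maxPingTimeoutMs(5000, 2));     // 2500
        System.out.println(isWithinBudget(100, 5000, 2));  // true
        System.out.println(isWithinBudget(3000, 5000, 2)); // false
    }
}
```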

The bug that was fixed caused the previously-used connection to be added to the global blacklist instead of the newly-selected and tested connection.  As a result, in a two-server configuration where there was an active connection to Server A and a stale physical connection to Server B, re-balancing could choose Server B’s stale connection, test it, discover that it is stale, but add Server A to the global blacklist.  Subsequent iterations of the loop would do the same thing – selecting, creating and testing connections to Server B.  At this point, the only way a valid connection will be found is if a connection can be established to Server B.  If Server B is offline, the loop will terminate having only tried to connect to Server B, and the connection will be closed – even though Server A can support viable connections.
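To make the difference concrete, here is a small model of the two behaviors. It is a sketch under stated assumptions, not the actual Connector/J code or patch: the names are hypothetical, the blacklist never expires within one call, the strategy is replaced by a fixed preference order that tries Server B first, and falling back to the first host when everything is blacklisted is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Model of the pre-5.1.22 blacklisting mistake. NOT Connector/J source code;
// all names here are hypothetical.
class BlacklistBugSketch {

    // Returns the host that passed validation, or null if the loop gave up.
    // When blacklistWrongHost is true, a failed ping blacklists the previously
    // used host (the bug) instead of the host that was just tested (the fix).
    static String rebalance(List<String> preferenceOrder, String currentHost,
                            Predicate<String> ping, boolean blacklistWrongHost) {
        Set<String> blacklist = new HashSet<>();
        int maxAttempts = preferenceOrder.size() + 1;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Assumption: fall back to the first host if all are blacklisted.
            String candidate = preferenceOrder.get(0);
            for (String host : preferenceOrder) {
                if (!blacklist.contains(host)) {
                    candidate = host;
                    break;
                }
            }
            if (ping.test(candidate)) {
                return candidate;
            }
            blacklist.add(blacklistWrongHost ? currentHost : candidate);
        }
        return null;
    }

    public static void main(String[] args) {
        // Server B is tried first; only Server A is actually reachable.
        List<String> order = Arrays.asList("serverB", "serverA");
        Predicate<String> onlyAIsUp = host -> host.equals("serverA");
        // Buggy behavior: serverA is blacklisted, serverB is retried, result is null.
        System.out.println(rebalance(order, "serverA", onlyAIsUp, true));
        // Fixed behavior: serverB is blacklisted and serverA is selected.
        System.out.println(rebalance(order, "serverA", onlyAIsUp, false));
    }
}
```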

Recapping the recommendations made here:

  1. Update to Connector/J 5.1.22 if you see the referenced Exception.
  2. Make sure loadBalanceBlacklistTimeout is set to a non-zero value appropriate for your deployment needs.
  3. Make sure loadBalancePingTimeout is set to a non-zero value, no larger than loadBalanceBlacklistTimeout / number of configured hosts.
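Put together, the three properties might appear in a load-balanced JDBC URL along these lines. This is a minimal sketch: the helper class, host names, database name and timeout values are hypothetical placeholders, and in GlassFish you may prefer to set the same properties on the connection pool rather than in the URL.

```java
import java.util.Arrays;
import java.util.List;

// Builds a Connector/J load-balanced URL carrying the properties discussed above.
// The helper itself and the sample hosts, database and values are hypothetical.
class LoadBalanceUrlSketch {

    static String buildUrl(List<String> hostPorts, String database,
                           long blacklistTimeoutMs, long pingTimeoutMs) {
        return "jdbc:mysql:loadbalance://" + String.join(",", hostPorts) + "/" + database
                + "?loadBalanceValidateConnectionOnSwapServer=true"
                + "&loadBalanceBlacklistTimeout=" + blacklistTimeoutMs
                + "&loadBalancePingTimeout=" + pingTimeoutMs;
    }

    public static void main(String[] args) {
        // Two placeholder hosts, a 5000 ms blacklist timeout and a 100 ms ping timeout.
        List<String> hosts = Arrays.asList("host1:3306", "host2:3306");
        System.out.println(buildUrl(hosts, "mydb", 5000, 100));
    }
}
```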

 

 
