Catastrophic Failover

7 March 2005

One of the oft-advertised features of modern application servers is that they provide failover in a cluster. Clustering improves the reliability of your application: if one of your servers goes down, you have some more up to serve your customers. Failover can add even more reliability: if a server goes down in the middle of an interaction, the cluster can move that interaction to another server.

However, this can be a problem.

A request may do something that causes a server to crash, perhaps by unwittingly exposing a bug in the server software. So when the failover kicks in, the deadly request gets moved to another server, which it can then bring down in turn. Get the timing just right and by the time the first server has rebooted, it will be ready to receive that request again.

(In case you're wondering, this is a true story.)

So if you see your servers repeatedly going down, an errant transaction could well be the cause. To prevent this, you need a check to ensure that you don't migrate a request that's already been through a couple of failovers. It's good to fail over, but you don't want your farm to do it too often.
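
As a rough illustration of such a check, here is a minimal Python sketch. Everything in it is hypothetical: the `Request` class, the `fail_over` function, the `FailoverLimitExceeded` exception, and the `MAX_FAILOVERS` threshold of two are names and values invented for this example, not part of any real application server's API. It assumes the cluster can carry a failover count along with each request.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumed threshold: "a couple" of failovers, per the text above.
MAX_FAILOVERS = 2


@dataclass
class Request:
    """Hypothetical request record that remembers how often it has been migrated."""
    request_id: str
    failover_count: int = 0


class FailoverLimitExceeded(Exception):
    """Raised when a request has already been failed over too many times."""


def fail_over(request: Request, healthy_servers: Sequence[str]) -> str:
    """Return the server to retry the request on, or refuse to migrate it.

    A request that has already taken part in MAX_FAILOVERS failovers is
    treated as a likely cause of the crashes and is rejected instead.
    """
    if request.failover_count >= MAX_FAILOVERS:
        raise FailoverLimitExceeded(
            f"request {request.request_id} has already been failed over "
            f"{request.failover_count} times"
        )
    if not healthy_servers:
        raise RuntimeError("no healthy servers available")
    request.failover_count += 1
    # Simplest possible choice: the first healthy server in the list.
    return healthy_servers[0]
```

With this sketch, a request that crashes its first server can be moved twice; a third attempt raises `FailoverLimitExceeded`, so the request can be rejected and logged rather than handed to yet another server.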

Update: Christopher Baus pointed out that this problem suggests you should deliberately use different equipment on your cluster. So if you're running a Java application, consider using a mix of different app servers, operating systems, and hardware. A mix is more complex to manage, of course, but greatly reduces the chance of this problem happening.