When our daughter was about 2, my wife and I noticed she had become almost disturbingly attached to a particular blanket. She carried it everywhere and slept with it every night; they were inseparable. There was nothing remarkable about the blanket, except that if Daniella (our daughter) couldn’t find it she would melt down like a popsicle on the sun. Being reasonably intelligent people who valued the few moments of peace and quiet possible with a 2-year-old (i.e. when they are sleeping) we made a bold move that turned out to be one of our most brilliant parenting decisions ever:
We cut the blanket in half.
Sounds cruel, I know, but the thing is, Daniella didn’t even notice. She carried around her half-blanket as contentedly as the full-sized version, and suddenly my wife and I had the most valuable commodity imaginable in our quest for domestic balance: A hot spare blanket. Wherever we went, we could safely take the half-blanket without fear of complete blanket failure; half would go with us, and half stayed at home.
Extending our brilliant logic, after a couple of half-blanket-loss scares, we halved the halves and quartered the blanket. Now we could stash one quarter in the car and give one to Grandma in addition to the home spare and “production” quarters.
We could have taken a number of more complicated approaches as well: We could have attached a magnet to Daniella and a magnet to the blanket; we could have tethered the blanket to her ankle; we could have set up an alarm to go off as soon as she lost contact with the blanket for more than 30 seconds. But why? If she lost the blanket, even for the minute it took to pull out the replacement, she would be upset and we’d have to calm her down. Our goal was not to create some magical world where the blanket was un-losable; we tried to create a situation where we could replace it as fast as possible.
Sometimes I think such simple solutions get lost in the complicated world of IT infrastructure. We have clustering, hot failover, live migration, and on and on and on. Problem is, we’ve all seen clustering and complicated redundancy architectures collapse under their own weight. Just last week (true story) I saw a Cisco IOS bug cause a router to run out of memory. Since this particular router had an active failover, it should have been no problem, right? Wrong. When the active router ran out of memory it lost its configuration (maybe that’s another bug – I don’t know and it doesn’t matter) and promptly forgot it had a failover partner. All routing was down for a couple of minutes while the failover was manually engaged.
Was the investment in the spare router a good one? Obviously. But that fancy auto-failover option cost extra money, too, and it failed the owners exactly when they needed it. It literally had one job to do, so what good was it? Obviously the automatic failover works in many cases, but we manually failed over and the sun came up the next day.
Customers ask me about RAC for E-Business Suite all the time. “We can’t have any downtime!” they pant. Really? What about all that application patching? Database upgrades? R12 upgrade? And what do you think will happen to all the concurrent manager and user sessions attached to a failed RAC instance? Do they somehow maintain state from the failed node and re-attach, mid-query, to the running node? (No.) I think RAC is a fantastic technology when properly planned for and managed, but it’s complicated. Clustering software is still software, after all, and can make a decision to fail over when a human would not. Regardless, the storage, no matter how internally redundant, is still a single point of failure for RAC; corrupt the storage and you’re in recovery mode no matter how much money was spent.
Rather than spending madly in circles to prevent all possible downtime, why not concede a few minutes of failover and simply set up a standby? Data Guard is far less complicated than RAC, it’s rock-solid, and if a RAC node goes down some sessions will have to be restarted anyway – a little bit of downtime is unavoidable. Instead of buying expensive, internally redundant application nodes, why not buy a whole pile of inexpensive white-box Linux servers and run parallel concurrent managers and load-balanced web and forms?
The cost of software, hardware, and labor to make sure nothing ever goes down has no ceiling. We have multi-pathed database nodes running RAC instances on top of RAID 10 storage with redundant heads and controllers in cabinets with redundant power, and still the database will unexpectedly go down from time to time. In no way am I suggesting we throw up our hands and take no precautions (e.g. disk should always be mirrored), but sometimes the simplest solution is still the best, not to mention less expensive.
Instead of RAC on top of an expensive, internally redundant SAN, why not run separate database servers with Data Guard on top of smaller, less expensive storage arrays, all housed in separate cabinets? The RAC solution is more complicated and more expensive, yet cannot guarantee 100% uptime; the Data Guard solution can fail over very quickly, is less expensive, is simpler to manage, and is far less susceptible to single points of failure to boot!
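If it helps to put rough numbers on "a few minutes of failover," here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names are made up, and the failure count and recovery time are hypothetical inputs, not measurements from any real RAC or Data Guard environment. All it does is multiply failures per year by minutes to recover and turn the result into an availability percentage.

```python
# Back-of-the-envelope downtime arithmetic.
# All inputs are hypothetical; plug in your own failure history.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def expected_downtime_minutes(failures_per_year, minutes_to_recover):
    """Expected unplanned downtime per year, in minutes."""
    return failures_per_year * minutes_to_recover


def availability_percent(downtime_minutes):
    """Share of the year the service is up, as a percentage."""
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    # Hypothetical: two failures a year, ten minutes each to switch
    # over to a standby by hand.
    downtime = expected_downtime_minutes(failures_per_year=2,
                                         minutes_to_recover=10)
    print(f"Expected downtime: {downtime} minutes per year")
    print(f"Availability: {availability_percent(downtime):.3f}%")
```

With those made-up inputs, twenty minutes of downtime a year still works out to better than 99.99% availability. The useful exercise is to rerun it with your own numbers and then ask how much the last few minutes are really worth.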
Clustering, RAC, and other HA solutions are probably the right solution for many IT problems, but not as a rule. Sometimes we need to step back, evaluate the impact of the most realistic failures, and make the decision to minimize downtime instead of throwing thousands upon thousands of dollars at vain attempts to cure downtime.
Sometimes it’s best to just cut the blanket in two.