Almost all Evo service bureaus have at least some server gear in-house: even if Evolution itself is hosted by Asure in AWS, you may still have a file server, print server, or the like.
And you probably have an Uninterruptible Power Supply (UPS), a battery backup, to keep your gear running if the power goes out. It's most helpful so a brief flicker on the power lines doesn't take down your servers, but for real outages all it really needs to do is keep the machines running long enough to be shut down gracefully.
An ungraceful shutdown can be catastrophic for your gear and for the key business data it holds.
Like anything else electronic, a UPS has limited life, but unlike other electronic gear, it can predict when it's going to fail: the "Replace batteries" light will come on when the UPS determines it's not holding a charge well.
UPS battery replacement is pretty straightforward, something most IT service firms can do for you easily, though the heavy batteries do have higher than usual shipping costs.
It's no surprise that old or bad batteries leave you with no protection from a real loss of electrical power, but that's not the only risk. Bad batteries can actually turn a minor electrical problem into a major one.
That tiny flicker on the power line that you don't see, and one that the redundant power supplies on your server gear could easily (ahem) power through, is correctly detected by the UPS as a power issue, but when it attempts to switch to battery, the wheels come off and your gear crashes.
These hard, ungraceful shutdowns can be devastating to your data: even if they don't physically damage the hardware, a hard crash very commonly leaves the filesystem in a bad state, with the files on it (such as your Evolution databases on the /db/ partition) corrupted.
A UPS with bad batteries is worse than having no UPS at all.
When your UPS tells you to replace the batteries, DO IT; your gear is at risk.
Temporary workaround
For gear with redundant power supplies, a workaround I recommend while waiting for the batteries (or a new UPS, if that's the issue) is to plug one power cord into the UPS and the other power cord into a regular wall socket.
The wall plug connection doesn't protect you from a true loss of electrical power, but it can protect you from a bad UPS.
This is not speculation: this happened today at a customer.
Four Evo servers were plugged into a UPS with bad batteries, but one of them (the request broker, as it happens) had its second power supply plugged into a regular wall socket.
When the UPS decided to crash all the gear plugged into it, only the request broker machine kept running. Evo was down, of course, but that machine didn't suffer the potentially catastrophic data corruption that the other servers were exposed to.
The best solution for the long term, of course, is redundant UPS units, where each server with redundant power supplies plugs into both of them; this gives you the clean, conditioned power that a UPS can provide (and the wall outlet cannot), as well as protection from a flaky UPS.
Power budgets
An important concern here: each UPS can only deliver so much juice, and there's a tradeoff between power and runtime. When running two UPS units, you have to be sure you know where all the power is going.
A machine with redundant power supplies typically takes power from only one cord or the other, and you can't easily tell which of the two UPS units it's drawing on. This means that when you look at the current load on a UPS, you can't tell whether it's doing the heavy lifting for all the servers, or whether some of them happen to be drawing their power from the other unit.
The technical discussion of how to allocate your power budgets is beyond the scope of a simple blog post, but a simple rule is: if you have two UPS units, their load percentages must never add up to more than 100%, because if one UPS fails, the other has to pick up the entire load by itself.
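To make that rule concrete, here is a rough sketch of the arithmetic in Python. The wattage figures and the names (Ups, survives_single_failure, combined_load_percent) are made up for illustration, and it assumes every server is plugged into both units so that the full load lands on whichever UPS survives. It is a sanity check only, not a substitute for proper power planning.

```python
from dataclasses import dataclass


@dataclass
class Ups:
    name: str
    capacity_watts: float  # rated output of the unit (illustrative figure)
    load_watts: float      # load currently shown on the unit's display


def combined_load_percent(units):
    """Add up each unit's load as a percentage of its own capacity."""
    return sum(100.0 * u.load_watts / u.capacity_watts for u in units)


def survives_single_failure(units):
    """For each unit, report whether the others could carry the whole load
    if that one unit failed. Assumes all gear is dual-corded."""
    total_load = sum(u.load_watts for u in units)
    result = {}
    for failed in units:
        remaining = sum(u.capacity_watts for u in units if u is not failed)
        result[failed.name] = total_load <= remaining
    return result


# Hypothetical example: two 1500 W units.
ok_pair = [Ups("A", 1500, 600), Ups("B", 1500, 500)]
overloaded_pair = [Ups("A", 1500, 900), Ups("B", 1500, 800)]

for pair in (ok_pair, overloaded_pair):
    print(round(combined_load_percent(pair), 1), survives_single_failure(pair))
```

With equal-sized units, the percentage rule and the watt-based check agree. If your units are different sizes, the watt-based check is the one to trust, since the smaller unit is the one that has to carry everything in the worst case.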
Be sure to engage a qualified advisor when planning your UPS strategy to make sure you get your power budgets right.
In any case: do not volunteer for an emergency by neglecting the battery warning.