Customers who host their own Evolution gear sometimes handle all their own IT, but many outsource to local IT firms, and the MSP (Managed Service Provider) model has become very popular. In this model, customers pay a flat fee per month for each server or workstation, allowing for predictable cost projections.
In my experience, the first thing that any MSP does is install their remote management software on all the machines, including the Evolution servers. This software provides remote access for their technicians, collects monitoring and alert data that phones home to their mothership, and allows for keeping a close eye on the machine.
You want this: knowing that the C: drive is filling up, that there are critical errors in the event log, or any of the many other things that require attention gives far better ongoing monitoring than a consultant like me can provide: I'm not an MSP.
But the line has to be drawn between monitoring/remote access for the technicians (which helps your business), and proactive intervention on the machine (which is often bad).
Evolution servers have special requirements for reboots, because the Evolution application services have to be started in a certain order, and this almost never happens correctly by random chance.
Managed Service Providers must not configure any automated process that changes the system state, and must never reboot a server, whether automatically or on a technician's schedule, without coordinating with the Evolution software.
Almost every time a customer gets a new IT provider, we find Evolution problems in the morning that are traced back to some MSP-initiated action, such as an overnight installation of Windows updates, performed without any of the special knowledge of Evolution services.
The "aha!" moment while researching today's customer problem came from the Windows Event Log:
"LTSVC.exe" is LabTech (very popular MSP software), and it initiated a restart of the server EVO6 just before 4 in the morning after installing a Windows update. Today, three of four Evo servers for this customer were rebooted, and it caused several hours of havoc until we figured out what happened (thankfully, it wasn't a repeat of the bad power problems we'd recently fixed, but we couldn't know that without research).
This can be caused by fully automatic behavior of the LabTech software, or it could be that an MSP technician surveyed the systems yesterday and manually scheduled a 4AM install+restart; in this case, the lack of entries in the Task Scheduler suggests that this was LabTech running on autopilot.
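For anyone doing the same research: Windows normally records a requested restart in the System event log as event ID 1074, and the message names the process that asked for it. Below is a minimal sketch of pulling the initiator out of such a message. The sample text is a hypothetical reconstruction modeled on this incident, not a copy of the real log entry (the file path in particular is an assumption), and `parse_restart_event` is just an illustrative name.

```python
import re

# Usual wording of an event ID 1074 message (assumed; exact text can vary
# between Windows versions and languages).
RESTART_RE = re.compile(
    r"The process (?P<process>.+?) \((?P<host>[^)]+)\) has initiated the "
    r"(?P<action>restart|power off|shutdown) of computer (?P<computer>\S+) "
    r"on behalf of user (?P<user>.+?) for the following reason",
    re.IGNORECASE,
)


def parse_restart_event(message):
    """Return who asked for the restart, or None if the text doesn't match."""
    match = RESTART_RE.search(message)
    if match is None:
        return None
    info = match.groupdict()
    # Keep only the executable name, e.g. "LTSVC.exe", for quick triage.
    info["process_name"] = re.split(r"[\\/]", info["process"])[-1]
    return info


# Hypothetical sample message; the path is made up for illustration.
SAMPLE = (
    r"The process C:\Windows\LTSvc\LTSVC.exe (EVO6) has initiated the restart "
    r"of computer EVO6 on behalf of user NT AUTHORITY\SYSTEM for the following "
    r"reason: No title for this reason could be found"
)

if __name__ == "__main__":
    info = parse_restart_event(SAMPLE)
    print(info["process_name"], "->", info["action"], "of", info["computer"])
```

If the process name belongs to the MSP's agent rather than to Windows Update itself or a logged-in person, that's a strong hint about where to look next.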
Guidelines for MSPs / outside IT consultants
The Evolution line-of-business application servers are central to the business of your customer, and they are unfortunately kind of picky about things that they shouldn't be, but we still have to work around them.
- Evolution systems are generally working 24x7
- Though humans are not typically logging in to do payrolls at all hours of the day, service bureaus schedule many tasks to run overnight so that they're available in the morning, so it's very common to find the system busy at 11:30PM or 3AM or whatever. There is no time of day that's inherently "safe" for an automatic reboot.
- Also keep in mind that many bureaus service customers in all timezones, so 8PM on the East coast is still only 5PM in California.
- Evolution application services must be started in a certain order
- Just rebooting an Evolution server, especially the one hosting the Request Broker application service, will almost always leave it in an unusable state, because the services are unlikely to happen to start in the proper order. I've written about that here:
- Evo Tip: Best Practices for Managing Evolution Services
- This is really unfortunate, and I believe it borders on a bug in the software, but we work with the software we have, not what we wish we had. There's no obvious or trivial way to work around this in an automated manner.
- Be sure your MSP software never automatically reboots an Evo server for any reason
- This includes a for-good-measure reboot, or one that applies a Windows update.
- Be sure your technicians know about this
- I find conscientious technicians schedule overnight reboots frequently, and those break Evolution just as an automated one does.
- DO NOT configure any automatic temporary file cleanups
- Evolution uses many temporary files, and some of them necessarily have to hang around for a long time. LabTech MSP software, in particular, has a module to "clean up" older temporary files. This breaks Evolution badly - don't enable this module at all.
- DO NOT configure any automatic network tests/probes of the Evolution services
- Evolution services do not always respond well to random TCP/IP probes, which is unfortunate but still a fact of life: tests just to make sure the service is up typically do more harm than good.
- And though properly-configured probes can be done in some cases, I've seen local mail relay (often running on one of the Evo middle tier machines) effectively disabled by port 25/tcp tests done badly, causing chronic email delivery problems until these unnecessary probes were disabled.
- DO NOT configure any realtime antivirus without coordinating file exclusions
- Evolution, especially the Request Broker, reacts very badly if one of its many working files gets locked while performing realtime scanning.
- In addition, if the AV engine mis-characterizes a valid Evolution process as malware, it's going to quarantine it and essentially bring down the entire system.
- This has happened to customers in the past, and it's very painful.
- DO NOT install any third-party network firewall protections.
- I generally leave the standard Windows native firewall enabled because I've got a good handle on the dozen or so exclusions required, but any third party firewall product is likely to do far more harm than good.
- In particular, there cannot be any SSL certificate inspection on the network stream arriving from the outside world into Remote Relay; some special sauce of how Evo handles SSL means no third party software's ever going to get it right.
- Watch out for Evo licensing landmines
- Evolution ties its licensing to the underlying hardware, and many have found themselves with a broken license due to completely innocent changes in the server configuration, the very definition of an unhappy surprise.
- Do not change:
- The hard drive configuration, including plugging in an external USB hard drive. If it presents itself as "removable", it's fine, but if it tells the OS it's a permanent drive, that changes the overall hard drive config and will break the license.
- Be super careful doing anything with the network interfaces; Evo includes the MAC address of the first network card it finds, but the order that NICs appear is highly unpredictable, and merely disabling an unused NIC is a common source of license breakage. However, all the IP address parameters can be changed at any time: IP, DNS, gateway, DHCP configuration - no problem. Server hostname can be changed as well without concern.
- Do not change the "Registered Owner" or "Registered Organization" of the server; these encode into the license as well.
- Note that a license-breaking change is not usually evident right away: it only shows up when a service (or the overall server) restarts, and it's then an emergency that can only be resolved with the Evolution support department.
- Tread very, very carefully with anything on the Linux database server.
- Other than passive monitoring, there's almost nothing that any MSP will be able to usefully do on the Linux database server.
- DO NOT run network security/vulnerability/portscan tests during working hours
- It's always fun to throw Nessus at the local network to see what ports are open and what vulnerabilities there might be, but Evolution does not react well to the unusual traffic presented.
- If a network vulnerability scan is required by an audit, or just by good practice, be sure to do it at a time when Evolution can be stopped and then restarted after the test. The scans won't actually damage any data, but Evo may not work well until its head is cleared.
- DO provide monitoring services and alerts
- It's a value to your customer to give a heads up that a hard drive is filling up, that some process is eating 100% of CPU, or that a power supply has gone bad.
- As you work with your customer's Evolution consultant, you'll get a sense for what things you should be watching for, and which things either don't matter, or are taken care of by somebody else.
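To make the monitoring-versus-intervention line concrete, here's a minimal sketch of the passive kind of check that's welcome on an Evo server: it reads disk usage and reports, and it never deletes or changes anything. The 90% threshold and the `disk_alert` name are assumptions for illustration, not a recommendation from any MSP product.

```python
import shutil


def disk_alert(path, warn_percent=90.0):
    """Return (percent_used, should_alert) for the volume holding `path`.

    Strictly read-only: the volume is measured, never cleaned up.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Unusual filesystem that reports no size; nothing useful to say.
        return 0.0, False
    percent_used = 100.0 * usage.used / usage.total
    return percent_used, percent_used >= warn_percent


if __name__ == "__main__":
    # On an Evolution server the path would typically be "C:\\".
    percent, alert = disk_alert(".")
    status = "ALERT" if alert else "ok"
    print(f"{percent:.1f}% used [{status}]")
```

The point of the design is what's missing: there's no cleanup step. The alert goes to a human, who can coordinate with the Evolution consultant before touching anything.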
Did the system reboot?
For the Evolution user finding an unexplained error showing up first thing in the morning, it's good to take a peek at the error message itself to see: did the machine reboot last night?
In this example:
Invalid property value
date/time : 2019-07-23, 08:35:51, 769ms
computer name : EVO6
user name : SYSTEM
operating system : Windows NT New Service Pack 1 build 7601
system language : English
system up time : 4 hours 37 minutes
program up time : 4 hours 37 minutes
processors : 16x Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz
...
Hmmm, at 8:30 AM the machine had been up for 4 hours 37 minutes, so something happened just before 4AM; this is what has to be investigated (in addition to whatever is getting in the way of this particular Evolution task).
- Was somebody in the office that early doing maintenance?
- Did the machine have an operating system panic/bluescreen?
- Is there bad power?
- Was somebody in the server room and kicked a power cord?
- Did somebody turn on the "install updates and reboot" setting in Windows Update?
- Did a new MSP configure its own automatic update/reboot?
In our case, it was the last of these. It's been addressed with the MSP, and we believe that all the problems this morning were due to these inadvertent restarts.
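The arithmetic used above to place the reboot just before 4AM is simple enough to sketch: subtract the reported "system up time" from the report's timestamp. The `estimate_boot_time` helper is an illustrative name, and the handling of a "days" component is an assumption, since the report in this example shows only hours and minutes.

```python
import re
from datetime import datetime, timedelta


def estimate_boot_time(report_time, uptime_text):
    """Subtract an uptime such as '4 hours 37 minutes' from the report time."""
    # Collect each "<number> <unit>" pair; singular and plural units both match.
    parts = {
        unit: int(count)
        for count, unit in re.findall(r"(\d+)\s+(day|hour|minute)s?", uptime_text)
    }
    uptime = timedelta(
        days=parts.get("day", 0),
        hours=parts.get("hour", 0),
        minutes=parts.get("minute", 0),
    )
    return report_time - uptime


# Values taken from the error report in this post.
report_time = datetime(2019, 7, 23, 8, 35, 51)
boot_time = estimate_boot_time(report_time, "4 hours 37 minutes")
print(boot_time)  # 2019-07-23 03:58:51
```

Because the uptime is only reported to the minute, the result is good to within about a minute, which is plenty for matching it against entries in the Windows Event Log.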