High Availability and Flow Performance
What type of performance gains should you expect from a HA/HS setup of ElectricFlow?
With the release of 5.0, Electric Cloud has introduced High Availability (HA) and High Scalability (HS), and as a result some customers have questions about the benefit of these features. This article attempts to address some of those questions.
As a High-Availability solution, spinning up additional Flow servers gives you redundancy to protect from the risk of one of those servers inadvertently dropping. This same architecture change also provides you with the benefits of high-scalability, since the jobs will be distributed to a set of machines. However, whenever you scale up your system, your scale could push you to encounter a "weakest-link" scenario. If your Database already appears to be a bottleneck, moving to HA alone will not alleviate that problem. So if you are working towards setting up an HS solution with the expectation of having more work to go through the Commander system, you will need to ensure that your DB and other aspects of your environment (network, agent pools etc...) are ready to handle a load that is larger than what you have been trying previously.
While the HA architecture does allow for the management of work to be distributed to multiple servers, we still only run with a single scheduler on one of the Flow servers. Should the Flow server that is managing the scheduler be the one that experiences an unexpected drop, there would be a small delay of service while one of the other servers picks up that responsibility, but no jobs should be lost.
Let us repeat that, there is only one scheduler across all servers. Once a job begins, any steps will be assigned to be managed by a single server so that all the work for that job will go through that same server, but it's still up to the scheduler to manage the queue of steps and determine which steps should run when (priorities, order of submission, etc...). Note that each Flow server collects its own history inside the commander.log files, so having all steps for a single job get managed by a single server makes it convenient to trace through any log output tied to that job. The only time that pattern would not hold is if there is a downtime event on one of the HA servers such that the job gets re-distributed to be run through one of the other Commander server machines.
So do jobs ever get distributed across servers? I'm having trouble, other than in a failover, to see what advantage, if any, there is to having multiple Flow servers.
As stated, the current implementation will have a job allocated to a particular server, so that all of the associated steps will stay allocated to that server. The distribution of jobs is managed through a hashing mechanism, which will, over time, distribute jobs relatively evenly, but if you are looking at just a small number of big jobs, it's possible that 2 such jobs could end up being allocated to the same machine. So there is no "active number of jobs" or "active number of steps" awareness being made in the decision to select a particular machine to use when a new job first arrives.
This same "random" distribution mechanism is also used when a server drops and the jobs have to be moved over to run on other servers - so the expectation is that the work will be fairly evenly shipped to the remaining machines.
If you have a scenario where you launch some single jobs that contain a very large number of steps, this system may not be as suitable for High Scalability. Effectively, the HA distribution of work can be considered a coarse-grained approach. Attempting to implement a more fine-grained (step-based) model would certainly make the tracing of job flow more complicated (logs split between multiple nodes), but also would require additional status traffic to happen between the various servers when deciding on which steps to move forward with, so these complications were avoided by implementing the coarse-grained approach.
What is the mechanism by which jobs are distributed?
The server takes the ID from the job, and pushes it through a hashing function which ultimately ends up allocating the job to one of the active servers. Hence, it's relatively random, but expected to result in being fairly evenly distributed when measured against a larger volume of jobs.
With the coarse-grained nature of this solution, the High Scalability of large jobs with multiple steps will not be as effective as smaller jobs with fewer steps, but it will help to distribute jobs to multiple servers, thus allowing for more scalability. But do keep in mind, that creating a HA/HS environment with ElectricFlow may expose limitations in other areas of your system, so be prepared to make changes where necessary.
What can cause a different load distribution between nodes?
High distributions difference between nodes can be caused in case of very brief node outage. For example, there are two nodes, and one of them was rebooted. Every job it was handling at the time will get offloaded to the other node for those few minutes. This could increase the job count for that node a lot.
Brief node outage also can be caused by high average job runtime or if someone makes requests directly to a specific node that may cause an unbalanced load too.