KBEC-00241 - Jobs are stuck after the ElectricFlow server is brought back up

Summary

The ElectricFlow server went down (for example, because its database crashed) and stayed down for more than 24 hours. After the server is brought back up, jobs that were running before the outage are stuck, and because no timeouts were set they never end on their own.

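This happens because of how an agent handles an unreachable server. If the server is down, the agent on which the jobs were running retries its message to the server with successively longer pauses (up to 30 seconds between attempts) for up to 24 hours. After that, it presumes the server is dead and drops the message, so the server never hears back about those steps.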
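The retry pattern described above can be sketched in shell. This is an illustration only, not the agent's actual code: the article states the 30-second cap and the 24-hour limit, while the 1-second starting pause and the doubling between attempts are assumptions, and retry_pauses is a hypothetical helper name.

```shell
# Print the pause, in seconds, before each of the first N retries.
# Assumed: the pause starts at 1 second and doubles after each attempt.
# From the article: a single pause never exceeds 30 seconds.
retry_pauses() {
    count=$1
    pause=1          # assumed initial pause
    max_pause=30     # cap stated in the article
    out=""
    i=0
    while [ "$i" -lt "$count" ]; do
        out="${out:+$out }$pause"
        pause=$((pause * 2))
        if [ "$pause" -gt "$max_pause" ]; then
            pause=$max_pause
        fi
        i=$((i + 1))
    done
    echo "$out"
}

retry_pauses 8   # prints: 1 2 4 8 16 30 30 30
```

Under these assumptions the pause reaches the cap within a few attempts, after which the agent would retry about every 30 seconds until the 24-hour limit expires.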

Solution

You can do one of the following:

  • Manually abort the job
  • Restart the agent with the stuck jobs

If you restart the agent, the ElectricFlow server detects the restart the next time it tries to run a command on the agent or pings the agent, and then aborts the running steps. An agent restart is conclusive evidence that steps started during the agent's previous life are no longer running.
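For the first option, a job can also be aborted from the command line with the ectool abortJob command. The following is a minimal sketch, assuming ectool is on the PATH and a session already exists (for example via ectool login). The names abort_stuck_job and DRY_RUN and the job ID are hypothetical placeholders.

```shell
# Sketch: abort a stuck job by its job ID.
# abort_stuck_job and DRY_RUN are hypothetical names for this example.
abort_stuck_job() {
    job_id=$1
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only show the command so it can be reviewed first.
        echo "ectool abortJob $job_id"
    else
        ectool abortJob "$job_id"
    fi
}

abort_stuck_job example-job-id   # placeholder job ID
```

By default the function only prints the command. Set DRY_RUN=0 to run it against the server.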

Additionally, you can change the 24-hour limit for a specific API call by passing the --retryTimeout global option to ectool:

ectool --retryTimeout <s>  

Amount of time (in seconds) to continue retrying requests that fail due to communication errors. Defaults to the --timeout value unless running in a job step, in which case the default is 24 hours.
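As a usage sketch, the option goes before the API command name. The example below assumes the value is given in seconds. The helper hours_to_seconds, the 48-hour window, and the property /myJob/buildStatus are hypothetical choices for illustration.

```shell
# Sketch: raise the retry window for one API call made from a job step.
# hours_to_seconds is a hypothetical helper; --retryTimeout is assumed
# to take a value in seconds.
hours_to_seconds() {
    echo $(( $1 * 3600 ))
}

RETRY_SECONDS=$(hours_to_seconds 48)

# Print the command for review; drop the leading "echo" to run it.
echo ectool --retryTimeout "$RETRY_SECONDS" setProperty /myJob/buildStatus done
```

With a 48-hour window, this particular call would keep retrying through an outage longer than the default 24 hours. Other calls keep their own defaults.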

 
