KBEA-00031 - Configuring the Windows stalled job killer

Summary

You need to adjust the timeout that an agent uses to detect stalled commands because you see messages like:

Command "XXX" was not making progress so it was automatically aborted.

Solution

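Run the following commands, replacing <cm> with the name of your cluster manager and <user> and <password> with your credentials. The example sets the timeout to 120000 ms (2 minutes):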

cmtool --cm=<cm> login <user> <password>
cmtool --cm=<cm> runAgentCmd  "agentexec timeout {{.* 120000 {disk}}}"
cmtool --cm=<cm> runAgentCmd  "agentexec timeout {{.* 120000 {cpu disk}}}"

To change the settings permanently, create (or edit) c:\ECloud\i686_win32\bin\runagent.local on each agent and add the following text, adjusting the timeout values as needed. This example uses 7 minutes (420000 ms):

set commandTimeout\
    {\
         { {.*bin[/\\](ba)?sh.exe.*} 420000 {disk} }\
         { {.*} 420000 {cpu disk} }\
    }
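
The millisecond values above are easy to mistype. As a minimal sketch (Python, not part of the product; the helper names minutes_to_ms and render_command_timeout are hypothetical), the following converts a timeout in minutes to milliseconds and prints text in the same layout as the example above, keeping the same two tuples and changing only the millisecond value:

```python
def minutes_to_ms(minutes):
    """Convert a timeout in minutes to whole milliseconds."""
    return int(round(minutes * 60 * 1000))


def render_command_timeout(minutes):
    """Return runagent.local text in the layout shown in this article.

    The two tuples (shell processes checked for disk activity only,
    everything else checked for cpu and disk) are copied from the
    article; only the millisecond value changes.
    """
    ms = minutes_to_ms(minutes)
    lines = [
        "set commandTimeout\\",
        "    {\\",
        "         { {.*bin[/\\\\](ba)?sh.exe.*} %d {disk} }\\" % ms,
        "         { {.*} %d {cpu disk} }\\" % ms,
        "    }",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the block for a 7-minute timeout, ready to paste.
    print(render_command_timeout(7))
```

Review the printed text before pasting it into runagent.local; the sketch only fills in the number and does not validate anything against the agent.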

After making this change, we recommend rebooting the machine.

To see the current timeout setting, run the following commands (replace <cm> with the name of your cluster manager):

cmtool --cm=<cm> login <user> <password>
cmtool --cm=<cm> runAgentCmd "agentexec timeout"

Note:

  • runagent.local is located in the same location for both 32-bit and 64-bit systems.
  • The default timeout on a Windows agent is 1 minute (60000 ms); that is, a job on the agent times out after 1 minute without any CPU or disk activity.
  • The timeout for a particular process can be configured by adding a tuple whose regular expression matches that process name, for example "sh.exe" or "bash.exe". Each tuple has the format "regexp milliseconds attributes", where "regexp" is a regular expression matched against the process name; "milliseconds" is the number of milliseconds without activity after which a timeout is declared; and "attributes" lists what to check for activity: some combination of "disk" and "cpu".
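
To illustrate the tuple format described above, here is a minimal Python sketch of how such a list could be evaluated. It is not the agent's actual implementation. It assumes that tuples are tried in order and that the first matching regular expression wins (the ordering of the example configuration suggests this, but this article does not state it), and it uses Python's re module as a stand-in for the agent's regular-expression engine. The function name select_timeout and the sample process paths are hypothetical.

```python
import re

# (regexp, milliseconds, attributes) tuples mirroring the example
# configuration in this article.
COMMAND_TIMEOUT = [
    (r".*bin[/\\](ba)?sh.exe.*", 420000, ("disk",)),
    (r".*", 420000, ("cpu", "disk")),
]


def select_timeout(process_name, tuples=COMMAND_TIMEOUT):
    """Return (milliseconds, attributes) for the first matching tuple.

    ASSUMPTION: first match wins; the real agent may behave differently.
    Returns None when no tuple matches.
    """
    for pattern, milliseconds, attributes in tuples:
        if re.search(pattern, process_name):
            return milliseconds, attributes
    return None


if __name__ == "__main__":
    # Hypothetical process paths, used only to exercise the patterns.
    for name in (r"c:\cygwin\bin\bash.exe", r"c:\tools\cl.exe"):
        print(name, "->", select_timeout(name))
```

Under these assumptions, a shell process under a bin directory is checked for disk activity only, while any other process falls through to the catch-all tuple and is checked for both CPU and disk activity.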

Applies to

  • Product versions: All
  • OS versions: Windows
