Improve Pulse execution reliability: retry logic, better logging, SSH hardening

monitor.py / diagnose.py PulseClient.run_command:
- Add automatic single retry on submit failure, explicit Pulse failure
  (status=failed/timed_out), and poll timeout — handles transient SSH
  or Pulse hiccups without dropping the whole collection cycle
- Log execution_id and full Pulse URL on every failure so failed runs
  can be found in the Pulse UI immediately
- Handle 'timed_out' and 'cancelled' Pulse statuses explicitly (previously
  only 'failed' was caught; others would spin until local deadline)
- Poll every 2s instead of 1s to reduce Pulse API chatter

SSH command options (_ssh_batch + diagnose.py):
- Add BatchMode=yes: aborts immediately instead of hanging on a
  password prompt if key auth fails
- Add ServerAliveInterval=10 ServerAliveCountMax=2: SSH detects a
  hung remote command within ~20s instead of sitting silent until the
  45s Pulse timeout expires

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-03-15 09:19:07 -04:00
parent 2c67944b4b
commit b29b70d88b
2 changed files with 39 additions and 8 deletions

View File

@@ -77,7 +77,9 @@ class DiagnosticsRunner:
return (
f'ssh -o StrictHostKeyChecking=no -o ConnectTimeout=5 '
f'-o LogLevel=ERROR root@{ip_q} \'{remote_cmd}\''
f'-o BatchMode=yes -o LogLevel=ERROR '
f'-o ServerAliveInterval=10 -o ServerAliveCountMax=2 '
f'root@{ip_q} \'{remote_cmd}\''
)
# ------------------------------------------------------------------