Jared Vititoe fe832c42f3 Fix critical reliability and security issues in hwmonDaemon
Critical fixes implemented:
- Add 10MB storage limit with automatic cleanup of old history files
- Add file locking (fcntl) to prevent race conditions in history writes
- Disable SMART monitoring for unreliable Ridata drives
- Fix bare except clause in _read_ecc_count() to properly catch errors
- Add timeouts to all network and subprocess calls (10s for API, 30s for subprocess)
- Fix unchecked regex in ticket creation to prevent AttributeError
- Add JSON decode error handling for ticket API responses

Service configuration improvements:
- hwmon.timer: Reduce jitter from 300s to 60s, add Persistent=true
- hwmon.service: Add Restart=on-failure, TimeoutStartSec=300, logging to journal

These changes improve reliability, prevent hung processes, eliminate race
conditions, and add proper error handling throughout the daemon.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 16:55:48 -05:00

System Health Monitoring Daemon

A robust system health monitoring daemon that tracks hardware status and automatically creates tickets for detected issues.

Features

  • Comprehensive system health monitoring:
    • Drive health (SMART status and disk usage)
    • Memory usage
    • CPU utilization
    • Network connectivity (Management and Ceph networks)
  • Automatic ticket creation for detected issues
  • Configurable thresholds and monitoring parameters
  • Dry-run mode for testing
  • Systemd integration for automated daily checks
  • LXC container storage monitoring
  • Historical trend analysis for predictive failure detection
  • Manufacturer-specific SMART attribute interpretation
  • ECC memory error detection

Installation

  1. Copy the service and timer files to systemd:
sudo cp hwmon.service /etc/systemd/system/
sudo cp hwmon.timer /etc/systemd/system/
  2. Reload systemd daemon:
sudo systemctl daemon-reload
  3. Enable and start the timer:
sudo systemctl enable hwmon.timer
sudo systemctl start hwmon.timer

One-liner (run as root)

curl -o /etc/systemd/system/hwmon.service http://10.10.10.110:3000/JWS/hwmonDaemon/raw/branch/main/hwmon.service && curl -o /etc/systemd/system/hwmon.timer http://10.10.10.110:3000/JWS/hwmonDaemon/raw/branch/main/hwmon.timer && systemctl daemon-reload && systemctl enable hwmon.timer && systemctl start hwmon.timer

Manual Execution

  1. Run the daemon in dry-run mode to test:
python3 hwmonDaemon.py --dry-run
  2. Run the daemon normally:
python3 hwmonDaemon.py

Configuration

The daemon applies the following default thresholds and behaviors:

  • Disk usage (warns at 80%, critical at 90%)
  • LXC storage usage (warns at 80%, critical at 90%)
  • Memory usage (warns at 80%)
  • CPU usage (warns at 95%)
  • Network connectivity to management (10.10.10.1) and Ceph (10.10.90.1) networks
  • SMART status of physical drives with manufacturer-specific profiles
  • Temperature monitoring (warns at 65°C)
  • Automatic duplicate ticket prevention
  • Enhanced logging with debug capabilities
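
The thresholds listed above can be pictured as a small lookup table plus a classifier. This is a minimal sketch, not the daemon's actual code: the names `THRESHOLDS` and `classify` are hypothetical, and treating a reading equal to the threshold as a breach is an assumption.

```python
# Threshold values come from the list above; the structure and names are illustrative.
THRESHOLDS = {
    "disk": {"warning": 80.0, "critical": 90.0},
    "lxc": {"warning": 80.0, "critical": 90.0},
    "memory": {"warning": 80.0},
    "cpu": {"warning": 95.0},
    "temperature": {"warning": 65.0},  # degrees Celsius
}


def classify(metric: str, value: float) -> str:
    """Return 'critical', 'warning', or 'ok' for a single reading."""
    levels = THRESHOLDS[metric]
    # Assumption: a reading exactly at the threshold counts as a breach.
    if "critical" in levels and value >= levels["critical"]:
        return "critical"
    if value >= levels["warning"]:
        return "warning"
    return "ok"
```

For example, `classify("disk", 85.0)` falls between the two disk thresholds and is reported as a warning.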

Data Storage

The daemon creates and maintains:

  • Log Directory: /var/log/hwmonDaemon/
  • Historical SMART Data: JSON files for trend analysis
  • Data Retention: 30 days of historical monitoring data
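
The 30-day retention policy could be enforced along these lines. This is a sketch under stated assumptions: `prune_history` is a hypothetical helper, history files are assumed to be `*.json` files in a single directory, and file age is judged by modification time. The daemon's real cleanup logic may differ.

```python
import time
from pathlib import Path

RETENTION_DAYS = 30  # retention period stated above


def prune_history(history_dir, now=None, retention_days=RETENTION_DAYS):
    """Delete *.json files older than the retention window.

    Returns the names of the removed files, sorted alphabetically.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds per day
    removed = []
    for path in sorted(Path(history_dir).glob("*.json")):
        # Assumption: modification time is a good proxy for the record's age.
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Passing `now` explicitly keeps the function easy to test without touching the system clock.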

Ticket Creation

The daemon automatically creates tickets with:

  • Standardized titles including hostname, hardware type, and scope
  • Detailed descriptions of detected issues with drive specifications
  • Priority levels based on severity (P2-P4)
  • Proper categorization and status tracking
  • Executive summaries and technical analysis
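
To make the ticket fields above concrete, here is a hedged sketch of how a payload might be assembled. Everything in it is hypothetical: the `build_ticket` helper, the bracketed title layout, and the severity-to-priority mapping are illustrations only. The README states just that titles include hostname, hardware type, and scope, and that priorities range from P2 to P4.

```python
import socket

# Hypothetical mapping; only the P2-P4 range is documented.
PRIORITY_BY_SEVERITY = {"critical": "P2", "warning": "P3", "info": "P4"}


def build_ticket(hardware_type, scope, issue, severity, hostname=None):
    """Assemble an illustrative ticket payload as a plain dict."""
    hostname = hostname or socket.gethostname()
    return {
        # Assumed title layout: [hostname] [hardware type] [scope] issue
        "title": f"[{hostname}] [{hardware_type}] [{scope}] {issue}",
        # Unknown severities fall back to the lowest priority in the range.
        "priority": PRIORITY_BY_SEVERITY.get(severity, "P4"),
        "description": issue,
    }
```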

Dependencies

  • Python 3
  • Required Python packages:
    • psutil
    • requests
  • System tools:
    • smartmontools (for SMART disk monitoring)
    • nvme-cli (for NVMe drive monitoring)
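
Before the first run, it can help to confirm that these dependencies are present. The sketch below is not part of the daemon and `missing_dependencies` is a hypothetical helper. It checks that the Python packages are importable and that `smartctl` and `nvme`, the binaries installed by smartmontools and nvme-cli, are on `PATH`.

```python
import importlib.util
import shutil

REQUIRED_MODULES = ("psutil", "requests")
REQUIRED_TOOLS = ("smartctl", "nvme")  # from smartmontools and nvme-cli


def missing_dependencies(modules=REQUIRED_MODULES, tools=REQUIRED_TOOLS):
    """Return the names of Python modules and CLI tools that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [t for t in tools if shutil.which(t) is None]
    return missing


if __name__ == "__main__":
    absent = missing_dependencies()
    print("Missing: " + ", ".join(absent) if absent else "All dependencies found")
```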

Excluded Paths

The following paths are automatically excluded from monitoring:

  • /media/*
  • /mnt/pve/mediafs/*
  • /opt/metube_downloads
  • Pattern-based exclusions for media and download directories
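
One way to express the explicit exclusions above is shell-style pattern matching. This is a minimal sketch that assumes glob semantics. `is_excluded` is a hypothetical name, and the unspecified pattern-based exclusions for media and download directories are not covered.

```python
from fnmatch import fnmatchcase

# The three explicit entries from the list above.
EXCLUDED_PATTERNS = ("/media/*", "/mnt/pve/mediafs/*", "/opt/metube_downloads")


def is_excluded(mountpoint, patterns=EXCLUDED_PATTERNS):
    """Return True if the mount point matches any excluded pattern."""
    # In fnmatch-style patterns, '*' also matches '/', so nested paths match too.
    return any(fnmatchcase(mountpoint, pattern) for pattern in patterns)
```

With these patterns, `/media/usb0` is skipped, while `/var/lib/vz` is still monitored.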

Service Configuration

The daemon runs:

  • Daily via systemd timer
  • As root user for hardware access
  • With automatic restart on failure

Troubleshooting

# View service logs
sudo journalctl -u hwmon.service -f

# Check service status
sudo systemctl status hwmon.timer

# Manual test run
python3 hwmonDaemon.py --dry-run

Security Note

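The service downloads and executes code from a specified URL, so restrict access to the host serving that URL and to the network path between it and the monitored nodes. The one-liner above fetches the unit files over plain HTTP, which provides no integrity protection.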

Description
A Python-based system health monitoring daemon that automatically tracks hardware status and creates tickets for detected issues in the LotusGuild Cluster.