# System Health Monitoring Daemon

A robust system health monitoring daemon that tracks hardware status and automatically creates tickets for detected issues.

## Features

- Comprehensive system health monitoring:
  - Drive health (SMART status and disk usage)
  - Memory usage
  - CPU utilization
  - Network connectivity (Management and Ceph networks)
- Automatic ticket creation for detected issues
- Configurable thresholds and monitoring parameters
- Dry-run mode for testing
- Systemd integration for automated daily checks
- LXC container storage monitoring
- Historical trend analysis for predictive failure detection
- Manufacturer-specific SMART attribute interpretation
- ECC memory error detection

## Installation

1. Copy the service and timer files to systemd:

   ```bash
   sudo cp hwmon.service /etc/systemd/system/
   sudo cp hwmon.timer /etc/systemd/system/
   ```

2. Reload systemd daemon:

   ```bash
   sudo systemctl daemon-reload
   ```

3. Enable and start the timer:

   ```bash
   sudo systemctl enable hwmon.timer
   sudo systemctl start hwmon.timer
   ```

### One liner (run as root)

```bash
curl -o /etc/systemd/system/hwmon.service http://10.10.10.110:3000/JWS/hwmonDaemon/raw/branch/main/hwmon.service && curl -o /etc/systemd/system/hwmon.timer http://10.10.10.110:3000/JWS/hwmonDaemon/raw/branch/main/hwmon.timer && systemctl daemon-reload && systemctl enable hwmon.timer && systemctl start hwmon.timer
```

## Manual Execution

1. Run the daemon with dry-run mode to test:

   ```bash
   python3 hwmonDaemon.py --dry-run
   ```

2. Run the daemon normally:

   ```bash
   python3 hwmonDaemon.py
   ```

## Configuration

The daemon monitors:

- Disk usage (warns at 80%, critical at 90%)
- LXC storage usage (warns at 80%, critical at 90%)
- Memory usage (warns at 80%)
- CPU usage (warns at 95%)
- Network connectivity to management (10.10.10.1) and Ceph (10.10.90.1) networks
- SMART status of physical drives with manufacturer-specific profiles
- Temperature monitoring (warns at 65°C)
- Automatic duplicate ticket prevention
- Enhanced logging with debug capabilities

## Data Storage

The daemon creates and maintains:

- **Log Directory**: `/var/log/hwmonDaemon/`
- **Historical SMART Data**: JSON files for trend analysis
- **Data Retention**: 30 days of historical monitoring data

## Ticket Creation

The daemon automatically creates tickets with:

- Standardized titles including hostname, hardware type, and scope
- Detailed descriptions of detected issues with drive specifications
- Priority levels based on severity (P2-P4)
- Proper categorization and status tracking
- Executive summaries and technical analysis

## Dependencies

- Python 3
- Required Python packages:
  - psutil
  - requests
- System tools:
  - smartmontools (for SMART disk monitoring)
  - nvme-cli (for NVMe drive monitoring)

## Excluded Paths

The following paths are automatically excluded from monitoring:

- `/media/*`
- `/mnt/pve/mediafs/*`
- `/opt/metube_downloads`
- Pattern-based exclusions for media and download directories

## Service Configuration

The daemon runs:

- Daily via systemd timer
- As root user for hardware access
- With automatic restart on failure

## Troubleshooting

```bash
# View service logs
sudo journalctl -u hwmon.service -f

# Check service status
sudo systemctl status hwmon.timer

# Manual test run
python3 hwmonDaemon.py --dry-run
```

## Security Note

Ensure proper network security measures are in place as the service downloads and executes code from a specified URL.
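
## Example: Threshold and Exclusion Logic (Sketch)

To make the thresholds under Configuration and the patterns under Excluded Paths concrete, here is a minimal sketch of how usage percentages might be classified and how mount points might be filtered. It is not the daemon's actual code. The names `THRESHOLDS`, `EXCLUDED_PATTERNS`, `classify`, and `is_excluded` are hypothetical. Treating a value exactly at a threshold as a breach, and matching exclusions as glob patterns, are both assumptions.

```python
from fnmatch import fnmatchcase

# Usage thresholds in percent, copied from the Configuration section.
# The dictionary layout is an assumption made for illustration.
THRESHOLDS = {
    "disk": {"warning": 80, "critical": 90},
    "lxc_storage": {"warning": 80, "critical": 90},
    "memory": {"warning": 80},
    "cpu": {"warning": 95},
}

# Exclusions copied from the Excluded Paths section.
# Interpreting them as glob patterns is an assumption.
EXCLUDED_PATTERNS = [
    "/media/*",
    "/mnt/pve/mediafs/*",
    "/opt/metube_downloads",
]


def classify(resource: str, percent: float) -> str:
    """Return 'critical', 'warning', or 'ok' for a usage percentage.

    Assumes thresholds are inclusive (80.0 counts as a warning).
    """
    limits = THRESHOLDS[resource]
    if "critical" in limits and percent >= limits["critical"]:
        return "critical"
    if percent >= limits["warning"]:
        return "warning"
    return "ok"


def is_excluded(mountpoint: str) -> bool:
    """Return True if the mount point matches any exclusion pattern."""
    return any(fnmatchcase(mountpoint, pattern) for pattern in EXCLUDED_PATTERNS)


if __name__ == "__main__":
    # Made-up sample readings, not real measurements.
    samples = [("/", 85.0), ("/media/usb", 99.0), ("/var", 91.5)]
    for mountpoint, percent in samples:
        if is_excluded(mountpoint):
            continue  # Excluded paths are skipped entirely.
        print(mountpoint, classify("disk", percent))
```

In a real run, the percentages would come from a library such as `psutil` (listed under Dependencies) rather than from hard-coded samples.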