System Health Monitoring Daemon
A robust system health monitoring daemon that tracks hardware status and automatically creates tickets for detected issues.
Features
- Comprehensive system health monitoring:
  - Drive health (SMART status and disk usage)
  - Memory usage
  - CPU utilization
  - Network connectivity (Management and Ceph networks)
- Automatic ticket creation for detected issues
- Configurable thresholds and monitoring parameters
- Dry-run mode for testing
- Systemd integration for automated hourly checks
- LXC container storage monitoring
- Historical trend analysis for predictive failure detection
- Manufacturer-specific SMART attribute interpretation
- ECC memory error detection
Installation
- Copy the service and timer files to systemd:
sudo cp hwmon.service /etc/systemd/system/
sudo cp hwmon.timer /etc/systemd/system/
- Reload systemd daemon:
sudo systemctl daemon-reload
- Enable and start the timer:
sudo systemctl enable hwmon.timer
sudo systemctl start hwmon.timer
One liner (run as root)
curl -o /etc/systemd/system/hwmon.service http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmon.service && curl -o /etc/systemd/system/hwmon.timer http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmon.timer && systemctl daemon-reload && systemctl enable hwmon.timer && systemctl start hwmon.timer
Manual Execution
Direct Execution (from local file)
- Run the daemon in dry-run mode to test:
python3 hwmonDaemon.py --dry-run
- Run the daemon normally:
python3 hwmonDaemon.py
Remote Execution (same as systemd service)
Execute directly from repository without downloading:
- Run in dry-run mode to test:
/usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))" --dry-run
- Run normally (creates actual tickets):
/usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))"
Configuration
The daemon applies the following default thresholds and behaviors:
- Disk usage (warns at 80%, critical at 90%)
- LXC storage usage (warns at 80%, critical at 90%)
- Memory usage (warns at 80%)
- CPU usage (warns at 95%)
- Network connectivity to management (10.10.10.1) and Ceph (10.10.90.1) networks
- SMART status of physical drives with manufacturer-specific profiles
- Temperature monitoring (warns at 65°C)
- Automatic duplicate ticket prevention
- Enhanced logging with debug capabilities
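The percentage thresholds above can be read as a small lookup table. The following is a minimal sketch of that reading, not the daemon's actual code: the names `THRESHOLDS` and `classify` are hypothetical, and treating a value exactly at a threshold as a breach is an assumption.

```python
from typing import Dict, Optional, Tuple

# (warning, critical) thresholds in percent, taken from the list above.
# None means the README states no critical level for that metric.
THRESHOLDS: Dict[str, Tuple[float, Optional[float]]] = {
    "disk": (80.0, 90.0),
    "lxc_storage": (80.0, 90.0),
    "memory": (80.0, None),
    "cpu": (95.0, None),
}


def classify(metric: str, percent: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a usage percentage."""
    warn, crit = THRESHOLDS[metric]
    # Assumption: a value equal to the threshold counts as a breach.
    if crit is not None and percent >= crit:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"
```

For example, `classify("disk", 85.0)` returns `"warning"`, while `classify("memory", 99.0)` also returns `"warning"` because no critical level is listed for memory.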
Data Storage
The daemon creates and maintains:
- Log Directory: /var/log/hwmonDaemon/
- Historical SMART Data: JSON files for trend analysis
- Data Retention: 30 days of historical monitoring data
- Storage Limit: Automatically enforced 10MB maximum
- Cleanup: Oldest files deleted first when limit exceeded
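The retention and size rules above could be enforced along the following lines. This is a sketch under stated assumptions rather than the daemon's implementation: `enforce_limits` is a hypothetical helper, file age is judged by modification time, and 10MB is taken to mean 10 * 1024 * 1024 bytes.

```python
import time
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024      # 10MB storage limit (assumed binary megabytes)
MAX_AGE_SECONDS = 30 * 24 * 3600  # 30-day retention


def enforce_limits(directory: Path, max_bytes: int = MAX_BYTES,
                   max_age: float = MAX_AGE_SECONDS) -> list:
    """Delete expired files, then the oldest files until under max_bytes.

    Returns the names of the deleted files in deletion order.
    """
    now = time.time()
    deleted = []
    # Sort oldest first, using modification time as the file's age.
    files = sorted((p for p in directory.iterdir() if p.is_file()),
                   key=lambda p: p.stat().st_mtime)
    kept = []
    for path in files:
        if now - path.stat().st_mtime > max_age:
            path.unlink()
            deleted.append(path.name)
        else:
            kept.append(path)
    # Trim from the oldest end until the directory fits the size limit.
    total = sum(p.stat().st_size for p in kept)
    for path in kept:
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        deleted.append(path.name)
    return deleted
```

Calling `enforce_limits(Path("/var/log/hwmonDaemon"))` would apply both rules to the log directory named above.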
Ticket Creation
The daemon automatically creates tickets with:
- Standardized titles including hostname, hardware type, and scope
- Detailed descriptions of detected issues with drive specifications
- Priority levels based on severity (P2-P4)
- Proper categorization and status tracking
- Executive summaries and technical analysis
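To make the ticket fields above concrete, here is an illustrative sketch of assembling a ticket payload. Everything specific in it is an assumption: the README does not document the title layout, the field names, or which severity maps to which priority, so `build_ticket`, `PRIORITY_BY_SEVERITY`, and the bracketed title format are hypothetical.

```python
import socket

# Hypothetical mapping; the README only states that priorities
# range from P2 to P4 depending on severity.
PRIORITY_BY_SEVERITY = {"critical": "P2", "warning": "P3", "info": "P4"}


def build_ticket(severity, hardware_type, scope, issue, hostname=None):
    """Assemble a ticket payload as a plain dict (field names are illustrative)."""
    host = hostname or socket.gethostname()
    return {
        # Title carries hostname, hardware type, and scope, as listed above.
        "title": f"[{host}][{hardware_type}][{scope}] {issue}",
        "priority": PRIORITY_BY_SEVERITY[severity],
        "description": issue,
    }
```

Keeping the title fully determined by host, hardware type, scope, and issue also gives a natural key for detecting duplicate tickets, though whether the daemon deduplicates this way is not stated.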
Dependencies
- Python 3
- Required Python packages:
  - psutil
  - requests
- System tools:
  - smartmontools (for SMART disk monitoring)
  - nvme-cli (for NVMe drive monitoring)
Excluded Paths
The following paths are automatically excluded from monitoring:
- /media/*
- /mnt/pve/mediafs/*
- /opt/metube_downloads
- Pattern-based exclusions for media and download directories
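One way to apply exclusions like these is shell-style pattern matching on mount points. The sketch below assumes the three patterns `/media/*`, `/mnt/pve/mediafs/*`, and `/opt/metube_downloads`; the helper name `is_excluded` and the use of `fnmatch` are illustrative choices, not confirmed details of the daemon.

```python
from fnmatch import fnmatchcase

# Patterns as read from the exclusion list above.
EXCLUDED_PATTERNS = [
    "/media/*",
    "/mnt/pve/mediafs/*",
    "/opt/metube_downloads",
]


def is_excluded(mountpoint: str) -> bool:
    """Return True if the mount point matches any exclusion pattern."""
    # fnmatchcase is case-sensitive; '*' also matches across '/' separators.
    return any(fnmatchcase(mountpoint, pattern) for pattern in EXCLUDED_PATTERNS)
```

With these patterns, `is_excluded("/media/usb0")` is true and `is_excluded("/var/lib/vz")` is false.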
Service Configuration
The daemon runs:
- Hourly via systemd timer (with 60-second randomized delay)
- As root user for hardware access
- With automatic restart on failure
- 5-minute timeout for execution
- Logs to systemd journal
Recent Improvements
Version 2.0 (January 2026):
- ✅ Added 10MB storage limit with automatic cleanup
- ✅ File locking to prevent race conditions
- ✅ Disabled monitoring for unreliable Ridata drives
- ✅ Added timeouts to all network/subprocess calls (10s API, 30s subprocess)
- ✅ Fixed unchecked regex patterns
- ✅ Improved error handling throughout
- ✅ Enhanced systemd service configuration with restart policies
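As an illustration of the subprocess timeout mentioned in the list above, a wrapper of roughly this shape keeps a hung tool from stalling the whole run. It is a sketch only: `run_command` is a hypothetical name, and returning `None` on timeout or a missing binary is an assumed convention.

```python
import subprocess

SUBPROCESS_TIMEOUT = 30  # seconds, matching the subprocess timeout listed above


def run_command(args, timeout=SUBPROCESS_TIMEOUT):
    """Run a command and return its stdout, or None on timeout or missing binary."""
    try:
        result = subprocess.run(
            args,
            capture_output=True,  # collect stdout and stderr instead of inheriting them
            text=True,            # decode output to str
            timeout=timeout,      # raises TimeoutExpired once the limit is hit
            check=False,          # exit codes are left for the caller to interpret
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None
    return result.stdout
```

A call such as `run_command(["smartctl", "-H", "/dev/sda"])` would then return the health report text, or `None` if the tool hangs or is not installed; the device path here is only an example.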
Troubleshooting
# View service logs
sudo journalctl -u hwmon.service -f
# Check service status
sudo systemctl status hwmon.timer
# Manual test run
python3 hwmonDaemon.py --dry-run
Security Note
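The systemd service and the remote-execution commands above download hwmonDaemon.py over plain HTTP and execute it as root without verifying its integrity. Run them only on a trusted network segment, restrict write access to the repository, and prefer the local-file method wherever the network cannot be trusted.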
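One possible mitigation, sketched here as a suggestion rather than something the project currently does, is to pin a SHA-256 digest of the script and refuse to run anything that does not match. The function name `fetch_verified` and the idea of a pinned digest are assumptions introduced for illustration.

```python
import hashlib
import urllib.request


def fetch_verified(url: str, expected_sha256: str, timeout: float = 10.0) -> str:
    """Download url and return its text only if the SHA-256 digest matches."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = response.read()
    digest = hashlib.sha256(payload).hexdigest()
    if digest != expected_sha256.lower():
        # Refuse to hand back code whose content differs from the pinned version.
        raise ValueError(f"checksum mismatch: got {digest}")
    return payload.decode("utf-8")
```

The returned source could then be passed to `exec` in place of the unverified download. The digest has to be distributed through a channel other than the repository URL itself, otherwise an attacker who can alter the script can alter the digest too.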
Description
A Python-based system health monitoring daemon that automatically tracks hardware status and creates tickets for detected issues in the LotusGuild Cluster.