System Health Monitoring Daemon
A robust system health monitoring daemon that tracks hardware status and automatically creates tickets for detected issues.
Features
- Comprehensive system health monitoring:
  - Drive health (SMART status and disk usage)
  - Memory usage
  - CPU utilization
  - Network connectivity (Management and Ceph networks)
- Automatic ticket creation for detected issues
- Configurable thresholds and monitoring parameters
- Dry-run mode for testing
- Systemd integration for automated daily checks
Installation
- Copy the service and timer files to systemd:
sudo cp hwmon.service /etc/systemd/system/
sudo cp hwmon.timer /etc/systemd/system/
- Reload systemd daemon:
sudo systemctl daemon-reload
- Enable and start the timer:
sudo systemctl enable hwmon.timer
sudo systemctl start hwmon.timer
One-liner (run as root)
curl -o /etc/systemd/system/hwmon.service http://10.10.10.110:3000/JWS/hwmonDaemon/raw/branch/main/hwmon.service && curl -o /etc/systemd/system/hwmon.timer http://10.10.10.110:3000/JWS/hwmonDaemon/raw/branch/main/hwmon.timer && systemctl daemon-reload && systemctl enable hwmon.timer && systemctl start hwmon.timer
Manual Execution
- Run the daemon in dry-run mode to test:
python3 hwmonDaemon.py --dry-run
- Run the daemon normally:
python3 hwmonDaemon.py
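As an illustration of how a dry-run switch typically works, here is a minimal sketch. Only the `--dry-run` flag name comes from this README; `parse_args` and `handle_issue` are hypothetical names, and the actual hwmonDaemon.py may be structured differently.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; only --dry-run is taken from this README."""
    parser = argparse.ArgumentParser(description="System health monitoring daemon")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="report detected issues without creating tickets",
    )
    return parser.parse_args(argv)


def handle_issue(issue, dry_run):
    """Hypothetical helper: describe the action taken for one detected issue."""
    if dry_run:
        # In dry-run mode nothing is sent; the issue is only reported.
        return f"[dry-run] would create ticket: {issue}"
    # A real run would call the ticket system here (not shown in this sketch).
    return f"creating ticket: {issue}"
```

Calling `handle_issue("disk usage high", parse_args(["--dry-run"]).dry_run)` returns the dry-run message instead of reaching the ticket step.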
Configuration
The daemon monitors:
- Disk usage (warns at 80%, critical at 90%)
- Memory usage (warns at 80%)
- CPU usage (warns at 80%)
- Network connectivity to management (10.10.10.1) and Ceph (10.10.90.1) networks
- SMART status of physical drives
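The disk thresholds above can be sketched as a small classifier. This is an assumption-laden example rather than the daemon's own code: the constant and function names are illustrative, the comparison is assumed to be inclusive (80% exactly counts as a warning), and it uses the standard library's `shutil.disk_usage` instead of psutil to stay self-contained.

```python
import shutil

# Thresholds taken from the list above; the names are illustrative.
DISK_WARNING_PERCENT = 80
DISK_CRITICAL_PERCENT = 90


def classify_disk_usage(percent_used):
    """Map a disk usage percentage to 'ok', 'warning', or 'critical'."""
    if percent_used >= DISK_CRITICAL_PERCENT:
        return "critical"
    if percent_used >= DISK_WARNING_PERCENT:
        return "warning"
    return "ok"


def disk_percent_used(path="/"):
    """Return the percentage of used space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

For example, `classify_disk_usage(disk_percent_used("/"))` gives the status of the root filesystem.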
Ticket Creation
The daemon automatically creates tickets with:
- Standardized titles including hostname, hardware type, and scope
- Detailed descriptions of detected issues
- Priority levels based on severity (P2-P4)
- Proper categorization and status tracking
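A ticket with these properties might be assembled as in the sketch below. The title format, the field names, and the severity-to-priority mapping are all assumptions made for illustration; this README only states that titles include hostname, hardware type, and scope, and that priorities range from P2 to P4.

```python
import socket

# Hypothetical mapping; the README only gives the P2-P4 range.
PRIORITY_BY_SEVERITY = {"critical": "P2", "warning": "P3", "info": "P4"}


def build_ticket(hardware_type, scope, severity, details, hostname=None):
    """Build a ticket dict with a standardized title and a severity-based priority."""
    host = hostname or socket.gethostname()
    return {
        "title": f"[{host}] {hardware_type} issue ({scope})",
        "description": details,
        # Unknown severities fall back to the lowest priority in the range.
        "priority": PRIORITY_BY_SEVERITY.get(severity, "P4"),
    }
```

The resulting dict could then be sent to the ticket system, for instance as a JSON body with `requests`; the endpoint and payload schema are not specified here.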
Dependencies
- Python 3
- Required Python packages:
  - psutil
  - requests
- smartmontools (system package, for SMART disk monitoring)
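Before the first run, the dependencies can be checked with a short script such as the following sketch. The function name is hypothetical. It relies on the fact that smartmontools provides the `smartctl` command-line tool, so that binary is looked up on the PATH rather than imported.

```python
import importlib.util
import shutil


def missing_dependencies(modules=("psutil", "requests"), binaries=("smartctl",)):
    """Return the names of Python modules and executables that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    # shutil.which returns None when an executable is not on the PATH.
    missing += [name for name in binaries if shutil.which(name) is None]
    return missing
```

An empty list means everything was found; otherwise the list names what still needs to be installed.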
Service Configuration
The daemon runs:
- Daily via a systemd timer
- As the root user for hardware access
- With automatic restart on failure
Security Note
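The service downloads and executes code from a specified URL, and it runs as root. The one-liner above also fetches the unit files over plain HTTP. Ensure proper network security measures are in place, and restrict access to the host that serves these files.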
Description
A Python-based system health monitoring daemon that automatically tracks hardware status and creates tickets for detected issues in the JWS Cluster.