# System Health Monitoring Daemon

A robust system health monitoring daemon that tracks hardware status and automatically creates tickets for detected issues.

## Features

- Comprehensive system health monitoring:
  - Drive health (SMART status and disk usage)
  - Memory usage
  - CPU utilization
  - Network connectivity (Management and Ceph networks)
- Automatic ticket creation for detected issues
- Configurable thresholds and monitoring parameters
- Dry-run mode for testing
- Systemd integration for automated daily checks
- LXC container storage monitoring
- Historical trend analysis for predictive failure detection
- Manufacturer-specific SMART attribute interpretation
- ECC memory error detection

## Installation

1. Copy the service and timer files to systemd:
   ```bash
   sudo cp hwmon.service /etc/systemd/system/
   sudo cp hwmon.timer /etc/systemd/system/
   ```
2. Reload the systemd daemon:
   ```bash
   sudo systemctl daemon-reload
   ```
3. Enable and start the timer:
   ```bash
   sudo systemctl enable hwmon.timer
   sudo systemctl start hwmon.timer
   ```

### One-liner (run as root)

```bash
curl -o /etc/systemd/system/hwmon.service http://10.10.10.110:3000/JWS/hwmonDaemon/raw/branch/main/hwmon.service && curl -o /etc/systemd/system/hwmon.timer http://10.10.10.110:3000/JWS/hwmonDaemon/raw/branch/main/hwmon.timer && systemctl daemon-reload && systemctl enable hwmon.timer && systemctl start hwmon.timer
```

## Manual Execution

1. Run the daemon in dry-run mode to test:
   ```bash
   python3 hwmonDaemon.py --dry-run
   ```
2. Run the daemon normally:
   ```bash
   python3 hwmonDaemon.py
   ```

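
As a rough illustration of how a `--dry-run` flag can gate side effects, the sketch below parses the flag and skips ticket creation when it is set. This is not the daemon's actual code: `parse_args` and `handle_issue` are hypothetical names, and the assumption that dry-run mode only suppresses ticket creation is ours, not something this README states.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line flags; only --dry-run is taken from this README."""
    parser = argparse.ArgumentParser(description="System health monitoring daemon")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="report detected issues without creating tickets (assumed behavior)",
    )
    return parser.parse_args(argv)


def handle_issue(issue: str, dry_run: bool) -> str:
    """Return the action that would be taken for an issue (hypothetical helper)."""
    if dry_run:
        return f"[dry-run] would create ticket: {issue}"
    return f"creating ticket: {issue}"
```

Calling `parse_args(["--dry-run"])` yields a namespace whose `dry_run` attribute is `True`, which can then be passed to whatever code creates tickets.
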
## Configuration

The daemon applies the following checks and behaviors:

- Disk usage (warns at 80%, critical at 90%)
- LXC storage usage (warns at 80%, critical at 90%)
- Memory usage (warns at 80%)
- CPU usage (warns at 95%)
- Network connectivity to management (10.10.10.1) and Ceph (10.10.90.1) networks
- SMART status of physical drives with manufacturer-specific profiles
- Temperature monitoring (warns at 65°C)
- Automatic duplicate ticket prevention
- Enhanced logging with debug capabilities

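
The numeric thresholds above can be sketched as a small lookup table and classifier. This is a minimal illustration rather than the daemon's implementation: `THRESHOLDS` and `classify` are hypothetical names, and treating each threshold as inclusive (`>=`) is an assumption.

```python
from typing import Optional

# Hypothetical threshold table; the values mirror the list above.
# A missing "critical" entry means only a warning level is documented.
THRESHOLDS = {
    "disk": {"warning": 80, "critical": 90},
    "lxc_storage": {"warning": 80, "critical": 90},
    "memory": {"warning": 80},
    "cpu": {"warning": 95},
    "temperature_c": {"warning": 65},
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return "critical", "warning", or None for a measured value."""
    levels = THRESHOLDS[metric]
    critical = levels.get("critical")
    # Assumption: a value equal to the threshold already triggers the level.
    if critical is not None and value >= critical:
        return "critical"
    if value >= levels["warning"]:
        return "warning"
    return None
```

For example, `classify("disk", 85)` falls between the two disk thresholds and returns `"warning"`, while `classify("memory", 50)` returns `None`.
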
## Data Storage

The daemon creates and maintains:

- **Log Directory**: `/var/log/hwmonDaemon/`
- **Historical SMART Data**: JSON files for trend analysis
- **Data Retention**: 30 days of historical monitoring data

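
One way to enforce a 30-day retention window on JSON history files is to delete files by modification time, as sketched below. The function name `prune_old_history`, the `history_dir` parameter, and the use of file mtime as the age criterion are all assumptions for illustration; this README does not say where the JSON files live or how the daemon expires them.

```python
import time
from pathlib import Path
from typing import List, Optional

RETENTION_DAYS = 30  # retention period stated above


def prune_old_history(history_dir: Path, now: Optional[float] = None) -> List[Path]:
    """Delete *.json files older than the retention window; return what was removed."""
    now = time.time() if now is None else now
    cutoff = now - RETENTION_DAYS * 86400  # seconds per day
    removed = []
    for path in sorted(history_dir.glob("*.json")):
        # Assumption: file modification time reflects when the sample was recorded.
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Passing `now` explicitly makes the function easy to exercise in tests without waiting for real time to pass.
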
## Ticket Creation

The daemon automatically creates tickets with:

- Standardized titles including hostname, hardware type, and scope
- Detailed descriptions of detected issues with drive specifications
- Priority levels based on severity (P2-P4)
- Proper categorization and status tracking
- Executive summaries and technical analysis

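
To make the ticket fields above concrete, here is a hedged sketch of assembling a ticket and checking for duplicates. Everything specific in it is hypothetical: the bracketed title layout, the severity-to-priority mapping, the payload field names, and the title-based duplicate check are illustrative choices, since this README only states which elements a ticket contains.

```python
import socket
from typing import Optional, Set

# Hypothetical mapping; the README only states that priorities range from P2 to P4.
PRIORITY_BY_SEVERITY = {"critical": "P2", "warning": "P3", "info": "P4"}


def build_ticket(issue: str, severity: str, hardware_type: str, scope: str,
                 hostname: Optional[str] = None) -> dict:
    """Assemble a ticket payload; field names and title layout are illustrative."""
    hostname = hostname or socket.gethostname()
    return {
        "title": f"[{hostname}][{hardware_type}][{scope}] {issue}",
        "description": issue,
        "priority": PRIORITY_BY_SEVERITY[severity],
    }


def is_duplicate(ticket: dict, open_titles: Set[str]) -> bool:
    """Assumed rule: a ticket is a duplicate if an open ticket has the same title."""
    return ticket["title"] in open_titles
```

A standardized title makes this kind of duplicate check cheap, because two runs that detect the same issue on the same host produce identical titles.
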
## Dependencies

- Python 3
- Required Python packages:
  - psutil
  - requests
- System tools:
  - smartmontools (for SMART disk monitoring)
  - nvme-cli (for NVMe drive monitoring)

## Excluded Paths

The following paths are automatically excluded from monitoring:

- `/media/*`
- `/mnt/pve/mediafs/*`
- `/opt/metube_downloads`
- Pattern-based exclusions for media and download directories

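
The explicit paths above can be matched with shell-style globs, for example using Python's standard `fnmatch` module. This is a sketch under assumptions: `EXCLUDED_PATTERNS` and `is_excluded` are hypothetical names, and the additional pattern-based exclusions are left out because this README does not specify them.

```python
from fnmatch import fnmatch

# Patterns copied from the list above. The unspecified pattern-based
# exclusions for media and download directories are intentionally omitted.
EXCLUDED_PATTERNS = ["/media/*", "/mnt/pve/mediafs/*", "/opt/metube_downloads"]


def is_excluded(mountpoint: str) -> bool:
    """Return True if the mountpoint matches any excluded pattern."""
    return any(fnmatch(mountpoint, pattern) for pattern in EXCLUDED_PATTERNS)
```

Note that with `fnmatch`, `*` also matches `/`, so `/media/*` covers nested paths such as `/media/usb0/backup`, while the bare directory `/media` itself is not matched.
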
## Service Configuration

The daemon runs:

- Daily via systemd timer
- As root user for hardware access
- With automatic restart on failure

## Troubleshooting

```bash
# View service logs
sudo journalctl -u hwmon.service -f

# Check timer status
sudo systemctl status hwmon.timer

# Manual test run
python3 hwmonDaemon.py --dry-run
```

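## Security Note

The service downloads and executes code from a specified URL, so restrict who can modify the content served there and limit network access to that host. The one-liner installer above also fetches the unit files over plain HTTP, which is unencrypted and unauthenticated.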