data retention and large refactor of codebase
47
README.md
@@ -13,6 +13,10 @@ A robust system health monitoring daemon that tracks hardware status and automat
- Configurable thresholds and monitoring parameters
- Dry-run mode for testing
- Systemd integration for automated daily checks
- LXC container storage monitoring
- Historical trend analysis for predictive failure detection
- Manufacturer-specific SMART attribute interpretation
- ECC memory error detection

## Installation

@@ -53,19 +57,33 @@ python3 hwmonDaemon.py
The daemon monitors:

- Disk usage (warns at 80%, critical at 90%)
- LXC storage usage (warns at 80%, critical at 90%)
- Memory usage (warns at 80%)
- CPU usage (warns at 80%)
- CPU usage (warns at 95%)
- Network connectivity to management (10.10.10.1) and Ceph (10.10.90.1) networks
- SMART status of physical drives
- SMART status of physical drives with manufacturer-specific profiles
- Temperature monitoring (warns at 65°C)
- Automatic duplicate ticket prevention
- Enhanced logging with debug capabilities

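The thresholds in the list above can be read as a small lookup table. The following is a minimal sketch of that idea, not the daemon's actual code: the `THRESHOLDS` table, the metric names, and the `classify` helper are hypothetical, and it assumes a reading triggers a level when it is greater than or equal to the threshold. The diff lists CPU at both 80% and 95%; the sketch assumes 95% is the current value.

```python
from typing import Optional

# Values come from the README list; the table layout and key names are
# illustrative assumptions, not the daemon's real configuration.
THRESHOLDS = {
    "disk": {"warning": 80.0, "critical": 90.0},
    "lxc_storage": {"warning": 80.0, "critical": 90.0},
    "memory": {"warning": 80.0},
    "cpu": {"warning": 95.0},  # assumed: 95% supersedes the 80% line
    "temperature_c": {"warning": 65.0},
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return "critical", "warning", or None for a single reading."""
    levels = THRESHOLDS[metric]
    # Check the stricter level first so a critical reading is not
    # reported as a mere warning.
    if "critical" in levels and value >= levels["critical"]:
        return "critical"
    if value >= levels["warning"]:
        return "warning"
    return None
```

For example, `classify("disk", 85.0)` yields `"warning"`, while `classify("cpu", 90.0)` yields `None` under the assumed 95% CPU threshold.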
## Data Storage

The daemon creates and maintains:

- **Log Directory**: `/var/log/hwmonDaemon/`
- **Historical SMART Data**: JSON files for trend analysis
- **Data Retention**: 30 days of historical monitoring data

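A 30-day retention policy over JSON history files could be enforced with a periodic prune. The sketch below is only one way to do it and makes several assumptions: the README does not say where the history files live or how their age is determined, so the directory is a parameter, age is taken from the file modification time, and `prune_history` is a hypothetical helper name.

```python
import time
from pathlib import Path
from typing import List, Optional

RETENTION_DAYS = 30  # retention window stated in the README


def prune_history(history_dir: Path,
                  retention_days: int = RETENTION_DAYS,
                  now: Optional[float] = None) -> List[Path]:
    """Delete *.json files older than the retention window.

    Age is judged by modification time (an assumption). Returns the
    list of paths that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds per day
    removed = []
    for path in sorted(history_dir.glob("*.json")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Passing `now` explicitly keeps the function easy to test; in normal use it can be omitted so the current time is used.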
## Ticket Creation

The daemon automatically creates tickets with:

- Standardized titles including hostname, hardware type, and scope
- Detailed descriptions of detected issues
- Detailed descriptions of detected issues with drive specifications
- Priority levels based on severity (P2-P4)
- Proper categorization and status tracking
- Executive summaries and technical analysis

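Standardized titles also make duplicate prevention straightforward: if an open ticket already has the same title, a new one can be skipped. The sketch below illustrates that idea only. The title format, the severity-to-priority mapping, and both helper names are hypothetical; the README states just that titles include hostname, hardware type, and scope, and that priorities range from P2 to P4.

```python
from typing import Iterable

# Assumed mapping; the README only gives the range P2-P4.
PRIORITY_BY_SEVERITY = {"critical": "P2", "warning": "P3", "info": "P4"}


def build_title(hostname: str, hardware_type: str, scope: str, issue: str) -> str:
    """Compose a title from the fields the README lists (format is hypothetical)."""
    return f"[{hostname}][{hardware_type}][{scope}] {issue}"


def should_create_ticket(title: str, open_titles: Iterable[str]) -> bool:
    """Return False when an open ticket already carries the same title.

    Comparison ignores case and surrounding whitespace.
    """
    normalized = title.strip().lower()
    return all(normalized != existing.strip().lower() for existing in open_titles)
```

A caller would fetch the titles of currently open tickets from the ticketing system (not shown here) and create a ticket only when `should_create_ticket` returns `True`.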
## Dependencies

@@ -73,7 +91,17 @@ The daemon automatically creates tickets with:
- Required Python packages:
  - psutil
  - requests
- System tools:
  - smartmontools (for SMART disk monitoring)
  - nvme-cli (for NVMe drive monitoring)

## Excluded Paths

The following paths are automatically excluded from monitoring:
- `/media/*`
- `/mnt/pve/mediafs/*`
- `/opt/metube_downloads`
- Pattern-based exclusions for media and download directories

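The exclusions above can be expressed as shell-style patterns checked against each mount point. This is a sketch under assumptions rather than the daemon's implementation: `EXCLUDED_PATTERNS` and `is_excluded` are hypothetical names, only the three explicit paths from the list are included, and matching uses `fnmatch.fnmatchcase`, whose `*` also matches `/`, so nested paths under an excluded directory match too.

```python
from fnmatch import fnmatchcase

# The three explicit entries from the README; the additional
# "pattern-based exclusions" are not specified there and are omitted.
EXCLUDED_PATTERNS = [
    "/media/*",
    "/mnt/pve/mediafs/*",
    "/opt/metube_downloads",
]


def is_excluded(mountpoint: str) -> bool:
    """Return True if the mount point matches any excluded pattern."""
    # Drop a trailing slash so "/opt/metube_downloads/" still matches,
    # while keeping the root path "/" intact.
    path = mountpoint.rstrip("/") or "/"
    return any(fnmatchcase(path, pattern) for pattern in EXCLUDED_PATTERNS)
```

Note that `/media/*` covers entries below `/media` but not `/media` itself; whether the daemon treats the parent directory as excluded is not stated in the README.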
## Service Configuration

@@ -83,6 +111,19 @@ The daemon runs:
- As root user for hardware access
- With automatic restart on failure

## Troubleshooting

```bash
# View service logs
sudo journalctl -u hwmon.service -f

# Check service status
sudo systemctl status hwmon.timer

# Manual test run
python3 hwmonDaemon.py --dry-run
```

## Security Note

Ensure proper network security measures are in place as the service downloads and executes code from a specified URL.