90dd8f3390fe472ad503307b2083590ab3a949c4
Investigated all 7 pending drive tickets in the ticketing DB. Identified 3 confirmed false positives and 1 parsing bug. Implemented manufacturer-specific SMART profiles and a systemic substring-match fix.

Changes:
- Seagate: disable Seek_Error_Rate (packed counter), add High_Fly_Writes profile threshold (100/500 vs the old 1/5), disable Command_Timeout (packed 3-part 48-bit format on Exos series)
- Western Digital: disable Command_Timeout (same packed format)
- Toshiba: new profile covering MG04-MG10 enterprise and MQ01-MQ04 consumer series; disable Raw/Seek counters, keep Command_Timeout with raised thresholds (1000/5000) since MG-series uses a real simple count; add model-prefix detection so MG08ACP16TE etc. match without "TOSHIBA" in the model string
- OOS: add OOS14000G alias (fleet has both 12TB and 14TB variants); replace billion-scale Command_Timeout threshold with monitor:False
- Samsung: disable Program_Fail_Cnt_Total (attr 181, vendor-encoded), Erase_Fail_Count_Chip (attrs 172/176, chip-level internal counter), Program_Fail_Count_Chip (attr 171); disable generic Erase_Fail_Count and Program_Fail_Count to prevent bleed-through from _Chip lines

Bug fixes:
- Fix substring match: 'Erase_Fail_Count' was matching 'Erase_Fail_Count_Chip' lines in both the first-pass and main attribute loops. Changed to token-boundary check (attr + ' ') in both places.
- Add 32-bit overflow guard: raw SMART values > 0xFFFFFFFF are skipped at threshold comparison. Catches 0xFFFFFFFFFFFF sentinel values from unrecognized drives (was generating Critical Program_Fail_Cnt_Total tickets with value 281474976710655).

BASE_SMART_THRESHOLDS:
- High_Fly_Writes: 1/5 -> 100/500
- Program_Fail_Cnt_Total: 1/5 -> 50/200
- Erase_Fail_Count_Total: 1/5 -> 50/200

Global filtered_issues: removed Seek_Error_Rate and Command_Timeout (now handled per-profile); Raw_Read_Error_Rate kept as catch-all.

Verified with --dry-run on all 4 servers: compute-storage-01, large1, compute-storage-gpu-01, pbs. Only legitimate issues surface.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
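The two bug fixes in the commit message above can be illustrated with a minimal sketch. It is not the daemon's actual code: the helper names `line_matches_attr` and `should_compare`, the constant `MAX_RAW`, and the sample `smartctl`-style lines are assumptions made for illustration only.

```python
MAX_RAW = 0xFFFFFFFF  # hypothetical name; raw values above this are skipped


def line_matches_attr(line: str, attr: str) -> bool:
    """Return True only if `attr` is followed by a space in `line`.

    A plain `attr in line` test lets 'Erase_Fail_Count' match a line for
    'Erase_Fail_Count_Chip'. Appending a space guards the right-hand
    boundary of the attribute name (it does not guard the left-hand one).
    """
    return (attr + ' ') in line


def should_compare(raw_value: int) -> bool:
    """Return False for raw values that do not fit in 32 bits."""
    return raw_value <= MAX_RAW


# Illustrative attribute lines, loosely modelled on `smartctl -A` output.
chip_line = "176 Erase_Fail_Count_Chip 0x0032 100 100 000 Old_age Always - 3"
plain_line = "182 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0"

print('Erase_Fail_Count' in chip_line)                     # old substring test
print(line_matches_attr(chip_line, 'Erase_Fail_Count'))    # token-boundary test
print(line_matches_attr(plain_line, 'Erase_Fail_Count'))
print(should_compare(0xFFFFFFFFFFFF))                      # 48-bit sentinel
```

The 48-bit sentinel `0xFFFFFFFFFFFF` equals 281474976710655, the value quoted in the commit message, so the guard rejects it before any threshold comparison takes place.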
System Health Monitoring Daemon
A robust system health monitoring daemon that tracks hardware status and automatically creates tickets for detected issues.
Features
- Comprehensive system health monitoring:
- Drive health (SMART status and disk usage)
- Memory usage
- CPU utilization
- Network connectivity (Management and Ceph networks)
- Automatic ticket creation for detected issues
- Configurable thresholds and monitoring parameters
- Dry-run mode for testing
- Systemd integration for automated periodic checks (see Service Configuration for the schedule)
- LXC container storage monitoring
- Historical trend analysis for predictive failure detection
- Manufacturer-specific SMART attribute interpretation
- ECC memory error detection
Installation
- Copy the service and timer files to systemd:
sudo cp hwmon.service /etc/systemd/system/
sudo cp hwmon.timer /etc/systemd/system/
- Reload systemd daemon:
sudo systemctl daemon-reload
- Enable and start the timer:
sudo systemctl enable hwmon.timer
sudo systemctl start hwmon.timer
One liner (run as root)
curl -o /etc/systemd/system/hwmon.service http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmon.service && curl -o /etc/systemd/system/hwmon.timer http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmon.timer && systemctl daemon-reload && systemctl enable hwmon.timer && systemctl start hwmon.timer
Manual Execution
Direct Execution (from local file)
- Run the daemon with dry-run mode to test:
python3 hwmonDaemon.py --dry-run
- Run the daemon normally:
python3 hwmonDaemon.py
Remote Execution (same as systemd service)
Execute directly from the repository without saving a local copy:
- Run with dry-run mode to test:
/usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))" --dry-run
- Run normally (creates actual tickets):
/usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))"
Configuration
The daemon monitors:
- Disk usage (warns at 80%, critical at 90%)
- LXC storage usage (warns at 80%, critical at 90%)
- Memory usage (warns at 80%)
- CPU usage (warns at 95%)
- Network connectivity to management (10.10.10.1) and Ceph (10.10.90.1) networks
- SMART status of physical drives with manufacturer-specific profiles
- Temperature (warns at 65°C)
It also provides:
- Automatic duplicate ticket prevention
- Enhanced logging with debug capabilities
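As a rough sketch of how the thresholds listed above could be applied, the following maps a metric reading to a severity. The `THRESHOLDS` table and `classify` function are hypothetical names, and the inclusive comparison (`>=`) is an assumption; the daemon's own comparison logic may differ.

```python
from typing import Dict, Optional, Tuple

# (warning, critical) levels taken from the list above.
# None means no critical level is documented for that metric.
THRESHOLDS: Dict[str, Tuple[float, Optional[float]]] = {
    "disk": (80, 90),
    "lxc_storage": (80, 90),
    "memory": (80, None),
    "cpu": (95, None),
    "temperature_c": (65, None),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a metric reading."""
    warn, crit = THRESHOLDS[metric]
    if crit is not None and value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


print(classify("disk", 85))
print(classify("disk", 92))
print(classify("memory", 99))
```

Keeping the levels in one table makes it easy to see which metrics have only a warning level and which also escalate to critical.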
Data Storage
The daemon creates and maintains:
- Log Directory: /var/log/hwmonDaemon/
- Historical SMART Data: JSON files for trend analysis
- Data Retention: 30 days of historical monitoring data
- Storage Limit: Automatically enforced 10MB maximum
- Cleanup: Oldest files deleted first when limit exceeded
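The retention and cleanup rules above could be implemented along these lines. This is a sketch under stated assumptions: `enforce_limits` is a hypothetical helper, file age is judged by modification time, history files are assumed to end in `.json`, and 10MB is taken to mean 10 * 1024 * 1024 bytes.

```python
import time
from pathlib import Path
from typing import List

MAX_BYTES = 10 * 1024 * 1024      # assumed meaning of the 10MB limit
MAX_AGE_SECONDS = 30 * 24 * 3600  # 30 days of retention


def enforce_limits(directory: Path, max_bytes: int = MAX_BYTES,
                   max_age: float = MAX_AGE_SECONDS) -> List[Path]:
    """Delete expired files, then the oldest files until under max_bytes.

    Returns the removed paths in deletion order.
    """
    removed: List[Path] = []
    now = time.time()
    # Oldest first, by modification time.
    files = sorted((p for p in directory.glob("*.json") if p.is_file()),
                   key=lambda p: p.stat().st_mtime)
    kept: List[Path] = []
    for path in files:
        if now - path.stat().st_mtime > max_age:
            path.unlink()
            removed.append(path)
        else:
            kept.append(path)
    total = sum(p.stat().st_size for p in kept)
    for path in kept:
        if total <= max_bytes:
            break
        total -= path.stat().st_size  # read the size before deleting
        path.unlink()
        removed.append(path)
    return removed
```

Calling `enforce_limits(Path("/var/log/hwmonDaemon"))` after each run would keep the directory within both limits, provided the process has write access to it.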
Ticket Creation
The daemon automatically creates tickets with:
- Standardized titles including hostname, hardware type, and scope
- Detailed descriptions of detected issues with drive specifications
- Priority levels based on severity (P2-P4)
- Proper categorization and status tracking
- Executive summaries and technical analysis
Dependencies
- Python 3
- Required Python packages:
- psutil
- requests
- System tools:
- smartmontools (for SMART disk monitoring)
- nvme-cli (for NVMe drive monitoring)
Excluded Paths
The following paths are automatically excluded from monitoring:
- /media/*
- /mnt/pve/mediafs/*
- /opt/metube_downloads
- Pattern-based exclusions for media and download directories
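One way to apply pattern-based exclusions like those above is shell-style glob matching from the standard library. The sketch below assumes the three patterns `/media/*`, `/mnt/pve/mediafs/*`, and `/opt/metube_downloads`; `EXCLUDED_PATTERNS` and `is_excluded` are hypothetical names, and the daemon may match paths differently.

```python
from fnmatch import fnmatchcase

# Patterns assumed from the exclusion list above.
EXCLUDED_PATTERNS = [
    "/media/*",
    "/mnt/pve/mediafs/*",
    "/opt/metube_downloads",
]


def is_excluded(mountpoint: str) -> bool:
    """Return True if the mountpoint matches any excluded pattern.

    fnmatchcase is case-sensitive on every platform, and its `*` also
    matches '/' characters, so nested paths under /media are excluded too.
    """
    return any(fnmatchcase(mountpoint, pattern) for pattern in EXCLUDED_PATTERNS)


for candidate in ["/media/usb0", "/mnt/pve/mediafs/movies", "/var/lib/vz"]:
    print(candidate, is_excluded(candidate))
```

Note that `/media/*` does not match the bare directory `/media` itself, only paths beneath it.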
Service Configuration
The daemon runs:
- Hourly via systemd timer (with 60-second randomized delay)
- As root user for hardware access
- With automatic restart on failure
- 5-minute timeout for execution
- Logs to systemd journal
Recent Improvements
Version 2.0 (January 2026):
- ✅ Added 10MB storage limit with automatic cleanup
- ✅ File locking to prevent race conditions
- ✅ Disabled monitoring for unreliable Ridata drives
- ✅ Added timeouts to all network/subprocess calls (10s API, 30s subprocess)
- ✅ Fixed unchecked regex patterns
- ✅ Improved error handling throughout
- ✅ Enhanced systemd service configuration with restart policies
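The 30-second subprocess timeout mentioned above can be sketched with `subprocess.run`, which kills the child and raises `TimeoutExpired` when the limit passes. `run_with_timeout` and `SUBPROCESS_TIMEOUT` are hypothetical names, and returning `None` on failure is an assumed convention rather than the daemon's documented behaviour.

```python
import subprocess
import sys
from typing import List, Optional

SUBPROCESS_TIMEOUT = 30  # seconds, matching the figure in the list above


def run_with_timeout(cmd: List[str],
                     timeout: float = SUBPROCESS_TIMEOUT) -> Optional[str]:
    """Run cmd and return its stdout, or None on timeout or launch failure.

    The exit code is deliberately not checked here: tools such as smartctl
    can exit non-zero while still printing usable output.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return None
    return result.stdout


# A quick command completes; a slow one is cut off by the timeout.
print(run_with_timeout([sys.executable, "-c", "print('ok')"]))
print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"],
                       timeout=0.5))
```

Bounding each call keeps a hung drive or unresponsive tool from consuming the whole 5-minute execution window of the service.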
Troubleshooting
# View service logs
sudo journalctl -u hwmon.service -f
# Check service status
sudo systemctl status hwmon.timer
# Manual test run
python3 hwmonDaemon.py --dry-run
Security Note
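The systemd service and the remote-execution commands above download hwmonDaemon.py over plain HTTP and execute it as root without verifying its integrity. Restrict access to the repository host and to the network path leading to it, so that only trusted parties can modify the code that is served.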
CI
| Workflow | Purpose | Triggers |
|---|---|---|
| lint.yml | flake8 on all .py files | Every push and PR |
| security.yml | bandit -ll (medium+ severity) | Every push, PR, and weekly Monday 6am |
Branch protection is enabled on main — the lint check must pass before any PR can merge.
Lint config: .flake8 (max-line-length 120, F841/E501 ignored).
Description
A Python-based system health monitoring daemon that automatically tracks hardware status and creates tickets for detected issues in the LotusGuild Cluster.