jared 841db13459 Fix false positive ticket creation for manufacturer operation counters

Problem: Seagate drives were triggering tickets for "Critical Seek_Error_Rate"
and "Critical Command_Timeout" even though these are operation counters used by
the manufacturer, not actual errors.

Solution: Added filtering in _detect_issues() method to skip known manufacturer
operation counters:
- Seek_Error_Rate (Seagate/WD operation counter)
- Command_Timeout (OOS/Seagate operation counter)
- Raw_Read_Error_Rate (Seagate/WD operation counter)

These attributes are already correctly excluded from monitoring in manufacturer
profiles, but were still appearing in smart_issues list. This fix prevents them
from creating tickets while still catching legitimate SMART errors.
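
As a rough illustration of the filtering described above, here is a minimal sketch. It is not the code from hwmonDaemon.py: the set name, the helper function, and the assumption that each entry in smart_issues is a plain string containing the attribute name are all hypothetical.

```python
import logging

logger = logging.getLogger("hwmon")

# Attribute names come from the commit message; the set name is hypothetical.
MANUFACTURER_OPERATION_COUNTERS = {
    "Seek_Error_Rate",
    "Command_Timeout",
    "Raw_Read_Error_Rate",
}


def filter_operation_counters(smart_issues):
    """Return smart_issues without entries that mention a known operation counter.

    Assumes each issue is a string that contains the SMART attribute name.
    """
    kept = []
    for issue in smart_issues:
        if any(name in issue for name in MANUFACTURER_OPERATION_COUNTERS):
            # Debug log so filtered entries remain visible when troubleshooting.
            logger.debug("Skipping manufacturer operation counter: %s", issue)
            continue
        kept.append(issue)
    return kept


issues = [
    "Critical Seek_Error_Rate: 12345678",
    "Critical Reallocated_Sector_Ct: 12",
    "Critical Command_Timeout: 4",
]
print(filter_operation_counters(issues))
```

Only the Reallocated_Sector_Ct entry survives, so a legitimate SMART error would still reach ticket creation.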

Changes:
- hwmonDaemon.py:1351-1378 - Added operation counter filtering in _detect_issues()
- Added debug logging when filtering manufacturer counters

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-06 17:00:32 -05:00

hwmon.service

Fix critical reliability and security issues in hwmonDaemon

2026-01-06 16:55:48 -05:00

hwmon.timer

Fix critical reliability and security issues in hwmonDaemon

2026-01-06 16:55:48 -05:00

hwmonDaemon.py

Fix false positive ticket creation for manufacturer operation counters

2026-01-06 17:00:32 -05:00

README.md

Update README with hourly execution schedule and recent improvements

2026-01-06 16:57:16 -05:00

README.md

System Health Monitoring Daemon

A robust system health monitoring daemon that tracks hardware status and automatically creates tickets for detected issues.

Features

Comprehensive system health monitoring:
- Drive health (SMART status and disk usage)
- Memory usage
- CPU utilization
- Network connectivity (Management and Ceph networks)
Automatic ticket creation for detected issues
Configurable thresholds and monitoring parameters
Dry-run mode for testing
Systemd integration for automated hourly checks
LXC container storage monitoring
Historical trend analysis for predictive failure detection
Manufacturer-specific SMART attribute interpretation
ECC memory error detection

Installation

Copy the service and timer files to systemd:

sudo cp hwmon.service /etc/systemd/system/
sudo cp hwmon.timer /etc/systemd/system/

Reload systemd daemon:

sudo systemctl daemon-reload

Enable and start the timer:

sudo systemctl enable hwmon.timer
sudo systemctl start hwmon.timer

One liner (run as root)

curl -o /etc/systemd/system/hwmon.service http://10.10.10.110:3000/JWS/hwmonDaemon/raw/branch/main/hwmon.service && curl -o /etc/systemd/system/hwmon.timer http://10.10.10.110:3000/JWS/hwmonDaemon/raw/branch/main/hwmon.timer && systemctl daemon-reload && systemctl enable hwmon.timer && systemctl start hwmon.timer

Manual Execution

Run the daemon in dry-run mode to test:

python3 hwmonDaemon.py --dry-run

Run the daemon normally:

python3 hwmonDaemon.py

Configuration

The daemon monitors:

Disk usage (warns at 80%, critical at 90%)
LXC storage usage (warns at 80%, critical at 90%)
Memory usage (warns at 80%)
CPU usage (warns at 95%)
Network connectivity to management (10.10.10.1) and Ceph (10.10.90.1) networks
SMART status of physical drives with manufacturer-specific profiles
Temperature monitoring (warns at 65°C)

It also prevents duplicate tickets automatically and provides enhanced logging with debug output.
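
The thresholds listed above could be expressed as a small lookup table. This is a sketch only: the dictionary layout, the `classify` helper, and the choice of "greater than or equal" comparisons are assumptions, not taken from the daemon's source.

```python
# Threshold values come from the Configuration list; names and layout are assumptions.
THRESHOLDS = {
    "disk": {"warning": 80, "critical": 90},      # percent used
    "lxc": {"warning": 80, "critical": 90},       # percent used
    "memory": {"warning": 80},                    # percent used
    "cpu": {"warning": 95},                       # percent utilization
    "temperature": {"warning": 65},               # degrees Celsius
}


def classify(metric, value):
    """Map a reading to 'ok', 'warning', or 'critical' for the given metric."""
    levels = THRESHOLDS[metric]
    if "critical" in levels and value >= levels["critical"]:
        return "critical"
    if value >= levels["warning"]:
        return "warning"
    return "ok"


print(classify("disk", 85))   # between the 80% and 90% thresholds
print(classify("cpu", 90))    # below the 95% threshold
```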

Data Storage

The daemon creates and maintains:

Log Directory: /var/log/hwmonDaemon/
Historical SMART Data: JSON files for trend analysis
Data Retention: 30 days of historical monitoring data
Storage Limit: Automatically enforced 10MB maximum
Cleanup: Oldest files deleted first when limit exceeded
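
One way the retention and size rules above could be enforced is sketched below. The function name, the use of file modification time to decide age, and the `*.json` glob are assumptions; the daemon's actual cleanup logic may differ.

```python
import time
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024          # 10MB storage limit
RETENTION_SECONDS = 30 * 24 * 3600    # 30 days of history


def enforce_storage_limits(directory, max_bytes=MAX_BYTES,
                           retention=RETENTION_SECONDS, now=None):
    """Delete expired history files, then the oldest ones until under max_bytes.

    Returns the names of the deleted files in deletion order.
    """
    now = time.time() if now is None else now
    # Sort oldest first so cleanup removes the oldest files before newer ones.
    files = sorted(
        (p for p in Path(directory).glob("*.json") if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    removed, remaining = [], []
    for path in files:
        if now - path.stat().st_mtime > retention:
            path.unlink()
            removed.append(path.name)
        else:
            remaining.append(path)

    total = sum(p.stat().st_size for p in remaining)
    for path in remaining:
        if total <= max_bytes:
            break
        total -= path.stat().st_size   # read the size before deleting
        path.unlink()
        removed.append(path.name)
    return removed
```

Calling `enforce_storage_limits("/var/log/hwmonDaemon/")` after each run would keep the directory within both limits under these assumptions.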

Ticket Creation

The daemon automatically creates tickets with:

Standardized titles including hostname, hardware type, and scope
Detailed descriptions of detected issues with drive specifications
Priority levels based on severity (P2-P4)
Proper categorization and status tracking
Executive summaries and technical analysis

Dependencies

Python 3
Required Python packages:
- psutil
- requests
System tools:
- smartmontools (for SMART disk monitoring)
- nvme-cli (for NVMe drive monitoring)

Excluded Paths

The following paths are automatically excluded from monitoring:

/media/*
/mnt/pve/mediafs/*
/opt/metube_downloads
Pattern-based exclusions for media and download directories
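
A minimal sketch of how such exclusions might be checked, using shell-style patterns from the standard library. The pattern list mirrors the paths above; the helper name is hypothetical, and the additional pattern-based exclusions are not reproduced because the README does not spell them out.

```python
from fnmatch import fnmatchcase

# Paths listed above; the daemon may apply further patterns not shown here.
EXCLUDED_PATTERNS = [
    "/media/*",
    "/mnt/pve/mediafs/*",
    "/opt/metube_downloads",
]


def is_excluded(mountpoint):
    """Return True if the mountpoint matches any excluded pattern."""
    return any(fnmatchcase(mountpoint, pattern) for pattern in EXCLUDED_PATTERNS)


for path in ["/media/usb0", "/opt/metube_downloads", "/var/lib/vz"]:
    print(path, is_excluded(path))
```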

Service Configuration

The daemon runs:

Hourly via systemd timer (with 60-second randomized delay)
As root user for hardware access
With automatic restart on failure
5-minute timeout for execution
Logs to systemd journal

Recent Improvements

Version 2.0 (January 2026):

✅ Added 10MB storage limit with automatic cleanup
✅ File locking to prevent race conditions
✅ Disabled monitoring for unreliable Ridata drives
✅ Added timeouts to all network/subprocess calls (10s API, 30s subprocess)
✅ Fixed unchecked regex patterns
✅ Improved error handling throughout
✅ Enhanced systemd service configuration with restart policies
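
To illustrate the subprocess timeout mentioned above, here is a hedged sketch of a wrapper with a 30-second limit. The helper name and the choice to return None on failure are assumptions, not the daemon's actual behaviour.

```python
import subprocess
import sys

SUBPROCESS_TIMEOUT = 30  # seconds, matching the subprocess timeout listed above


def run_with_timeout(cmd, timeout=SUBPROCESS_TIMEOUT):
    """Run cmd and return its stdout, or None if it cannot start or times out.

    The exit code is deliberately not checked here, since some tools signal
    status through non-zero exit codes while still producing usable output.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    return result.stdout


# Stand-in command so the example runs without smartmontools installed.
print(run_with_timeout([sys.executable, "-c", "print('ok')"]))
```

A hung tool then costs at most one timeout interval instead of blocking the whole run.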

Troubleshooting

# View service logs
sudo journalctl -u hwmon.service -f

# Check service status
sudo systemctl status hwmon.timer

# Manual test run
python3 hwmonDaemon.py --dry-run

Security Note

Ensure proper network security measures are in place as the service downloads and executes code from a specified URL.