jared 90dd8f3390
fix: calibrate SMART thresholds per manufacturer to eliminate false positives
Investigated all 7 pending drive tickets in the ticketing DB. Identified
3 confirmed false positives and 1 parsing bug. Implemented manufacturer-
specific SMART profiles and a systemic substring-match fix.

Changes:
- Seagate: disable Seek_Error_Rate (packed counter), add High_Fly_Writes
  profile threshold (100/500 vs the old 1/5), disable Command_Timeout
  (packed 3-part 48-bit format on Exos series)
- Western Digital: disable Command_Timeout (same packed format)
- Toshiba: new profile covering MG04-MG10 enterprise and MQ01-MQ04
  consumer series; disable Raw/Seek counters, keep Command_Timeout with
  raised thresholds (1000/5000) since MG-series uses a real simple count;
  add model-prefix detection so MG08ACP16TE etc. match without "TOSHIBA"
  in the model string
- OOS: add OOS14000G alias (fleet has both 12TB and 14TB variants);
  replace billion-scale Command_Timeout threshold with monitor:False
- Samsung: disable Program_Fail_Cnt_Total (attr 181, vendor-encoded),
  Erase_Fail_Count_Chip (attrs 172/176, chip-level internal counter),
  Program_Fail_Count_Chip (attr 171); disable generic Erase_Fail_Count
  and Program_Fail_Count to prevent bleed-through from _Chip lines

Bug fixes:
- Fix substring match: 'Erase_Fail_Count' was matching
  'Erase_Fail_Count_Chip' lines in both the first-pass and main attribute
  loops. Changed to token-boundary check (attr + ' ') in both places.
- Add 32-bit overflow guard: raw SMART values > 0xFFFFFFFF are skipped
  at threshold comparison. Catches 0xFFFFFFFFFFFF sentinel values from
  unrecognized drives (was generating Critical Program_Fail_Cnt_Total
  tickets with value 281474976710655).
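The two fixes above can be sketched as follows. This is an illustrative reconstruction, not the daemon's actual code; the helper names `attr_matches` and `raw_value_is_sane` are hypothetical. The token-boundary check relies on smartctl padding attribute names with spaces, which is what the commit's `attr + ' '` fix exploits.

```python
def attr_matches(attr: str, line: str) -> bool:
    """Token-boundary check: 'Erase_Fail_Count' must not match an
    'Erase_Fail_Count_Chip' line. Requiring the attribute name to be
    followed by a space prevents the substring false match."""
    return (attr + " ") in line

def raw_value_is_sane(raw: int) -> bool:
    """Skip raw SMART values above 32 bits before threshold comparison.
    Catches 48-bit sentinel values (0xFFFFFFFFFFFF) reported by
    unrecognized drives."""
    return raw <= 0xFFFFFFFF
```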

BASE_SMART_THRESHOLDS:
- High_Fly_Writes: 1/5 -> 100/500
- Program_Fail_Cnt_Total: 1/5 -> 50/200
- Erase_Fail_Count_Total: 1/5 -> 50/200

Global filtered_issues: removed Seek_Error_Rate and Command_Timeout
(now handled per-profile); Raw_Read_Error_Rate kept as catch-all.
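The profile and threshold changes above could be represented roughly like this. The attribute names and numbers come from this commit message, but the dict shape and the `thresholds_for` helper are illustrative; the daemon's real data structures may differ.

```python
# Base thresholds (warn/crit) after this commit's recalibration.
BASE_SMART_THRESHOLDS = {
    "High_Fly_Writes":        {"warn": 100, "crit": 500},
    "Program_Fail_Cnt_Total": {"warn": 50,  "crit": 200},
    "Erase_Fail_Count_Total": {"warn": 50,  "crit": 200},
}

# Per-manufacturer overrides; monitor:False disables an attribute entirely.
MANUFACTURER_PROFILES = {
    "SEAGATE": {
        "Seek_Error_Rate": {"monitor": False},  # packed counter, not a count
        "Command_Timeout": {"monitor": False},  # packed 48-bit format (Exos)
        "High_Fly_Writes": {"warn": 100, "crit": 500},
    },
    "WDC": {
        "Command_Timeout": {"monitor": False},  # same packed format
    },
    "TOSHIBA": {
        "Command_Timeout": {"warn": 1000, "crit": 5000},  # MG-series: real count
    },
}

def thresholds_for(vendor: str, attr: str) -> dict:
    """Profile overrides win; otherwise fall back to the base table."""
    profile = MANUFACTURER_PROFILES.get(vendor.upper(), {})
    return profile.get(attr, BASE_SMART_THRESHOLDS.get(attr, {}))
```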

Verified with --dry-run on all 4 servers: compute-storage-01, large1,
compute-storage-gpu-01, pbs. Only legitimate issues surface.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 10:09:54 -04:00

System Health Monitoring Daemon


A robust system health monitoring daemon that tracks hardware status and automatically creates tickets for detected issues.

Features

  • Comprehensive system health monitoring:
    • Drive health (SMART status and disk usage)
    • Memory usage
    • CPU utilization
    • Network connectivity (Management and Ceph networks)
  • Automatic ticket creation for detected issues
  • Configurable thresholds and monitoring parameters
  • Dry-run mode for testing
  • Systemd integration for automated daily checks
  • LXC container storage monitoring
  • Historical trend analysis for predictive failure detection
  • Manufacturer-specific SMART attribute interpretation
  • ECC memory error detection

Installation

  1. Copy the service and timer files to systemd:
sudo cp hwmon.service /etc/systemd/system/
sudo cp hwmon.timer /etc/systemd/system/
  2. Reload the systemd daemon:
sudo systemctl daemon-reload
  3. Enable and start the timer:
sudo systemctl enable hwmon.timer
sudo systemctl start hwmon.timer
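For reference, a unit-file sketch consistent with the behavior described in the Service Configuration section (hourly runs with a 60-second randomized delay); the hwmon.timer shipped in this repository is the source of truth:

```ini
# Sketch only — see the repository's hwmon.timer for the actual unit.
[Unit]
Description=Run hwmon daemon hourly

[Timer]
OnCalendar=hourly
RandomizedDelaySec=60
Persistent=true

[Install]
WantedBy=timers.target
```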

One-liner (run as root)

curl -o /etc/systemd/system/hwmon.service http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmon.service && curl -o /etc/systemd/system/hwmon.timer http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmon.timer && systemctl daemon-reload && systemctl enable hwmon.timer && systemctl start hwmon.timer

Manual Execution

Direct Execution (from local file)

  1. Run the daemon with dry-run mode to test:
python3 hwmonDaemon.py --dry-run
  2. Run the daemon normally:
python3 hwmonDaemon.py

Remote Execution (same as systemd service)

Execute directly from repository without downloading:

  1. Run with dry-run mode to test:
/usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))" --dry-run
  2. Run normally (creates actual tickets):
/usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))"
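Passing `--dry-run` after the `-c` snippet works because Python places any trailing arguments into `sys.argv` (with `sys.argv[0]` set to `'-c'`), so the downloaded script can parse its flags normally. A minimal demonstration:

```python
# When Python runs `python3 -c "<code>" --dry-run`, sys.argv inside
# that code is ['-c', '--dry-run'], so flag parsing works as usual.
import subprocess
import sys

out = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv)", "--dry-run"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(out)  # ['-c', '--dry-run']
```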

Configuration

The daemon monitors:

  • Disk usage (warning at 80%, critical at 90%)
  • LXC storage usage (warning at 80%, critical at 90%)
  • Memory usage (warning at 80%)
  • CPU usage (warning at 95%)
  • Drive temperature (warning at 65°C)
  • Network connectivity to the management (10.10.10.1) and Ceph (10.10.90.1) networks
  • SMART status of physical drives, with manufacturer-specific profiles

It also provides automatic duplicate-ticket prevention and enhanced logging with debug capabilities.
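A severity check for the usage thresholds above could look like this. Only the threshold values come from this README; the helper names are illustrative, and the disk check uses the stdlib `shutil` for self-containment (the daemon itself depends on psutil).

```python
import shutil

DISK_WARN, DISK_CRIT = 80, 90    # disk and LXC storage, percent
MEM_WARN = 80
CPU_WARN = 95
TEMP_WARN = 65                   # °C

def classify(percent, warn, crit=None):
    """Map a usage percentage to 'ok' / 'warning' / 'critical'."""
    if crit is not None and percent >= crit:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"

def disk_severity(path="/"):
    """Classify a mountpoint's usage against the disk thresholds."""
    usage = shutil.disk_usage(path)
    return classify(usage.used / usage.total * 100, DISK_WARN, DISK_CRIT)
```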

Data Storage

The daemon creates and maintains:

  • Log Directory: /var/log/hwmonDaemon/
  • Historical SMART Data: JSON files for trend analysis
  • Data Retention: 30 days of historical monitoring data
  • Storage Limit: Automatically enforced 10MB maximum
  • Cleanup: Oldest files deleted first when limit exceeded
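The retention policy above (30-day window, 10 MB cap, oldest-first deletion) could be enforced roughly as follows. The directory path and constants come from this README; the function name and implementation are a sketch, not the daemon's actual code.

```python
import time
from pathlib import Path

LOG_DIR = Path("/var/log/hwmonDaemon")
MAX_BYTES = 10 * 1024 * 1024          # 10 MB storage limit
MAX_AGE_SECONDS = 30 * 24 * 3600      # 30-day retention

def enforce_retention(log_dir: Path = LOG_DIR) -> None:
    # Sort oldest first, so the size cap also deletes the oldest data first.
    files = sorted(log_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)
    now = time.time()
    kept = []
    for path in files:
        if now - path.stat().st_mtime > MAX_AGE_SECONDS:
            path.unlink(missing_ok=True)     # outside the 30-day window
        else:
            kept.append(path)
    total = sum(p.stat().st_size for p in kept)
    for path in kept:                        # still oldest first
        if total <= MAX_BYTES:
            break
        total -= path.stat().st_size
        path.unlink(missing_ok=True)
```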

Ticket Creation

The daemon automatically creates tickets with:

  • Standardized titles including hostname, hardware type, and scope
  • Detailed descriptions of detected issues with drive specifications
  • Priority levels based on severity (P2-P4)
  • Proper categorization and status tracking
  • Executive summaries and technical analysis

Dependencies

  • Python 3
  • Required Python packages:
    • psutil
    • requests
  • System tools:
    • smartmontools (for SMART disk monitoring)
    • nvme-cli (for NVMe drive monitoring)

Excluded Paths

The following paths are automatically excluded from monitoring:

  • /media/*
  • /mnt/pve/mediafs/*
  • /opt/metube_downloads
  • Pattern-based exclusions for media and download directories
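The exclusions above can be expressed with stdlib glob matching. The patterns are the ones listed; the helper name is hypothetical.

```python
from fnmatch import fnmatch

EXCLUDED_PATTERNS = [
    "/media/*",
    "/mnt/pve/mediafs/*",
    "/opt/metube_downloads",
]

def is_excluded(path: str) -> bool:
    """True if the path matches an exclusion pattern (or is the bare
    directory a wildcard pattern covers, e.g. /media itself)."""
    return any(fnmatch(path, pat) or path == pat.rstrip("/*")
               for pat in EXCLUDED_PATTERNS)
```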

Service Configuration

The daemon runs:

  • Hourly, via a systemd timer with a 60-second randomized delay
  • As root, for hardware access
  • With automatic restart on failure
  • With a 5-minute execution timeout
  • With output logged to the systemd journal

Recent Improvements

Version 2.0 (January 2026):

  • Added 10MB storage limit with automatic cleanup
  • File locking to prevent race conditions
  • Disabled monitoring for unreliable Ridata drives
  • Added timeouts to all network/subprocess calls (10s API, 30s subprocess)
  • Fixed unchecked regex patterns
  • Improved error handling throughout
  • Enhanced systemd service configuration with restart policies

Troubleshooting

# View service logs
sudo journalctl -u hwmon.service -f

# Check service status
sudo systemctl status hwmon.timer

# Manual test run
python3 hwmonDaemon.py --dry-run

Security Note

Because the systemd service downloads and executes code from a plain-HTTP URL on the internal network, ensure appropriate network security measures are in place (trusted network segment, restricted access to the Gitea host).

CI

Workflow       Purpose                         Triggers
lint.yml       flake8 on all .py files         Every push and PR
security.yml   bandit -ll (medium+ severity)   Every push, PR, and weekly (Monday 6am)

Branch protection is enabled on main — the lint check must pass before any PR can merge. Lint config: .flake8 (max-line-length 120, F841/E501 ignored).
