jared 26e2d1cec8
Strip volatile SMART counters from ticket title to stop comment spam
Power_On_Hours and other SMART counters embedded in the issue string
were included verbatim in the ticket title. Since the count increments
every hour, the title was "new" on every run, triggering a title-update
comment every single cycle (307 spam comments on two tickets).

Strip ': Warning <attr>: <N>' / ': Critical <attr>: <N>' suffixes from
the title before building the ticket payload. Counter values are still
fully captured in the ticket description.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 08:16:35 -04:00
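
The title clean-up described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the daemon's actual code: the function name `strip_volatile_counters`, the regex, and the sample issue string are all assumptions.

```python
import re

# Hypothetical pattern: matches a trailing ': Warning <attr>: <N>' or
# ': Critical <attr>: <N>' suffix, where <attr> is a SMART attribute name
# and <N> is an integer counter value.
_VOLATILE_SUFFIX = re.compile(r":\s*(?:Warning|Critical)\s+[\w\-]+:\s*\d+\s*$")


def strip_volatile_counters(issue: str) -> str:
    """Return the issue string without a trailing SMART counter suffix."""
    return _VOLATILE_SUFFIX.sub("", issue)


# Illustrative issue string; the real format may differ.
title = strip_volatile_counters(
    "Drive /dev/sda has SMART issues: Warning Power_On_Hours: 51234"
)
```

Because the counter no longer appears in the title, two consecutive runs produce the same title even though Power_On_Hours has incremented, so no title-update comment is triggered.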

System Health Monitoring Daemon


A robust system health monitoring daemon that tracks hardware status and automatically creates tickets for detected issues.

Features

  • Comprehensive system health monitoring:
    • Drive health (SMART status and disk usage)
    • Memory usage
    • CPU utilization
    • Network connectivity (Management and Ceph networks)
  • Automatic ticket creation for detected issues
  • Configurable thresholds and monitoring parameters
  • Dry-run mode for testing
  • Systemd integration for automated hourly checks
  • LXC container storage monitoring
  • Historical trend analysis for predictive failure detection
  • Manufacturer-specific SMART attribute interpretation
  • ECC memory error detection

Installation

  1. Copy the service and timer files to systemd:
     sudo cp hwmon.service /etc/systemd/system/
     sudo cp hwmon.timer /etc/systemd/system/
  2. Reload the systemd daemon:
     sudo systemctl daemon-reload
  3. Enable and start the timer:
     sudo systemctl enable hwmon.timer
     sudo systemctl start hwmon.timer

One liner (run as root)

curl -o /etc/systemd/system/hwmon.service http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmon.service && curl -o /etc/systemd/system/hwmon.timer http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmon.timer && systemctl daemon-reload && systemctl enable hwmon.timer && systemctl start hwmon.timer

Manual Execution

Direct Execution (from local file)

  1. Run the daemon in dry-run mode to test:
     python3 hwmonDaemon.py --dry-run
  2. Run the daemon normally:
     python3 hwmonDaemon.py

Remote Execution (same as systemd service)

Execute directly from the repository without saving the script to disk:

  1. Run in dry-run mode to test:
     /usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))" --dry-run
  2. Run normally (creates actual tickets):
     /usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))"

Configuration

The daemon monitors the following (the last two items are built-in behaviors rather than checks):

  • Disk usage (warns at 80%, critical at 90%)
  • LXC storage usage (warns at 80%, critical at 90%)
  • Memory usage (warns at 80%)
  • CPU usage (warns at 95%)
  • Network connectivity to management (10.10.10.1) and Ceph (10.10.90.1) networks
  • SMART status of physical drives with manufacturer-specific profiles
  • Temperature monitoring (warns at 65°C)
  • Automatic duplicate ticket prevention
  • Enhanced logging with debug capabilities
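
As a rough illustration of how the percentage and temperature thresholds listed above could be applied, here is a minimal sketch. The `THRESHOLDS` layout and the `classify` helper are hypothetical, and treating a value equal to the threshold as a breach (`>=`) is an assumption; the daemon may structure this differently.

```python
from typing import Optional

# Threshold values come from the list above; the dict layout is illustrative.
THRESHOLDS = {
    "disk": {"warning": 80.0, "critical": 90.0},
    "lxc": {"warning": 80.0, "critical": 90.0},
    "memory": {"warning": 80.0},
    "cpu": {"warning": 95.0},
    "temperature": {"warning": 65.0},  # degrees Celsius
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a measured value."""
    levels = THRESHOLDS[metric]
    # Check the critical level first so it takes precedence over warning.
    if "critical" in levels and value >= levels["critical"]:
        return "critical"
    if value >= levels["warning"]:
        return "warning"
    return None
```

For example, `classify("disk", 85.0)` yields a warning, while `classify("cpu", 50.0)` yields no alert.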

Data Storage

The daemon creates and maintains:

  • Log Directory: /var/log/hwmonDaemon/
  • Historical SMART Data: JSON files for trend analysis
  • Data Retention: 30 days of historical monitoring data
  • Storage Limit: Automatically enforced 10MB maximum
  • Cleanup: Oldest files deleted first when limit exceeded
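
The oldest-first cleanup described above might look something like this sketch. The function name `enforce_storage_limit` is hypothetical, file age is assumed to mean modification time, and 10MB is assumed to mean 10 * 1024 * 1024 bytes; the 30-day retention rule is not shown.

```python
from pathlib import Path
from typing import List, Union

MAX_BYTES = 10 * 1024 * 1024  # assumed interpretation of the 10MB limit


def enforce_storage_limit(directory: Union[str, Path],
                          max_bytes: int = MAX_BYTES) -> List[str]:
    """Delete the oldest files until the directory fits within max_bytes.

    Returns the names of the deleted files, oldest first.
    """
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime)  # oldest first
    total = sum(p.stat().st_size for p in files)
    removed = []
    for path in files:
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        removed.append(path.name)
    return removed
```

A call such as `enforce_storage_limit("/var/log/hwmonDaemon/")` after each run would keep the history directory bounded; whether the daemon invokes cleanup at that point is an assumption.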

Ticket Creation

The daemon automatically creates tickets with:

  • Standardized titles including hostname, hardware type, and scope
  • Detailed descriptions of detected issues with drive specifications
  • Priority levels based on severity (P2-P4)
  • Proper categorization and status tracking
  • Executive summaries and technical analysis

Dependencies

  • Python 3
  • Required Python packages:
    • psutil
    • requests
  • System tools:
    • smartmontools (for SMART disk monitoring)
    • nvme-cli (for NVMe drive monitoring)

Excluded Paths

The following paths are automatically excluded from monitoring:

  • /media/*
  • /mnt/pve/mediafs/*
  • /opt/metube_downloads
  • Pattern-based exclusions for media and download directories
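
One way to express the explicit exclusions above is shell-style pattern matching, sketched below. The `is_excluded` helper is hypothetical and covers only the three listed paths; the additional pattern-based exclusions are not specified here, so they are left out.

```python
from fnmatch import fnmatchcase

# The three explicit entries from the list above.
EXCLUDED_PATTERNS = [
    "/media/*",
    "/mnt/pve/mediafs/*",
    "/opt/metube_downloads",
]


def is_excluded(mountpoint: str) -> bool:
    """Return True if the mountpoint matches any excluded pattern."""
    # fnmatchcase is case-sensitive on every platform; '*' also matches '/'.
    return any(fnmatchcase(mountpoint, pattern) for pattern in EXCLUDED_PATTERNS)
```

Under these patterns `/media/usb0` is skipped, while a path such as `/var/lib/vz` is still monitored.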

Service Configuration

The daemon runs:

  • Hourly via systemd timer (with 60-second randomized delay)
  • As root user for hardware access
  • With automatic restart on failure
  • 5-minute timeout for execution
  • Logs to systemd journal

Recent Improvements

Version 2.0 (January 2026):

  • Added 10MB storage limit with automatic cleanup
  • File locking to prevent race conditions
  • Disabled monitoring for unreliable Ridata drives
  • Added timeouts to all network/subprocess calls (10s API, 30s subprocess)
  • Fixed unchecked regex patterns
  • Improved error handling throughout
  • Enhanced systemd service configuration with restart policies
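
The subprocess timeout mentioned above can be illustrated with the standard library alone. This is a sketch under assumptions: the `run_command` wrapper is hypothetical, and returning `None` on timeout is one possible policy, not necessarily the daemon's.

```python
import subprocess
from typing import List, Optional

SUBPROCESS_TIMEOUT = 30  # seconds, per the list above


def run_command(cmd: List[str],
                timeout: float = SUBPROCESS_TIMEOUT) -> Optional[str]:
    """Run a command and return its stdout, or None if it timed out."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None
    return result.stdout
```

The API side works similarly: calls in the `requests` library, such as `requests.post`, accept a `timeout` argument, which would be set to 10 seconds to match the figure above.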

Troubleshooting

# View service logs
sudo journalctl -u hwmon.service -f

# Check service status
sudo systemctl status hwmon.timer

# Manual test run
python3 hwmonDaemon.py --dry-run

Security Note

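The systemd service and the remote-execution commands above fetch hwmonDaemon.py over plain HTTP and execute it as root. Restrict access to the repository host and to the network path leading to it so that the served code cannot be tampered with.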

CI

Workflow     | Purpose                       | Triggers
lint.yml     | flake8 on all .py files       | Every push and PR
security.yml | bandit -ll (medium+ severity) | Every push, PR, and weekly Monday 6am

Branch protection is enabled on main: the lint check must pass before any PR can merge. Lint config: .flake8 (max-line-length 120; F841 and E501 ignored).

Description
A Python-based system health monitoring daemon that automatically tracks hardware status and creates tickets for detected issues in the LotusGuild Cluster.