Jared Vititoe 7383a0c674 Escape special characters in Prometheus metric labels
Add escape function to sanitize backslashes, double quotes, and newlines
in label values per Prometheus text format spec. Prevents corrupted
metrics output from model names or paths containing these characters.

Resolves #10

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:57:37 -05:00

System Health Monitoring Daemon

A robust system health monitoring daemon that tracks hardware status and automatically creates tickets for detected issues.

Features

  • Comprehensive system health monitoring:
    • Drive health (SMART status and disk usage)
    • Memory usage
    • CPU utilization
    • Network connectivity (Management and Ceph networks)
  • Automatic ticket creation for detected issues
  • Configurable thresholds and monitoring parameters
  • Dry-run mode for testing
  • Systemd integration for automated hourly checks via a systemd timer
  • LXC container storage monitoring
  • Historical trend analysis for predictive failure detection
  • Manufacturer-specific SMART attribute interpretation
  • ECC memory error detection

Installation

  1. Copy the service and timer files to the systemd unit directory:
sudo cp hwmon.service /etc/systemd/system/
sudo cp hwmon.timer /etc/systemd/system/
  2. Reload the systemd daemon:
sudo systemctl daemon-reload
  3. Enable and start the timer:
sudo systemctl enable hwmon.timer
sudo systemctl start hwmon.timer

One-liner (run as root)

curl -o /etc/systemd/system/hwmon.service http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmon.service && curl -o /etc/systemd/system/hwmon.timer http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmon.timer && systemctl daemon-reload && systemctl enable hwmon.timer && systemctl start hwmon.timer

Manual Execution

Direct Execution (from local file)

  1. Run the daemon in dry-run mode to test:
python3 hwmonDaemon.py --dry-run
  2. Run the daemon normally:
python3 hwmonDaemon.py

Remote Execution (same as systemd service)

Execute directly from the repository without saving the script locally:

  1. Run in dry-run mode to test:
/usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))" --dry-run
  2. Run normally (creates actual tickets):
/usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))"

Configuration

The daemon monitors the following, with these default thresholds and behaviors:

  • Disk usage (warns at 80%, critical at 90%)
  • LXC storage usage (warns at 80%, critical at 90%)
  • Memory usage (warns at 80%)
  • CPU usage (warns at 95%)
  • Network connectivity to management (10.10.10.1) and Ceph (10.10.90.1) networks
  • SMART status of physical drives with manufacturer-specific profiles
  • Temperature monitoring (warns at 65°C)
  • Automatic duplicate ticket prevention
  • Enhanced logging with debug capabilities
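
The thresholds above can be sketched as a small lookup table plus a classifier. This is a minimal illustration only: the names `THRESHOLDS` and `classify` are hypothetical and not taken from hwmonDaemon.py, and treating each limit as inclusive ("warns at 80%" means 80% triggers a warning) is an assumption.

```python
from typing import Optional

# Values come from the list above; the dictionary layout and key names are
# illustrative assumptions, not the daemon's actual configuration.
THRESHOLDS = {
    "disk": {"warning": 80.0, "critical": 90.0},
    "lxc_storage": {"warning": 80.0, "critical": 90.0},
    "memory": {"warning": 80.0},
    "cpu": {"warning": 95.0},
    "temperature_c": {"warning": 65.0},
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a measured value."""
    levels = THRESHOLDS[metric]
    # Check the most severe level first so 92% disk usage reports 'critical'.
    for level in ("critical", "warning"):
        limit = levels.get(level)
        # Inclusive comparison is an assumption about the daemon's behavior.
        if limit is not None and value >= limit:
            return level
    return None
```

For example, `classify("disk", 85.0)` yields a warning, while `classify("memory", 95.0)` also yields only a warning because no critical level is listed for memory.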

Data Storage

The daemon creates and maintains:

  • Log Directory: /var/log/hwmonDaemon/
  • Historical SMART Data: JSON files for trend analysis
  • Data Retention: 30 days of historical monitoring data
  • Storage Limit: Automatically enforced 10MB maximum
  • Cleanup: Oldest files deleted first when limit exceeded
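
The retention and cleanup rules above could look roughly like the following. It is a sketch under stated assumptions, not the daemon's code: the function name `enforce_limits` is hypothetical, file age is taken from the modification time, and "10MB" is interpreted as 10 × 1024 × 1024 bytes.

```python
import time
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024      # 10MB limit (binary interpretation assumed)
MAX_AGE_SECONDS = 30 * 24 * 3600  # 30 days of retention


def enforce_limits(directory: Path, max_bytes: int = MAX_BYTES,
                   max_age: float = MAX_AGE_SECONDS) -> list:
    """Delete expired files, then the oldest files until under max_bytes.

    Returns the list of deleted paths in deletion order.
    """
    now = time.time()
    deleted = []
    # Sort oldest first by modification time.
    files = sorted((p for p in directory.iterdir() if p.is_file()),
                   key=lambda p: p.stat().st_mtime)
    kept = []
    for path in files:
        if now - path.stat().st_mtime > max_age:
            path.unlink()
            deleted.append(path)
        else:
            kept.append(path)
    # Enforce the size cap on what remains, removing the oldest files first.
    total = sum(p.stat().st_size for p in kept)
    for path in kept:
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        deleted.append(path)
    return deleted
```

Applying the age check before the size check means expired files never count toward the 10MB budget.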

Ticket Creation

The daemon automatically creates tickets with:

  • Standardized titles including hostname, hardware type, and scope
  • Detailed descriptions of detected issues with drive specifications
  • Priority levels based on severity (P2-P4)
  • Proper categorization and status tracking
  • Executive summaries and technical analysis
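
As a rough illustration of the ticket fields above, the sketch below builds a payload and checks for duplicates. Everything specific in it is an assumption: the title format, the severity-to-priority mapping (the README only states that priorities range from P2 to P4), and the function names `build_ticket` and `is_duplicate` are hypothetical and do not describe the ticket system's real API.

```python
import socket

# Assumption: the README only says priorities span P2-P4 by severity;
# this particular mapping is illustrative.
SEVERITY_TO_PRIORITY = {"critical": "P2", "warning": "P3", "info": "P4"}


def build_ticket(hardware_type, scope, issue, severity, hostname=None):
    """Build a ticket payload with a standardized title (hypothetical format)."""
    hostname = hostname or socket.gethostname()
    return {
        "title": f"[{hostname}][{hardware_type}][{scope}] {issue}",
        "priority": SEVERITY_TO_PRIORITY[severity],
        "description": issue,
    }


def is_duplicate(ticket, open_titles):
    """Treat a ticket as a duplicate if an open ticket has the same title."""
    return ticket["title"] in set(open_titles)
```

Keeping the title deterministic for a given host, hardware type, and issue is what would make a title comparison usable for duplicate prevention.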

Dependencies

  • Python 3
  • Required Python packages:
    • psutil
    • requests
  • System tools:
    • smartmontools (for SMART disk monitoring)
    • nvme-cli (for NVMe drive monitoring)
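
Before installing the timer, it can help to confirm that these dependencies are present. The following is a small standalone check, not part of hwmonDaemon.py; the function name `missing_dependencies` is hypothetical. It looks for the `smartctl` and `nvme` executables, which are the commands provided by smartmontools and nvme-cli.

```python
import importlib.util
import shutil

# Names taken from the Dependencies list; smartctl and nvme are the
# executables shipped by smartmontools and nvme-cli respectively.
PYTHON_PACKAGES = ("psutil", "requests")
SYSTEM_TOOLS = ("smartctl", "nvme")


def missing_dependencies(packages=PYTHON_PACKAGES, tools=SYSTEM_TOOLS):
    """Return (missing_packages, missing_tools) as two lists."""
    # find_spec returns None when a top-level module cannot be imported.
    missing_packages = [name for name in packages
                        if importlib.util.find_spec(name) is None]
    # shutil.which returns None when the executable is not on PATH.
    missing_tools = [name for name in tools if shutil.which(name) is None]
    return missing_packages, missing_tools
```

Calling `missing_dependencies()` with no arguments returns two empty lists on a host that has everything installed. Because the service runs as root, run the check as root as well so that the same PATH is searched.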

Excluded Paths

The following paths are automatically excluded from monitoring:

  • /media/*
  • /mnt/pve/mediafs/*
  • /opt/metube_downloads
  • Pattern-based exclusions for media and download directories
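
One way to express the exclusions above is with shell-style glob patterns. This sketch assumes `fnmatch`-style matching against mount points; the daemon's own matching logic may differ, and the names `EXCLUDED_PATTERNS` and `is_excluded` are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the list above. Using fnmatch-style globs is an
# assumption about how the exclusions are evaluated.
EXCLUDED_PATTERNS = ("/media/*", "/mnt/pve/mediafs/*", "/opt/metube_downloads")


def is_excluded(mountpoint: str, patterns=EXCLUDED_PATTERNS) -> bool:
    """Return True if the mount point matches any exclusion pattern."""
    # Normalize a trailing slash so "/opt/metube_downloads/" still matches.
    path = mountpoint.rstrip("/") or "/"
    return any(fnmatchcase(path, pattern) for pattern in patterns)
```

In `fnmatch` patterns, `*` also matches `/`, so `/media/*` covers nested paths such as `/media/usb0/photos` but not `/media` itself.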

Service Configuration

The daemon runs:

  • Hourly via systemd timer (with 60-second randomized delay)
  • As root user for hardware access
  • With automatic restart on failure
  • With a 5-minute execution timeout
  • With logs sent to the systemd journal

Recent Improvements

Version 2.0 (January 2026):

  • Added 10MB storage limit with automatic cleanup
  • Added file locking to prevent race conditions
  • Disabled monitoring for unreliable Ridata drives
  • Added timeouts to all network/subprocess calls (10s API, 30s subprocess)
  • Fixed unchecked regex patterns
  • Improved error handling throughout
  • Enhanced systemd service configuration with restart policies
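
Two of the changes above, file locking and subprocess timeouts, follow common Python patterns. The sketch below shows one way to implement them on Linux; it is not taken from hwmonDaemon.py. The helper names `exclusive_lock` and `run_with_timeout` are hypothetical, and which files the daemon actually locks is not stated in this README.

```python
import fcntl
import subprocess
from contextlib import contextmanager

SUBPROCESS_TIMEOUT = 30  # seconds, matching the subprocess timeout listed above


@contextmanager
def exclusive_lock(lock_path: str):
    """Hold an exclusive advisory lock on lock_path for the with-block."""
    with open(lock_path, "w") as handle:
        # LOCK_NB makes a second holder fail fast with BlockingIOError
        # instead of waiting for the first one to finish.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield handle
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_with_timeout(command, timeout=SUBPROCESS_TIMEOUT):
    """Run a command, returning its stdout, or None if it timed out."""
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return None
    return result.stdout
```

A non-blocking lock suits a timer-driven job: if a previous run is still active, the new run can exit immediately rather than queue up behind it.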

Troubleshooting

# View service logs
sudo journalctl -u hwmon.service -f

# Check service status
sudo systemctl status hwmon.timer

# Manual test run
python3 hwmonDaemon.py --dry-run

Security Note

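The service downloads and executes code from a remote URL, and it runs as root. The URLs shown in this README use plain HTTP, so restrict access to the repository host and to the network between it and the monitored nodes: anyone who can alter the served script can run arbitrary code on every host that uses it.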

Description
A Python-based system health monitoring daemon that automatically tracks hardware status and creates tickets for detected issues in the LotusGuild Cluster.