Compare commits

17 Commits

d1750ea6cf Add Proxmox Backup Server (PBS) health monitoring support
Monitors ZFS pool status/usage and failed PBS tasks (backup, GC, sync).
Includes configurable thresholds (PBS_ZFS_WARNING/CRITICAL), Prometheus
metrics (hwmon_pbs_*), dry-run summary, issue categorization, and
priority classification. Enabled via PBS_ENABLED=true in .env config.

Fixes: #5

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:18:41 -05:00
07782da7b6 Add HTTP health check endpoint on port 9102
Lightweight /health endpoint returns JSON with status, hostname, and
last check timestamp. Runs as daemon thread, activated via --health-server
flag or HEALTH_SERVER_ENABLED=true in .env config.

Fixes: #21

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:15:15 -05:00
b02e416117 Parallelize SMART health checks across drives with ThreadPoolExecutor
Runs SMART checks concurrently (up to 8 workers) instead of
sequentially, significantly reducing check time on multi-drive systems.
Results are collected and processed in original disk order.

Fixes: #22

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:13:50 -05:00
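The pattern this commit describes can be sketched as follows; `check_drive` is a hypothetical stand-in for the real smartctl-based check, and the futures are read back in submission order so results stay in the original disk order:

```python
from concurrent.futures import ThreadPoolExecutor

def check_drive(device: str) -> dict:
    # Hypothetical stand-in for the real per-drive SMART check,
    # which shells out to smartctl.
    return {'device': device, 'status': 'HEALTHY'}

def check_all_drives(devices, max_workers=8):
    # Submit one check per drive, then collect results in the original
    # disk order so downstream processing stays deterministic even
    # though the checks themselves finish in arbitrary order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {dev: pool.submit(check_drive, dev) for dev in devices}
        return [futures[dev].result() for dev in devices]

results = check_all_drives(['/dev/sda', '/dev/sdb', '/dev/nvme0n1'])
```

With 8 workers, total wall time approaches that of the slowest single drive rather than the sum of all drives.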
7b36255fb4 Add graceful degradation when external tools are missing
Checks availability of required (smartctl, lsblk) and optional (nvme,
ceph, pct, dmidecode) tools at startup. Guards all tool-dependent code
sections to skip gracefully with informative log messages instead of
crashing. Also fixes pre-existing indentation bug in LXC exception handler.

Fixes: #19

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:13:08 -05:00
92bca248ac Add deduplication clarification comments for Ceph ticket handling
Explains that the ticket API deduplicates using SHA-256 hash of
(category + tags + hostname + device), not description/timestamp.
Clarifies the 24-hour dedup window and cluster-wide hostname exclusion.

Fixes: #18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:03:33 -05:00
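A minimal sketch of the dedup scheme as described above (the function name, field separator, and `cluster_wide` flag are illustrative, not the API's actual code):

```python
import hashlib

def dedup_key(category: str, tags: str, hostname: str, device: str,
              cluster_wide: bool = False) -> str:
    # Description and timestamp are deliberately excluded, so the same
    # issue re-reported later hashes identically; hostname is dropped for
    # cluster-wide issues so every node produces the same key.
    host = '' if cluster_wide else hostname
    raw = f"{category}|{tags}|{host}|{device}"
    return hashlib.sha256(raw.encode()).hexdigest()

node_a = dedup_key('Hardware', '[ceph][cluster-wide]', 'pve1', 'osd.3', cluster_wide=True)
node_b = dedup_key('Hardware', '[ceph][cluster-wide]', 'pve2', 'osd.3', cluster_wide=True)
```

Because `node_a == node_b`, any node in the cluster can report the issue and the API still collapses it into one ticket within the 24-hour window.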
4a186fb6d6 Create replacement tickets for Ridata drives instead of silently skipping
Ridata drives are known unreliable hardware. Instead of skipping them
with no notification, flag as REPLACEMENT_NEEDED and create tickets
recommending replacement.

Resolves #13

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:01:24 -05:00
90346a2da1 Replace fragile column-index LXC storage parsing with regex
Use regex pattern matching instead of split()[N] indexing for parsing
pct df output. This is more robust against variations in column
formatting and whitespace.

Resolves #11

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:00:50 -05:00
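A sketch of the idea, assuming a `pct df` output line shaped like the hypothetical sample below (the real column layout may differ, which is exactly why matching fields by shape beats `split()[N]` indexing):

```python
import re

# Hypothetical sample line; the real `pct df` columns may differ.
line = "rootfs  local-lvm:vm-101-disk-0  8388608  2097152  6291456  25.0%"

# Named groups match each field by its shape, so extra whitespace or a
# shifted column fails the match loudly instead of misparsing silently.
PCT_DF_RE = re.compile(
    r'^(?P<mp>\S+)\s+(?P<volume>\S+)\s+(?P<size>\d+)\s+'
    r'(?P<used>\d+)\s+(?P<avail>\d+)\s+(?P<pct>[\d.]+)%'
)

m = PCT_DF_RE.match(line)
usage = float(m.group('pct')) if m else None
```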
308a8d5c5c Cache drive details to eliminate redundant smartctl calls
Add per-run cache for _get_drive_details() results. Each drive is
queried once via smartctl -i and the result is reused across SMART
health checks and ticket creation.

Resolves #15

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:00:25 -05:00
9f9cc1b763 Simplify disk detection to single lsblk call with full paths
Replace dual-method detection (lsblk + glob scanning) with single
lsblk -p call that returns full device paths directly. Adds timeout,
returns sorted results for consistency.

Resolves #14

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:59:49 -05:00
ab67d786ce Increase history storage limit to 50MB to match retention needs
With 50 drives checked hourly over 30 days, history data can reach ~36MB,
which exceeded the old 10MB limit and caused constant file churn. Increase
the limit to 50MB and make it configurable via HISTORY_MAX_BYTES in .env.

Resolves #12

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:59:24 -05:00
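The back-of-envelope arithmetic behind the new limit; the ~1 kB-per-entry figure is an assumption for illustration, not a measured value:

```python
# 50 drives, one SMART history entry per hour, 30-day retention.
drives, entries_per_day, days = 50, 24, 30
entry_bytes = 1000  # assumed average size of one JSON history entry

total_entries = drives * entries_per_day * days       # 36,000 entries
total_mb = total_entries * entry_bytes / 1_000_000    # ~36 MB

old_limit_mb, new_limit_mb = 10, 50
```

At ~36 MB the steady-state data overshoots the old 10 MB cap by more than 3x, so the limiter was constantly deleting files the retention policy still wanted.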
da2de4375e Add verbosity control with -v/--verbose flag
Change default log level from DEBUG to INFO to reduce noise during
hourly execution. Add --verbose/-v CLI flag to enable DEBUG logging
when needed for troubleshooting.

Resolves #16

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:58:43 -05:00
38dd120da2 Add config validation for .env values
Wrap all int() conversions in try/except to handle malformed .env values
gracefully. Validate that TICKET_API_KEY is neither empty nor the placeholder
value, logging a warning instead of raising to preserve dry-run compatibility.

Resolves #17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:58:02 -05:00
7383a0c674 Escape special characters in Prometheus metric labels
Add escape function to sanitize backslashes, double quotes, and newlines
in label values per Prometheus text format spec. Prevents corrupted
metrics output from model names or paths containing these characters.

Resolves #10

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:57:37 -05:00
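A minimal sketch of such an escape function (the function name is illustrative; the escape rules follow the Prometheus text exposition format):

```python
def escape_label_value(value: str) -> str:
    # Label values must escape backslash, double quote, and line feed.
    # Backslash is replaced first so the backslashes introduced by the
    # other two replacements are not themselves re-escaped.
    return (value.replace('\\', '\\\\')
                 .replace('"', '\\"')
                 .replace('\n', '\\n'))

model = 'WDC "WD40EFRX"\npath\\0'
metric_line = f'hwmon_drive_info{{model="{escape_label_value(model)}"}} 1'
```

Without this, a model string containing a quote or newline would terminate the label early and corrupt every metric that follows it in the exposition output.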
a3cf5a698f Add missing timeouts to all subprocess calls
Add timeout=30 to smartctl -i calls in _get_drive_details() and
_check_disk_firmware(), and dmidecode in _check_memory_usage().
Add TimeoutExpired handler in _check_disk_firmware(). Prevents
potential hangs when drives or system tools become unresponsive.

Resolves #9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:57:17 -05:00
c7309663de Fix NVMe temperature parsing bug for values > 99°C
Replace digits[:2] truncation with regex extraction of complete number.
Previously "123°C" would be parsed as 12 instead of 123.

Resolves #8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:37 -05:00
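The fix boils down to extracting the complete number rather than slicing the first two characters; a sketch (function name illustrative):

```python
import re

def parse_temperature(field: str):
    # Extract the full run of digits instead of truncating to the first
    # two characters, so readings above 99 parse correctly.
    m = re.search(r'(\d+)', field)
    return int(m.group(1)) if m else None
```

The old `digits[:2]` approach would turn a reading of 123 into 12, silently masking an overheating drive.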
0559f2d668 Fix file locking race condition in SMART trend analysis
Check both file existence AND size > 0 before opening in r+ mode.
Previously, an empty file (0 bytes) would be opened in r+ mode, causing
json.load() to fail on the empty content.

Resolves #7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:21 -05:00
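The mode-selection logic can be isolated as a small helper (the function name is illustrative; the real code inlines this check):

```python
import os

def history_open_mode(path: str) -> str:
    # A zero-byte file must be treated like a missing one: opening it in
    # 'r+' and calling json.load() on the empty content raises an error,
    # so fall back to 'w+' and start a fresh history list instead.
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return 'r+'
    return 'w+'
```

The empty-file case arises when another process creates the file (e.g. while acquiring a lock) but has not yet written the initial JSON payload.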
d79005eb42 Centralize hardcoded magic numbers into CONFIG dict
Move NEW_DRIVE_HOURS_THRESHOLD (720h) and SMART_ERROR_RECENT_HOURS (168h)
from inline literals to configurable CONFIG entries with .env support.

Resolves #20

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:00 -05:00


@@ -1,15 +1,16 @@
 #!/usr/bin/env python3
-import os, sys, json, requests, psutil, socket, subprocess, logging, argparse, urllib.request, re, glob, datetime, fcntl, textwrap
+import os, sys, json, requests, psutil, socket, subprocess, logging, argparse, urllib.request, re, glob, datetime, fcntl, textwrap, shutil
+from concurrent.futures import ThreadPoolExecutor, as_completed
 from typing import Dict, Any, List
 # =============================================================================
 # LOGGING SETUP
 # =============================================================================
 logger = logging.getLogger(__name__)
-logger.setLevel(logging.DEBUG)
+logger.setLevel(logging.INFO)
 console_handler = logging.StreamHandler()
-console_handler.setLevel(logging.DEBUG)
+console_handler.setLevel(logging.INFO)
 formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
 console_handler.setFormatter(formatter)
@@ -78,7 +79,16 @@ class SystemHealthMonitor:
         'CEPH_USAGE_CRITICAL': PRIORITIES['HIGH'],    # P2 - Cluster near full
         'CEPH_USAGE_WARNING': PRIORITIES['MEDIUM'],   # P3 - Cluster usage high
         'CEPH_PG_DEGRADED': PRIORITIES['HIGH'],       # P2 - PGs degraded
-        'CEPH_MON_DOWN': PRIORITIES['HIGH']           # P2 - Monitor down
+        'CEPH_MON_DOWN': PRIORITIES['HIGH'],          # P2 - Monitor down
+        # PBS (Proxmox Backup Server) issues
+        'PBS_ZFS_DEGRADED': PRIORITIES['CRITICAL'],   # P1 - ZFS pool degraded
+        'PBS_ZFS_USAGE_CRITICAL': PRIORITIES['HIGH'], # P2 - ZFS pool near full
+        'PBS_ZFS_USAGE_WARNING': PRIORITIES['MEDIUM'],# P3 - ZFS pool usage high
+        'PBS_ZFS_ERRORS': PRIORITIES['HIGH'],         # P2 - ZFS pool has errors
+        'PBS_BACKUP_FAILED': PRIORITIES['HIGH'],      # P2 - Backup job failed
+        'PBS_GC_FAILED': PRIORITIES['MEDIUM'],        # P3 - Garbage collection failed
+        'PBS_SYNC_FAILED': PRIORITIES['MEDIUM']       # P3 - Sync job failed
     }
     CONFIG = {
@@ -124,7 +134,19 @@ class SystemHealthMonitor:
         # Prometheus metrics settings
         'PROMETHEUS_ENABLED': False,        # Enable Prometheus metrics export
         'PROMETHEUS_PORT': 9101,            # Port for Prometheus metrics HTTP server
-        'PROMETHEUS_TEXTFILE_PATH': None    # Path for textfile collector (alternative to HTTP)
+        'PROMETHEUS_TEXTFILE_PATH': None,   # Path for textfile collector (alternative to HTTP)
+        # SMART analysis thresholds
+        'NEW_DRIVE_HOURS_THRESHOLD': 720,   # Hours to consider a drive "new" (~30 days)
+        'SMART_ERROR_RECENT_HOURS': 168,    # Hours window for recent SMART errors (~1 week)
+        # Storage limits
+        'HISTORY_MAX_BYTES': 52428800,      # 50MB max storage for history files
+        # Health check endpoint
+        'HEALTH_SERVER_ENABLED': False,     # Enable HTTP health check endpoint
+        'HEALTH_SERVER_PORT': 9102,         # Port for health check endpoint
+        # PBS (Proxmox Backup Server) monitoring
+        'PBS_ENABLED': False,               # Enable PBS health monitoring
+        'PBS_ZFS_WARNING': 80,              # ZFS pool usage warning threshold %
+        'PBS_ZFS_CRITICAL': 90              # ZFS pool usage critical threshold %
     }
     @classmethod
@@ -166,15 +188,24 @@ class SystemHealthMonitor:
                     cls.CONFIG['CEPH_TICKET_NODE'] = value if value else None
                     logger.info(f"✓ Loaded CEPH_TICKET_NODE: {value}")
                 elif key == 'CEPH_USAGE_WARNING':
-                    cls.CONFIG['CEPH_USAGE_WARNING'] = int(value)
+                    try:
+                        cls.CONFIG['CEPH_USAGE_WARNING'] = int(value)
+                    except ValueError:
+                        logger.warning(f"Invalid CEPH_USAGE_WARNING value: {value}")
                 elif key == 'CEPH_USAGE_CRITICAL':
-                    cls.CONFIG['CEPH_USAGE_CRITICAL'] = int(value)
+                    try:
+                        cls.CONFIG['CEPH_USAGE_CRITICAL'] = int(value)
+                    except ValueError:
+                        logger.warning(f"Invalid CEPH_USAGE_CRITICAL value: {value}")
                 # Prometheus settings
                 elif key == 'PROMETHEUS_ENABLED':
                     cls.CONFIG['PROMETHEUS_ENABLED'] = value.lower() in ('true', '1', 'yes')
                     logger.info(f"✓ Loaded PROMETHEUS_ENABLED: {cls.CONFIG['PROMETHEUS_ENABLED']}")
                 elif key == 'PROMETHEUS_PORT':
-                    cls.CONFIG['PROMETHEUS_PORT'] = int(value)
+                    try:
+                        cls.CONFIG['PROMETHEUS_PORT'] = int(value)
+                    except ValueError:
+                        logger.warning(f"Invalid PROMETHEUS_PORT value: {value}")
                 elif key == 'PROMETHEUS_TEXTFILE_PATH':
                     cls.CONFIG['PROMETHEUS_TEXTFILE_PATH'] = value if value else None
                     logger.info(f"✓ Loaded PROMETHEUS_TEXTFILE_PATH: {value}")
@@ -182,10 +213,53 @@ class SystemHealthMonitor:
                 elif key == 'CLUSTER_NAME':
                     cls.CONFIG['CLUSTER_NAME'] = value if value else 'proxmox-cluster'
                     logger.info(f"✓ Loaded CLUSTER_NAME: {value}")
+                elif key == 'NEW_DRIVE_HOURS_THRESHOLD':
+                    try:
+                        cls.CONFIG['NEW_DRIVE_HOURS_THRESHOLD'] = int(value)
+                    except ValueError:
+                        logger.warning(f"Invalid NEW_DRIVE_HOURS_THRESHOLD value: {value}")
+                elif key == 'SMART_ERROR_RECENT_HOURS':
+                    try:
+                        cls.CONFIG['SMART_ERROR_RECENT_HOURS'] = int(value)
+                    except ValueError:
+                        logger.warning(f"Invalid SMART_ERROR_RECENT_HOURS value: {value}")
+                elif key == 'HISTORY_MAX_BYTES':
+                    try:
+                        cls.CONFIG['HISTORY_MAX_BYTES'] = int(value)
+                    except ValueError:
+                        logger.warning(f"Invalid HISTORY_MAX_BYTES value: {value}")
+                # PBS settings
+                elif key == 'PBS_ENABLED':
+                    cls.CONFIG['PBS_ENABLED'] = value.lower() in ('true', '1', 'yes')
+                    logger.info(f"✓ Loaded PBS_ENABLED: {cls.CONFIG['PBS_ENABLED']}")
+                elif key == 'PBS_ZFS_WARNING':
+                    try:
+                        cls.CONFIG['PBS_ZFS_WARNING'] = int(value)
+                    except ValueError:
+                        logger.warning(f"Invalid PBS_ZFS_WARNING value: {value}")
+                elif key == 'PBS_ZFS_CRITICAL':
+                    try:
+                        cls.CONFIG['PBS_ZFS_CRITICAL'] = int(value)
+                    except ValueError:
+                        logger.warning(f"Invalid PBS_ZFS_CRITICAL value: {value}")
+                # Health server settings
+                elif key == 'HEALTH_SERVER_ENABLED':
+                    cls.CONFIG['HEALTH_SERVER_ENABLED'] = value.lower() in ('true', '1', 'yes')
+                    logger.info(f"✓ Loaded HEALTH_SERVER_ENABLED: {cls.CONFIG['HEALTH_SERVER_ENABLED']}")
+                elif key == 'HEALTH_SERVER_PORT':
+                    try:
+                        cls.CONFIG['HEALTH_SERVER_PORT'] = int(value)
+                    except ValueError:
+                        logger.warning(f"Invalid HEALTH_SERVER_PORT value: {value}")
         except Exception as e:
             logger.error(f"Failed to load .env file: {e}")
+        # Validate critical configuration
+        api_key = cls.CONFIG.get('TICKET_API_KEY')
+        if not api_key or api_key == 'your_api_key_here':
+            logger.warning("TICKET_API_KEY is not configured - ticket creation will fail (dry-run will still work)")
     TICKET_TEMPLATES = {
         'ACTION_TYPE': {
             'AUTO': '[auto]',
@@ -607,13 +681,21 @@ class SystemHealthMonitor:
     # =============================================================================
     # INITIALIZATION
     # =============================================================================
-    def __init__(self, ticket_api_url: str = 'http://10.10.10.45/create_ticket_api.php', dry_run: bool = False):
+    def __init__(self, ticket_api_url: str = 'http://10.10.10.45/create_ticket_api.php', dry_run: bool = False, verbose: bool = False):
         """
         Initialize the system health monitor.
         :param ticket_api_url: URL for the ticket creation API.
         :param dry_run: If True, simulate API calls without sending requests.
+        :param verbose: If True, enable DEBUG-level logging output.
         """
+        # Set log verbosity
+        if verbose:
+            logger.setLevel(logging.DEBUG)
+            for handler in logger.handlers:
+                handler.setLevel(logging.DEBUG)
+            logger.debug("Verbose logging enabled")
         # Load environment configuration first (API keys, etc.)
         self.load_env_config()
@@ -623,13 +705,59 @@ class SystemHealthMonitor:
         # Ensure history directory exists
         os.makedirs(self.CONFIG['HISTORY_DIR'], exist_ok=True)
+        # Drive details cache (per-run, cleared on next execution)
+        self._drive_details_cache = {}
+        # Health check tracking
+        self._last_check_timestamp = None
+        self._last_check_status = 'unknown'
+        # Check tool availability at startup
+        self._available_tools = self._check_tool_availability()
+    def _check_tool_availability(self) -> Dict[str, bool]:
+        """Check which external tools are available on this system.
+        Returns a dict mapping tool names to availability booleans.
+        Logs warnings for missing required tools and info for missing optional tools.
+        """
+        required_tools = {
+            'smartctl': 'smartmontools',
+            'lsblk': 'util-linux',
+        }
+        optional_tools = {
+            'nvme': 'nvme-cli',
+            'ceph': 'ceph-common',
+            'pct': 'pve-container',
+            'dmidecode': 'dmidecode',
+            'proxmox-backup-manager': 'proxmox-backup-server',
+            'zpool': 'zfsutils-linux',
+        }
+        availability = {}
+        for tool, package in required_tools.items():
+            available = shutil.which(tool) is not None
+            availability[tool] = available
+            if not available:
+                logger.warning(f"Required tool '{tool}' not found (install: apt install {package})")
+        for tool, package in optional_tools.items():
+            available = shutil.which(tool) is not None
+            availability[tool] = available
+            if not available:
+                logger.debug(f"Optional tool '{tool}' not found (install: apt install {package})")
+        return availability
-    def _enforce_storage_limit(self, history_dir: str, max_bytes: int = 10485760):
+    def _enforce_storage_limit(self, history_dir: str, max_bytes: int = None):
         """
-        Delete oldest history files if directory exceeds size limit (default 10MB).
+        Delete oldest history files if directory exceeds size limit.
         :param history_dir: Directory containing history files
-        :param max_bytes: Maximum directory size in bytes (default 10MB)
+        :param max_bytes: Maximum directory size in bytes (default from CONFIG)
         """
+        if max_bytes is None:
+            max_bytes = self.CONFIG.get('HISTORY_MAX_BYTES', 52428800)
         if not os.path.exists(history_dir):
             return
@@ -668,6 +796,45 @@ class SystemHealthMonitor:
         except Exception as e:
             logger.error(f"Error enforcing storage limit: {e}")
+    # =============================================================================
+    # HEALTH CHECK ENDPOINT
+    # =============================================================================
+    def _start_health_server(self):
+        """Start a lightweight HTTP health check endpoint as a daemon thread."""
+        from http.server import HTTPServer, BaseHTTPRequestHandler
+        import threading
+        monitor = self
+        class HealthHandler(BaseHTTPRequestHandler):
+            def do_GET(self):
+                if self.path == '/health':
+                    response = {
+                        'status': monitor._last_check_status,
+                        'hostname': socket.gethostname(),
+                        'last_check': monitor._last_check_timestamp,
+                        'uptime': datetime.datetime.now().isoformat()
+                    }
+                    self.send_response(200)
+                    self.send_header('Content-Type', 'application/json')
+                    self.end_headers()
+                    self.wfile.write(json.dumps(response).encode())
+                else:
+                    self.send_response(404)
+                    self.end_headers()
+            def log_message(self, format, *args):
+                logger.debug(f"Health server: {format % args}")
+        port = self.CONFIG.get('HEALTH_SERVER_PORT', 9102)
+        try:
+            server = HTTPServer(('', port), HealthHandler)
+            thread = threading.Thread(target=server.serve_forever, daemon=True)
+            thread.start()
+            logger.info(f"Health check endpoint started on port {port}")
+        except OSError as e:
+            logger.warning(f"Could not start health server on port {port}: {e}")
     # =============================================================================
     # MAIN EXECUTION METHODS
     # =============================================================================
@@ -677,6 +844,10 @@ class SystemHealthMonitor:
         # Perform health checks and gather the report
         health_report = self.perform_health_checks()
+        # Track last check for health endpoint
+        self._last_check_timestamp = datetime.datetime.now().isoformat()
+        self._last_check_status = health_report.get('drives_health', {}).get('overall_status', 'unknown')
         # Create tickets for any detected critical issues
         self._create_tickets_for_issues(health_report)
@@ -699,7 +870,8 @@ class SystemHealthMonitor:
             'network_health': self._check_network_status(),
             'ceph_health': self._check_ceph_health(),
             'lxc_health': self._check_lxc_storage(),
-            'system_health': self._check_system_drive_indicators()
+            'system_health': self._check_system_drive_indicators(),
+            'pbs_health': self._check_pbs_health()
         }
         if self.dry_run:
@@ -755,6 +927,17 @@ class SystemHealthMonitor:
             if health_report['system_health']['issues']:
                 logger.info(f"\nSystem Issues: {len(health_report['system_health']['issues'])} found")
+            # PBS status
+            pbs = health_report.get('pbs_health', {})
+            if pbs.get('is_pbs_node'):
+                logger.info("\nPBS Status:")
+                for pool in pbs.get('zfs_pools', []):
+                    logger.info(f"  ZFS Pool '{pool['name']}': {pool['usage_percent']}% used ({pool['used']}/{pool['total']})")
+                if pbs.get('failed_tasks'):
+                    logger.info(f"  Failed tasks: {len(pbs['failed_tasks'])}")
+                if pbs.get('issues'):
+                    logger.info(f"  Issues: {len(pbs['issues'])}")
             logger.info("\n=== End Summary ===")
         return health_report
@@ -776,7 +959,10 @@ class SystemHealthMonitor:
         # Load historical data with file locking
         history = []
-        file_mode = 'r+' if os.path.exists(historical_file) else 'w+'
+        if os.path.exists(historical_file) and os.path.getsize(historical_file) > 0:
+            file_mode = 'r+'
+        else:
+            file_mode = 'w+'
         with open(historical_file, file_mode) as f:
             # Acquire exclusive lock
@@ -1025,7 +1211,10 @@ class SystemHealthMonitor:
     # =============================================================================
     # DRIVE HEALTH CHECKING METHODS
     # =============================================================================
     def _get_drive_details(self, device: str) -> Dict[str, str]:
-        """Get detailed drive information using smartctl."""
+        """Get detailed drive information using smartctl (cached per run)."""
+        if device in self._drive_details_cache:
+            return self._drive_details_cache[device]
         drive_details = {
             'model': None,
             'serial': None,
@@ -1041,7 +1230,8 @@ class SystemHealthMonitor:
                 ['smartctl', '-i', device],
                 stdout=subprocess.PIPE,
                 stderr=subprocess.PIPE,
-                text=True
+                text=True,
+                timeout=30
             )
             # Check if smartctl failed completely
@@ -1079,6 +1269,7 @@ class SystemHealthMonitor:
         except Exception as e:
             logger.debug(f"Error getting drive details for {device}: {e}")
+        self._drive_details_cache[device] = drive_details
         return drive_details
@@ -1501,7 +1692,9 @@ class SystemHealthMonitor:
             'critical reallocated', 'critical current_pending',
             'network is unreachable',
             'osd is down', 'osd down',  # Ceph OSD down
-            'cluster usage critical'  # Ceph usage critical
+            'cluster usage critical',  # Ceph usage critical
+            'zfs pool', 'backup failed',  # PBS critical issues
+            'usage critical'  # PBS ZFS critical usage
         ]):
             return self.PRIORITIES['HIGH']  # P2
@@ -1520,7 +1713,8 @@ class SystemHealthMonitor:
             'warning', 'high temperature', 'correctable ecc',
             'trend alert', 'critical storage usage',
             'low available_spare', 'high wear',
-            'health_warn', 'cluster usage warning'  # Ceph warnings
+            'health_warn', 'cluster usage warning',  # Ceph warnings
+            'gc failed', 'sync failed', 'usage high'  # PBS warnings
         ]):
             return self.PRIORITIES['MEDIUM']  # P3
@@ -1644,6 +1838,27 @@ class SystemHealthMonitor:
             self.TICKET_TEMPLATES['TICKET_TYPE']['PROBLEM']
         )
+        # PBS Issues - Backup server issues (categorized as Hardware for storage, Software for tasks)
+        if any(keyword in issue_lower for keyword in [
+            'pbs', 'zfs pool', 'backup failed', 'gc failed', 'sync failed'
+        ]):
+            if any(error in issue_lower for error in [
+                'degraded', 'critical', 'failed', 'errors'
+            ]):
+                return (
+                    self.TICKET_CATEGORIES['HARDWARE'],
+                    self.TICKET_TYPES['ISSUE'],
+                    '[pbs]',
+                    self.TICKET_TEMPLATES['TICKET_TYPE']['ISSUE']
+                )
+            else:
+                return (
+                    self.TICKET_CATEGORIES['HARDWARE'],
+                    self.TICKET_TYPES['PROBLEM'],
+                    '[pbs]',
+                    self.TICKET_TEMPLATES['TICKET_TYPE']['PROBLEM']
+                )
         # Default: Hardware Problem (for undefined cases)
         return (
             self.TICKET_CATEGORIES['HARDWARE'],
@@ -1717,6 +1932,10 @@ class SystemHealthMonitor:
         )
         description = self._generate_detailed_description(issue, health_report, priority)
+        # NOTE: The ticket API (create_ticket_api.php) deduplicates using a SHA-256 hash of:
+        #   issue_category + environment_tags + hostname (excluded for [cluster-wide]) + device
+        # Description content and timestamps are NOT included in the dedup hash.
+        # The 24-hour dedup window prevents duplicate tickets from multiple nodes or runs.
         ticket_payload = {
             "title": ticket_title,
             "description": description,
@@ -1776,7 +1995,7 @@ class SystemHealthMonitor:
                 continue
             # Only report issues for drives with valid SMART status
-            if drive.get('smart_issues') and drive.get('smart_status') in ['HEALTHY', 'UNHEALTHY', 'UNKNOWN']:
+            if drive.get('smart_issues') and drive.get('smart_status') in ['HEALTHY', 'UNHEALTHY', 'UNKNOWN', 'REPLACEMENT_NEEDED']:
                 # Filter out generic error messages and manufacturer-specific false positives
                 filtered_issues = []
                 for issue in drive['smart_issues']:
@@ -1840,7 +2059,8 @@ class SystemHealthMonitor:
         designated_node = self.CONFIG.get('CEPH_TICKET_NODE')
         # Cluster-wide issues: only create tickets from designated node (or first node if not set)
-        # The [cluster-wide] tag ensures deduplication in tinker_tickets API
+        # The [cluster-wide] tag + CLUSTER_NAME in ticket title ensures cross-node deduplication
+        # in the tinker_tickets API (dedup hash excludes hostname for cluster-wide issues)
         if ceph_health.get('cluster_wide_issues'):
             # If no designated node, all nodes can report (API deduplicates)
             # If designated node is set, only that node creates tickets
@@ -1856,6 +2076,12 @@ class SystemHealthMonitor:
             for issue in ceph_health['issues']:
                 issues.append(f"[ceph] {issue}")
+        # Check for PBS issues
+        pbs_health = health_report.get('pbs_health', {})
+        if pbs_health.get('is_pbs_node') and pbs_health.get('issues'):
+            for issue in pbs_health['issues']:
+                issues.append(f"[pbs] {issue.get('issue', str(issue))}")
         logger.info("=== Issue Detection Started ===")
         logger.info(f"Checking drives: {len(health_report['drives_health']['drives'])} found")
         logger.info(f"Memory status: {health_report['memory_health']['status']}")
@@ -1886,42 +2112,26 @@ class SystemHealthMonitor:
     # =============================================================================
     # DISK AND STORAGE UTILITY METHODS
     # =============================================================================
     def _get_all_disks(self) -> List[str]:
-        """Get all physical disks using multiple detection methods."""
+        """Get all physical disks using lsblk with full device paths."""
         disks = set()
-        # Method 1: Use lsblk to get physical disks, excluding virtual devices
         try:
             result = subprocess.run(
-                ['lsblk', '-d', '-n', '-o', 'NAME,TYPE'],
+                ['lsblk', '-d', '-n', '-o', 'NAME,TYPE', '-p'],
                 stdout=subprocess.PIPE,
-                text=True
+                text=True,
+                timeout=10
             )
             for line in result.stdout.strip().split('\n'):
                 if line:
                     parts = line.split()
-                    if len(parts) >= 2:
-                        name, device_type = parts[0], parts[1]
-                        # Only include actual disks, exclude virtual devices
-                        if device_type == 'disk' and not name.startswith('rbd'):
-                            disks.add(f"/dev/{name}")
-            logger.debug(f"Physical disks found via lsblk: {disks}")
+                    if len(parts) >= 2 and parts[1] == 'disk' and not parts[0].startswith('/dev/rbd'):
+                        disks.add(parts[0])
+            logger.debug(f"Physical disks found: {disks}")
+        except subprocess.TimeoutExpired:
+            logger.error("lsblk timed out during disk detection")
         except Exception as e:
-            logger.debug(f"lsblk detection failed: {e}")
-        # Method 2: Direct device scanning for physical devices only
-        for pattern in ['/dev/sd[a-z]', '/dev/nvme[0-9]n[0-9]']:
-            try:
-                import glob
-                matches = glob.glob(pattern)
-                # Filter out partitions (devices ending in numbers for sd*, already filtered for nvme)
-                if 'sd' in pattern:
-                    matches = [d for d in matches if not d[-1].isdigit()]
-                disks.update(matches)
-                logger.debug(f"Disks found via glob {pattern}: {matches}")
-            except Exception as e:
-                logger.debug(f"Glob detection failed for {pattern}: {e}")
-        return list(disks)
+            logger.error(f"Failed to detect disks: {e}")
+        return sorted(disks)
     def _is_excluded_mount(self, mountpoint: str) -> bool:
         """Check if a mountpoint should be excluded from monitoring."""
@@ -2055,7 +2265,8 @@ class SystemHealthMonitor:
                 ['smartctl', '-i', device],
                 stdout=subprocess.PIPE,
                 stderr=subprocess.PIPE,
-                text=True
+                text=True,
+                timeout=30
             )
             model_line = None
@@ -2095,6 +2306,8 @@ class SystemHealthMonitor:
                 logger.debug(f"Known issues: {firmware_info['known_issues']}")
                 logger.debug("=== End Firmware Check ===\n")
+        except subprocess.TimeoutExpired:
+            logger.warning(f"smartctl -i timed out for {device}")
         except Exception as e:
             firmware_info['known_issues'].append(f"Error checking firmware: {str(e)}")
@@ -2266,7 +2479,7 @@ class SystemHealthMonitor:
     def _is_new_drive(self, power_on_hours: int) -> bool:
         """Determine if a drive is considered "new" based on power-on hours."""
-        return power_on_hours < 720  # Less than 1 week of runtime
+        return power_on_hours < self.CONFIG['NEW_DRIVE_HOURS_THRESHOLD']
     def _check_smart_health(self, device: str) -> Dict[str, Any]:
         """Enhanced SMART health check with better error handling and predictive analysis."""
@@ -2303,12 +2516,13 @@ class SystemHealthMonitor:
             smart_health['issues'].append("Unable to read device information")
             return smart_health
 
-        # Skip Ridata drives entirely - unreliable and being replaced
+        # Ridata drives - known unreliable hardware, flag for replacement
         manufacturer = self._detect_manufacturer(drive_details.get('model', ''))
         if manufacturer == 'Ridata':
-            smart_health['status'] = 'SKIPPED'
-            smart_health['issues'].append("Ridata drive - monitoring disabled (unreliable hardware)")
-            logger.debug(f"Skipping SMART monitoring for Ridata drive {device}")
+            smart_health['status'] = 'REPLACEMENT_NEEDED'
+            smart_health['severity'] = 'WARNING'
+            smart_health['issues'].append("Ridata drive detected - known unreliable hardware, replacement recommended")
+            logger.info(f"Ridata drive {device} flagged for replacement")
             return smart_health
 
         logger.debug(f"Drive details for {device}: {drive_details}")
@@ -2501,7 +2715,7 @@ class SystemHealthMonitor:
             for match in error_matches:
                 error_hour = int(match.group(1))
                 current_hours = smart_health['attributes'].get('Power_On_Hours', 0)
-                if current_hours - error_hour < 168:  # Errors within last week
+                if current_hours - error_hour < self.CONFIG['SMART_ERROR_RECENT_HOURS']:
                     recent_errors.append(match.group(0))
 
             if recent_errors:
@@ -2540,8 +2754,8 @@ class SystemHealthMonitor:
             logger.debug(f"Detected Issues: {smart_health['issues']}")
             logger.debug("=== End SMART Check ===\n")
 
-            # Special handling for NVMe drives
-            if 'nvme' in device:
+            # Special handling for NVMe drives (requires nvme-cli)
+            if 'nvme' in device and self._available_tools.get('nvme'):
                 try:
                     nvme_result = subprocess.run(
                         ['nvme', 'smart-log', device],
@@ -2564,10 +2778,10 @@ class SystemHealthMonitor:
                         temp_str = line.split(':')[1].strip() if ':' in line else line.strip()
                         logger.debug(f"Raw temperature string: {temp_str}")
 
-                        # Extract first temperature value more safely
-                        digits = ''.join(c for c in temp_str if c.isdigit())
-                        if len(digits) >= 2:
-                            temp_value = int(digits[:2])
+                        # Extract the first complete number from temperature string
+                        temp_match = re.search(r'(\d+)', temp_str)
+                        if temp_match:
+                            temp_value = int(temp_match.group(1))
                             logger.debug(f"Parsed temperature value: {temp_value}")
 
                             # Set both temperature fields
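The motivation for this change is easy to show in isolation: the old digit-concatenation approach silently truncated three-digit readings, while `re.search` keeps the first complete number. A minimal sketch (the sample strings are illustrative, not captured nvme output):

```python
import re

def parse_first_number(temp_str: str):
    """Return the first complete integer found in a string, or None."""
    m = re.search(r'(\d+)', temp_str)
    return int(m.group(1)) if m else None

# The old ''.join(digits)[:2] slice would turn "311" into 31; the regex does not.
print(parse_first_number("38 C (311 Kelvin)"))  # 38
print(parse_first_number("104 C"))              # 104
print(parse_first_number("no reading"))         # None
```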
@@ -2608,6 +2822,10 @@ class SystemHealthMonitor:
             'manufacturer_profile': None
         }
 
+        if not self._available_tools.get('nvme'):
+            logger.debug(f"nvme-cli not available, skipping NVMe health check for {device}")
+            return smart_health
+
         try:
             # Use nvme-cli for NVMe devices
             result = subprocess.run(
@@ -2674,6 +2892,11 @@ class SystemHealthMonitor:
         """Check health of all drives in the system."""
         drives_health = {'overall_status': 'NORMAL', 'drives': []}
 
+        if not self._available_tools.get('smartctl') or not self._available_tools.get('lsblk'):
+            logger.warning("Drive health checks skipped: smartctl or lsblk not available")
+            drives_health['overall_status'] = 'UNKNOWN'
+            return drives_health
+
         try:
             # Get only valid physical disks
             physical_disks = self._get_all_disks()
@@ -2698,6 +2921,20 @@ class SystemHealthMonitor:
                     device_partitions[base_dev] = []
                 device_partitions[base_dev].append(part)
+            # Run SMART checks in parallel across all drives
+            smart_results = {}
+            max_workers = min(8, len(physical_disks))
+            with ThreadPoolExecutor(max_workers=max_workers) as executor:
+                futures = {executor.submit(self._check_smart_health, disk): disk for disk in physical_disks}
+                for future in as_completed(futures):
+                    disk = futures[future]
+                    try:
+                        smart_results[disk] = future.result()
+                    except Exception as e:
+                        logger.error(f"SMART check failed for {disk}: {e}")
+                        smart_results[disk] = {'status': 'ERROR', 'issues': [str(e)], 'temp': None, 'attributes': {}}
+
+            # Build drive reports in original disk order
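The fan-out/fan-in pattern above — submit all checks, collect as they complete, then rebuild results in input order — can be sketched standalone with a stand-in check function (the real code calls `self._check_smart_health(disk)`):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def check_all(items, check_fn, max_workers=8):
    """Run check_fn over items concurrently; return results in input order."""
    results = {}
    with ThreadPoolExecutor(max_workers=min(max_workers, max(1, len(items)))) as executor:
        # Map each future back to its item so failures can be attributed.
        futures = {executor.submit(check_fn, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as e:
                results[item] = {'status': 'ERROR', 'issues': [str(e)]}
    # as_completed yields in finish order; reindex by the original list.
    return [results[item] for item in items]

out = check_all(['sda', 'sdb', 'sdc'], lambda d: {'status': 'PASSED', 'disk': d})
print([r['disk'] for r in out])  # ['sda', 'sdb', 'sdc']
```

Keying the futures dict by future (not by item) is what lets a per-drive exception be logged against the right device without aborting the remaining checks.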
             overall_status = 'NORMAL'
             for disk in physical_disks:
                 drive_report = {
@@ -2733,8 +2970,8 @@ class SystemHealthMonitor:
                 if total_space > 0:
                     drive_report['usage_percent'] = (total_used / total_space) * 100
 
-                # Check SMART health
-                smart_health = self._check_smart_health(disk)
+                # Use pre-fetched SMART results
+                smart_health = smart_results.get(disk, {'status': 'ERROR', 'issues': [], 'temp': None, 'attributes': {}})
                 drive_report.update({
                     'smart_status': smart_health['status'],
                     'smart_issues': smart_health['issues'],
@@ -2798,17 +3035,19 @@ class SystemHealthMonitor:
         }
 
         try:
-            # First check using dmidecode
-            result = subprocess.run(
-                ['dmidecode', '--type', 'memory'],
-                stdout=subprocess.PIPE,
-                stderr=subprocess.PIPE,
-                text=True
-            )
-            if 'Error Correction Type: Multi-bit ECC' in result.stdout:
-                memory_health['has_ecc'] = True
+            # First check using dmidecode (if available)
+            if self._available_tools.get('dmidecode'):
+                result = subprocess.run(
+                    ['dmidecode', '--type', 'memory'],
+                    stdout=subprocess.PIPE,
+                    stderr=subprocess.PIPE,
+                    text=True,
+                    timeout=30
+                )
+                if 'Error Correction Type: Multi-bit ECC' in result.stdout:
+                    memory_health['has_ecc'] = True
 
-            # If dmidecode didn't find ECC, try the edac method as backup
+            # If dmidecode unavailable or didn't find ECC, try the edac method as backup
             if not memory_health['has_ecc']:
                 edac_path = '/sys/devices/system/edac/mc'
                 if os.path.exists(edac_path) and os.listdir(edac_path):
@@ -2945,8 +3184,6 @@ class SystemHealthMonitor:
         Returns health status, cluster info, and any issues detected.
         Cluster-wide issues use [cluster-wide] tag for cross-node deduplication.
         """
-        import shutil
-
         ceph_health = {
             'status': 'OK',
             'is_ceph_node': False,
@@ -2955,7 +3192,7 @@ class SystemHealthMonitor:
             'osd_status': [],
             'mon_status': [],
             'issues': [],
-            'cluster_wide_issues': []  # Issues that apply to entire cluster
+            'cluster_wide_issues': []  # Issues affecting entire cluster; use CLUSTER_NAME for dedup
         }
         # Check if Ceph monitoring is enabled
@@ -2964,7 +3201,7 @@ class SystemHealthMonitor:
             return ceph_health
 
         # Check if ceph CLI is available
-        if not shutil.which('ceph'):
+        if not self._available_tools.get('ceph'):
             logger.debug("Ceph CLI not found - not a Ceph node")
             return ceph_health
@@ -3131,6 +3368,186 @@ class SystemHealthMonitor:
         return ceph_health
+    # =============================================================================
+    # PBS (PROXMOX BACKUP SERVER) HEALTH CHECKS
+    # =============================================================================
+
+    def _check_pbs_health(self) -> Dict[str, Any]:
+        """
+        Check Proxmox Backup Server health including ZFS pools and task status.
+
+        Returns health status for ZFS pools, failed backup/GC/sync jobs.
+        Only active when PBS_ENABLED=true and relevant tools are available.
+        """
+        pbs_health = {
+            'status': 'OK',
+            'is_pbs_node': False,
+            'zfs_pools': [],
+            'failed_tasks': [],
+            'issues': []
+        }
+
+        if not self.CONFIG.get('PBS_ENABLED', False):
+            logger.debug("PBS monitoring disabled in config")
+            return pbs_health
+
+        if not self._available_tools.get('zpool'):
+            logger.debug("zpool not available - skipping PBS ZFS checks")
+            return pbs_health
+
+        pbs_health['is_pbs_node'] = True
+
+        # Check ZFS pool status
+        try:
+            result = subprocess.run(
+                ['zpool', 'status', '-p'],
+                stdout=subprocess.PIPE,
+                stderr=subprocess.PIPE,
+                text=True,
+                timeout=30
+            )
+            if result.returncode == 0:
+                current_pool = None
+                for line in result.stdout.splitlines():
+                    line_stripped = line.strip()
+                    if line_stripped.startswith('pool:'):
+                        current_pool = line_stripped.split(':', 1)[1].strip()
+                    elif line_stripped.startswith('state:') and current_pool:
+                        state = line_stripped.split(':', 1)[1].strip()
+                        if state != 'ONLINE':
+                            pbs_health['status'] = 'CRITICAL'
+                            pbs_health['issues'].append({
+                                'type': 'PBS_ZFS_DEGRADED',
+                                'severity': 'CRITICAL',
+                                'device': current_pool,
+                                'issue': f"ZFS pool '{current_pool}' state: {state}"
+                            })
+                    elif line_stripped.startswith('errors:') and current_pool:
+                        if 'No known data errors' not in line_stripped:
+                            pbs_health['issues'].append({
+                                'type': 'PBS_ZFS_ERRORS',
+                                'severity': 'WARNING',
+                                'device': current_pool,
+                                'issue': f"ZFS pool '{current_pool}' has errors: {line_stripped}"
+                            })
+        except subprocess.TimeoutExpired:
+            logger.warning("zpool status timed out")
+        except Exception as e:
+            logger.error(f"Error checking ZFS pool status: {e}")
+
+        # Check ZFS pool usage
+        try:
+            result = subprocess.run(
+                ['zpool', 'list', '-Hp'],
+                stdout=subprocess.PIPE,
+                stderr=subprocess.PIPE,
+                text=True,
+                timeout=30
+            )
+            if result.returncode == 0:
+                for line in result.stdout.splitlines():
+                    parts = line.split('\t')
+                    if len(parts) >= 8:
+                        pool_name = parts[0]
+                        try:
+                            total_bytes = int(parts[1])
+                            used_bytes = int(parts[2])
+                            usage_pct = (used_bytes / total_bytes * 100) if total_bytes > 0 else 0
+                        except (ValueError, ZeroDivisionError):
+                            continue
+
+                        pool_info = {
+                            'name': pool_name,
+                            'total': self._convert_bytes(total_bytes),
+                            'used': self._convert_bytes(used_bytes),
+                            'usage_percent': round(usage_pct, 1),
+                            'health': parts[9] if len(parts) > 9 else 'UNKNOWN'
+                        }
+                        pbs_health['zfs_pools'].append(pool_info)
+
+                        if usage_pct >= self.CONFIG['PBS_ZFS_CRITICAL']:
+                            pbs_health['status'] = 'CRITICAL'
+                            pbs_health['issues'].append({
+                                'type': 'PBS_ZFS_USAGE_CRITICAL',
+                                'severity': 'CRITICAL',
+                                'device': pool_name,
+                                'issue': f"ZFS pool '{pool_name}' usage critical: {usage_pct:.1f}%"
+                            })
+                        elif usage_pct >= self.CONFIG['PBS_ZFS_WARNING']:
+                            if pbs_health['status'] != 'CRITICAL':
+                                pbs_health['status'] = 'WARNING'
+                            pbs_health['issues'].append({
+                                'type': 'PBS_ZFS_USAGE_WARNING',
+                                'severity': 'WARNING',
+                                'device': pool_name,
+                                'issue': f"ZFS pool '{pool_name}' usage high: {usage_pct:.1f}%"
+                            })
+        except subprocess.TimeoutExpired:
+            logger.warning("zpool list timed out")
+        except Exception as e:
+            logger.error(f"Error checking ZFS pool usage: {e}")
+
+        # Check failed PBS tasks (requires proxmox-backup-manager)
+        if self._available_tools.get('proxmox-backup-manager'):
+            try:
+                result = subprocess.run(
+                    ['proxmox-backup-manager', 'task', 'list', '--output-format', 'json'],
+                    stdout=subprocess.PIPE,
+                    stderr=subprocess.PIPE,
+                    text=True,
+                    timeout=30
+                )
+                if result.returncode == 0:
+                    try:
+                        tasks = json.loads(result.stdout)
+                        for task in tasks:
+                            task_status = task.get('status', '')
+                            task_type = task.get('worker_type', '')
+                            task_id = task.get('worker_id', '')
+
+                            if task_status and task_status != 'OK':
+                                failed_task = {
+                                    'type': task_type,
+                                    'id': task_id,
+                                    'status': task_status,
+                                    'starttime': task.get('starttime', ''),
+                                    'endtime': task.get('endtime', '')
+                                }
+                                pbs_health['failed_tasks'].append(failed_task)
+
+                                # Categorize by task type
+                                if 'backup' in task_type.lower():
+                                    issue_type = 'PBS_BACKUP_FAILED'
+                                    severity = 'CRITICAL'
+                                elif 'gc' in task_type.lower() or 'garbage' in task_type.lower():
+                                    issue_type = 'PBS_GC_FAILED'
+                                    severity = 'WARNING'
+                                elif 'sync' in task_type.lower():
+                                    issue_type = 'PBS_SYNC_FAILED'
+                                    severity = 'WARNING'
+                                else:
+                                    issue_type = 'PBS_BACKUP_FAILED'
+                                    severity = 'WARNING'
+
+                                pbs_health['issues'].append({
+                                    'type': issue_type,
+                                    'severity': severity,
+                                    'device': f"task-{task_type}",
+                                    'issue': f"PBS {task_type} failed: {task_id} - {task_status}"
+                                })
+
+                                if severity == 'CRITICAL':
+                                    pbs_health['status'] = 'CRITICAL'
+                                elif pbs_health['status'] == 'OK':
+                                    pbs_health['status'] = 'WARNING'
+                    except json.JSONDecodeError as e:
+                        logger.warning(f"Failed to parse PBS task list JSON: {e}")
+            except subprocess.TimeoutExpired:
+                logger.warning("proxmox-backup-manager task list timed out")
+            except Exception as e:
+                logger.error(f"Error checking PBS tasks: {e}")
+
+        return pbs_health
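The ZFS usage check above relies on `zpool list -Hp` emitting tab-separated, script-friendly output (`-H` drops headers, `-p` gives exact byte values). The parsing and threshold logic can be sketched in isolation; the sample line below is illustrative, not captured from a real pool, and the 80/90 defaults stand in for `PBS_ZFS_WARNING`/`PBS_ZFS_CRITICAL`:

```python
def parse_zpool_list(output: str, warn_pct: float = 80.0, crit_pct: float = 90.0):
    """Parse `zpool list -Hp` output (name, size, alloc, ... , health) into dicts."""
    pools = []
    for line in output.splitlines():
        parts = line.split('\t')
        if len(parts) < 8:
            continue
        try:
            total, used = int(parts[1]), int(parts[2])
        except ValueError:
            continue
        pct = used / total * 100 if total else 0.0
        pools.append({
            'name': parts[0],
            'usage_percent': round(pct, 1),
            'health': parts[9] if len(parts) > 9 else 'UNKNOWN',
            'level': 'CRITICAL' if pct >= crit_pct else 'WARNING' if pct >= warn_pct else 'OK',
        })
    return pools

# Illustrative sample row: tank, 1 TB total, 850 GB allocated, ONLINE.
sample = "tank\t1000000000000\t850000000000\t150000000000\t-\t-\t12\t85\t1.00\tONLINE\t-"
print(parse_zpool_list(sample))
```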
     # =============================================================================
     # PROMETHEUS METRICS EXPORT
     # =============================================================================
@@ -3148,9 +3565,11 @@ class SystemHealthMonitor:
         hostname = health_report.get('hostname', socket.gethostname())
         metrics = []
 
-        # Helper to format labels
+        # Helper to format labels with proper Prometheus escaping
         def labels(**kwargs) -> str:
-            pairs = [f'{k}="{v}"' for k, v in kwargs.items() if v is not None]
+            def escape(value):
+                return str(value).replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')
+            pairs = [f'{k}="{escape(v)}"' for k, v in kwargs.items() if v is not None]
             return '{' + ','.join(pairs) + '}' if pairs else ''
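The three replacements in `escape` are exactly what the Prometheus text exposition format requires inside label values: backslash, double quote, and newline, in that order (backslash first, so the later escapes aren't double-escaped). Extracted as a standalone sketch:

```python
def escape(value):
    """Escape a label value per the Prometheus text exposition format."""
    return str(value).replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')

def labels(**kwargs) -> str:
    """Render keyword args as a {k="v",...} label set, skipping None values."""
    pairs = [f'{k}="{escape(v)}"' for k, v in kwargs.items() if v is not None]
    return '{' + ','.join(pairs) + '}' if pairs else ''

print(labels(hostname='pve1', device='/dev/sda'))
# {hostname="pve1",device="/dev/sda"}
print(labels(pool='tank "backup"'))
# {pool="tank \"backup\""}
```

Without this, a device model containing a quote would break the line for any scraper parsing the `/metrics` output.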
         # === System Info ===
@@ -3275,6 +3694,18 @@ class SystemHealthMonitor:
                 usage = fs.get('usage_percent', 0)
                 metrics.append(f'hwmon_lxc_storage_usage_percent{labels(hostname=hostname, vmid=vmid, mountpoint=mountpoint)} {usage}')
+        # === PBS Metrics ===
+        pbs = health_report.get('pbs_health', {})
+        if pbs.get('is_pbs_node'):
+            metrics.append(f'# HELP hwmon_pbs_zfs_usage_percent PBS ZFS pool usage percentage')
+            metrics.append(f'# TYPE hwmon_pbs_zfs_usage_percent gauge')
+            for pool in pbs.get('zfs_pools', []):
+                metrics.append(f'hwmon_pbs_zfs_usage_percent{labels(hostname=hostname, pool=pool["name"])} {pool["usage_percent"]}')
+            metrics.append(f'# HELP hwmon_pbs_failed_tasks_total PBS failed task count')
+            metrics.append(f'# TYPE hwmon_pbs_failed_tasks_total gauge')
+            metrics.append(f'hwmon_pbs_failed_tasks_total{labels(hostname=hostname)} {len(pbs.get("failed_tasks", []))}')
+
         # === Issue Summary Metrics ===
         metrics.append(f'# HELP hwmon_issues_total Total number of issues detected')
         metrics.append(f'# TYPE hwmon_issues_total gauge')
@@ -3282,7 +3713,8 @@ class SystemHealthMonitor:
         system_issues = len(health_report.get('system_health', {}).get('issues', []))
         ceph_issues = len(ceph.get('issues', [])) + len(ceph.get('cluster_wide_issues', []))
         lxc_issues = len(lxc.get('issues', []))
-        total_issues = system_issues + ceph_issues + lxc_issues
+        pbs_issues = len(pbs.get('issues', []))
+        total_issues = system_issues + ceph_issues + lxc_issues + pbs_issues
         metrics.append(f'hwmon_issues_total{labels(hostname=hostname)} {total_issues}')
 
         return '\n'.join(metrics) + '\n'
@@ -3369,6 +3801,10 @@ class SystemHealthMonitor:
             'issues': []
         }
 
+        if not self._available_tools.get('pct'):
+            logger.debug("pct not available - not a PVE node or pve-container not installed")
+            return lxc_health
+
         try:
             result = subprocess.run(
                 ['pct', 'list'],
@@ -3409,32 +3845,38 @@ class SystemHealthMonitor:
                     if not fs_line.strip() or 'MP' in fs_line:
                         continue
 
-                    # Fix: Use fs_line instead of line, and columns consistently
-                    columns = fs_line.split()
-
-                    if len(columns) >= 6:
-                        try:
-                            # Skip excluded mounts by checking the first column
-                            if columns[0].startswith('appPool:') or '/mnt/pve/mediaf' in columns[1]:
-                                continue
-
-                            # Get the mountpoint (last column)
-                            mountpoint = columns[-1]
+                    # Parse df output using regex for reliable column extraction
+                    match = re.match(
+                        r'(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+\.?\d*)%?\s+(.*)',
+                        fs_line.strip()
+                    )
+                    if not match:
+                        logger.debug(f"Could not parse filesystem line: {fs_line}")
+                        continue
+
+                    pool, device_col, total_str, used_str, avail_str, percent_str, mountpoint = match.groups()
+
+                    try:
+                        # Skip excluded mounts
+                        if pool.startswith('appPool:') or '/mnt/pve/mediaf' in device_col:
+                            continue
+
+                        mountpoint = mountpoint.strip()
 
                         # Skip excluded mountpoints
                         if self._is_excluded_mount(mountpoint):
                             logger.debug(f"Skipping excluded mount: {mountpoint}")
                             continue
 
-                            # Parse size values safely - use correct column indices
-                            total_space = self._parse_size(columns[2])   # 3rd column
-                            used_space = self._parse_size(columns[3])    # 4th column
-                            available_space = self._parse_size(columns[4])  # 5th column
+                        # Parse size values from regex groups
+                        total_space = self._parse_size(total_str)
+                        used_space = self._parse_size(used_str)
+                        available_space = self._parse_size(avail_str)
 
-                            # Parse percentage safely
-                            try:
-                                usage_percent = float(columns[5].rstrip('%'))  # 6th column
-                            except (ValueError, IndexError):
+                        # Parse percentage from regex group
+                        try:
+                            usage_percent = float(percent_str)
+                        except ValueError:
                             # Calculate percentage if parsing fails
                             usage_percent = (used_space / total_space * 100) if total_space > 0 else 0
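The regex replaces positional `split()` indexing, which broke whenever a column (such as a mountpoint with spaces) shifted the field positions. A standalone sketch of the same pattern against an illustrative `pct df`-style row (the sample line is made up, not real output):

```python
import re

# Groups: pool, device, total, used, avail, percent, mountpoint (same pattern as above).
DF_LINE = re.compile(r'(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+\.?\d*)%?\s+(.*)')

line = "rootfs /dev/mapper/vg-root 8.0G 3.2G 4.8G 40% /"
m = DF_LINE.match(line.strip())
if m:
    pool, device, total, used, avail, pct, mount = m.groups()
    print(pool, pct, mount)  # rootfs 40 /
```

The trailing `(.*)` captures the mountpoint greedily, so paths containing spaces survive intact instead of being truncated at the first whitespace.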
@@ -3501,13 +3943,28 @@ def main():
         metavar="FILE",
         help="Export health report to JSON file."
     )
+    parser.add_argument(
+        "-v", "--verbose",
+        action="store_true",
+        help="Enable verbose (DEBUG) logging output."
+    )
+    parser.add_argument(
+        "--health-server",
+        action="store_true",
+        help="Start HTTP health check endpoint (default port 9102)."
+    )
     args = parser.parse_args()
 
     monitor = SystemHealthMonitor(
         ticket_api_url=SystemHealthMonitor.CONFIG['TICKET_API_URL'],
-        dry_run=args.dry_run
+        dry_run=args.dry_run,
+        verbose=args.verbose
     )
 
+    # Start health server if requested via CLI or .env
+    if args.health_server or monitor.CONFIG.get('HEALTH_SERVER_ENABLED', False):
+        monitor._start_health_server()
+
     if args.metrics:
         # Just output metrics to stdout
         health_report = monitor.perform_health_checks()
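The body of `_start_health_server` is outside this hunk; per the commit message it serves a JSON `/health` endpoint on port 9102 from a daemon thread. A minimal standalone sketch of that shape (function name, handler details, and the `'status': 'ok'` payload here are assumptions, not the shipped code):

```python
import json
import socket
import threading
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer

def start_health_server(port: int = 9102) -> HTTPServer:
    """Serve GET /health as JSON from a daemon thread; returns the server."""
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != '/health':
                self.send_error(404)
                return
            body = json.dumps({
                'status': 'ok',
                'hostname': socket.gethostname(),
                'last_check': datetime.now(timezone.utc).isoformat(),
            }).encode()
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the monitor's own logs clean

    server = HTTPServer(('', port), Handler)
    # Daemon thread: the endpoint dies with the main process, no cleanup needed.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Once running, `curl http://localhost:9102/health` returns the JSON status document, which is what an external uptime checker would poll.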