Compare commits


17 Commits

Author SHA1 Message Date
d1750ea6cf Add Proxmox Backup Server (PBS) health monitoring support
Monitors ZFS pool status/usage and failed PBS tasks (backup, GC, sync).
Includes configurable thresholds (PBS_ZFS_WARNING/CRITICAL), Prometheus
metrics (hwmon_pbs_*), dry-run summary, issue categorization, and
priority classification. Enabled via PBS_ENABLED=true in .env config.

Fixes: #5

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:18:41 -05:00
07782da7b6 Add HTTP health check endpoint on port 9102
Lightweight /health endpoint returns JSON with status, hostname, and
last check timestamp. Runs as daemon thread, activated via --health-server
flag or HEALTH_SERVER_ENABLED=true in .env config.

Fixes: #21

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:15:15 -05:00
b02e416117 Parallelize SMART health checks across drives with ThreadPoolExecutor
Runs SMART checks concurrently (up to 8 workers) instead of
sequentially, significantly reducing check time on multi-drive systems.
Results are collected and processed in original disk order.

Fixes: #22

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:13:50 -05:00
7b36255fb4 Add graceful degradation when external tools are missing
Checks availability of required (smartctl, lsblk) and optional (nvme,
ceph, pct, dmidecode) tools at startup. Guards all tool-dependent code
sections to skip gracefully with informative log messages instead of
crashing. Also fixes pre-existing indentation bug in LXC exception handler.

Fixes: #19

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:13:08 -05:00
92bca248ac Add deduplication clarification comments for Ceph ticket handling
Explains that the ticket API deduplicates using SHA-256 hash of
(category + tags + hostname + device), not description/timestamp.
Clarifies the 24-hour dedup window and cluster-wide hostname exclusion.

Fixes: #18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:03:33 -05:00
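The dedup scheme this commit documents can be sketched as follows — a minimal illustration only; the function name, field ordering, and separator are assumptions, not taken from the API source. The key property is that the hostname is dropped for cluster-wide issues so every node produces the same key:

```python
import hashlib

def dedup_key(category: str, tags: str, hostname: str, device: str) -> str:
    """Build a SHA-256 dedup key from the fields the ticket API hashes.

    Hypothetical sketch: for [cluster-wide] issues the hostname is
    excluded so that every node in the cluster produces the same key.
    Description text and timestamps are deliberately not hashed.
    """
    host = '' if '[cluster-wide]' in tags else hostname
    raw = f"{category}|{tags}|{host}|{device}"
    return hashlib.sha256(raw.encode()).hexdigest()

# Two nodes reporting the same cluster-wide issue collide on purpose:
a = dedup_key('Hardware', '[ceph][cluster-wide]', 'pve1', 'ceph-cluster')
b = dedup_key('Hardware', '[ceph][cluster-wide]', 'pve2', 'ceph-cluster')
assert a == b
```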
4a186fb6d6 Create replacement tickets for Ridata drives instead of silently skipping
Ridata drives are known unreliable hardware. Instead of skipping them
with no notification, flag them as REPLACEMENT_NEEDED and create
tickets recommending replacement.

Resolves #13

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:01:24 -05:00
90346a2da1 Replace fragile column-index LXC storage parsing with regex
Use regex pattern matching instead of split()[N] indexing for parsing
pct df output. This is more robust against variations in column
formatting and whitespace.

Resolves #11

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:00:50 -05:00
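The idea behind this change, sketched against a made-up pct df line (the real column layout and the exact pattern used in the script may differ): anchor a regex on the value you actually want rather than trusting a column index.

```python
import re

# Illustrative line from `pct df <vmid>` output (formatting assumed):
line = "rootfs  local-lvm:vm-101-disk-0  8589934592  4294967296  50.0%"

# Fragile: line.split()[4] breaks if a column shifts or gains whitespace.
# Robust: match the usage percentage at the end of the line directly.
match = re.search(r'(\d+(?:\.\d+)?)%\s*$', line)
if match:
    usage = float(match.group(1))  # 50.0
```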
308a8d5c5c Cache drive details to eliminate redundant smartctl calls
Add per-run cache for _get_drive_details() results. Each drive is
queried once via smartctl -i and the result is reused across SMART
health checks and ticket creation.

Resolves #15

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:00:25 -05:00
9f9cc1b763 Simplify disk detection to single lsblk call with full paths
Replace dual-method detection (lsblk + glob scanning) with a single
lsblk -p call that returns full device paths directly. Adds a timeout
and returns sorted results for consistency.

Resolves #14

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:59:49 -05:00
ab67d786ce Increase history storage limit to 50MB to match retention needs
With 50 drives checked hourly over 30 days, history data can reach
~36MB, which exceeded the old 10MB limit and caused constant file
churn. Increase the limit to 50MB and make it configurable via
HISTORY_MAX_BYTES in .env.

Resolves #12

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:59:24 -05:00
da2de4375e Add verbosity control with -v/--verbose flag
Change default log level from DEBUG to INFO to reduce noise during
hourly execution. Add --verbose/-v CLI flag to enable DEBUG logging
when needed for troubleshooting.

Resolves #16

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:58:43 -05:00
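The flag wiring is the standard argparse pattern; a minimal sketch under that assumption (the script's actual parser setup and handler names may differ):

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument('-v', '--verbose', action='store_true',
                    help='Enable DEBUG logging (default: INFO)')
args = parser.parse_args(['-v'])  # e.g. invoked as: ./monitor.py -v

# Default to INFO for quiet hourly runs; opt into DEBUG when debugging.
level = logging.DEBUG if args.verbose else logging.INFO
```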
38dd120da2 Add config validation for .env values
Wrap all int() conversions in try/except to handle malformed .env
values gracefully. Validate that TICKET_API_KEY is not empty or a
placeholder value, logging a warning instead of raising an exception
to preserve dry-run compatibility.

Resolves #17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:58:02 -05:00
7383a0c674 Escape special characters in Prometheus metric labels
Add escape function to sanitize backslashes, double quotes, and newlines
in label values per Prometheus text format spec. Prevents corrupted
metrics output from model names or paths containing these characters.

Resolves #10

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:57:37 -05:00
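Per the Prometheus text exposition format, label values must escape backslash, double quote, and newline. A minimal version of such a function (the name and call sites in this script are assumed, not copied from the diff):

```python
def escape_label_value(value: str) -> str:
    """Escape a label value per the Prometheus text format spec.

    Order matters: replace backslashes first so the backslashes
    introduced for quotes and newlines are not double-escaped.
    """
    return (value.replace('\\', '\\\\')
                 .replace('"', '\\"')
                 .replace('\n', '\\n'))

# e.g. a drive model string containing a quote:
label = escape_label_value('WDC WD40"EFRX')
```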
a3cf5a698f Add missing timeouts to all subprocess calls
Add timeout=30 to smartctl -i calls in _get_drive_details() and
_check_disk_firmware(), and dmidecode in _check_memory_usage().
Add TimeoutExpired handler in _check_disk_firmware(). Prevents
potential hangs when drives or system tools become unresponsive.

Resolves #9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:57:17 -05:00
c7309663de Fix NVMe temperature parsing bug for values > 99°C
Replace digits[:2] truncation with regex extraction of complete number.
Previously "123°C" would be parsed as 12 instead of 123.

Resolves #8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:37 -05:00
0559f2d668 Fix file locking race condition in SMART trend analysis
Check both file existence AND size > 0 before opening in r+ mode.
Previously, an empty file (0 bytes) would be opened in r+ mode, causing
json.load() to fail on the empty content.

Resolves #7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:21 -05:00
d79005eb42 Centralize hardcoded magic numbers into CONFIG dict
Move NEW_DRIVE_HOURS_THRESHOLD (720h) and SMART_ERROR_RECENT_HOURS (168h)
from inline literals to configurable CONFIG entries with .env support.

Resolves #20

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:00 -05:00


@@ -1,15 +1,16 @@
#!/usr/bin/env python3
import os, sys, json, requests, psutil, socket, subprocess, logging, argparse, urllib.request, re, glob, datetime, fcntl, textwrap
import os, sys, json, requests, psutil, socket, subprocess, logging, argparse, urllib.request, re, glob, datetime, fcntl, textwrap, shutil
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, Any, List
# =============================================================================
# LOGGING SETUP
# =============================================================================
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.DEBUG)
console_handler.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
console_handler.setFormatter(formatter)
@@ -78,7 +79,16 @@ class SystemHealthMonitor:
'CEPH_USAGE_CRITICAL': PRIORITIES['HIGH'], # P2 - Cluster near full
'CEPH_USAGE_WARNING': PRIORITIES['MEDIUM'], # P3 - Cluster usage high
'CEPH_PG_DEGRADED': PRIORITIES['HIGH'], # P2 - PGs degraded
'CEPH_MON_DOWN': PRIORITIES['HIGH'] # P2 - Monitor down
'CEPH_MON_DOWN': PRIORITIES['HIGH'], # P2 - Monitor down
# PBS (Proxmox Backup Server) issues
'PBS_ZFS_DEGRADED': PRIORITIES['CRITICAL'], # P1 - ZFS pool degraded
'PBS_ZFS_USAGE_CRITICAL': PRIORITIES['HIGH'], # P2 - ZFS pool near full
'PBS_ZFS_USAGE_WARNING': PRIORITIES['MEDIUM'],# P3 - ZFS pool usage high
'PBS_ZFS_ERRORS': PRIORITIES['HIGH'], # P2 - ZFS pool has errors
'PBS_BACKUP_FAILED': PRIORITIES['HIGH'], # P2 - Backup job failed
'PBS_GC_FAILED': PRIORITIES['MEDIUM'], # P3 - Garbage collection failed
'PBS_SYNC_FAILED': PRIORITIES['MEDIUM'] # P3 - Sync job failed
}
CONFIG = {
@@ -124,7 +134,19 @@ class SystemHealthMonitor:
# Prometheus metrics settings
'PROMETHEUS_ENABLED': False, # Enable Prometheus metrics export
'PROMETHEUS_PORT': 9101, # Port for Prometheus metrics HTTP server
'PROMETHEUS_TEXTFILE_PATH': None # Path for textfile collector (alternative to HTTP)
'PROMETHEUS_TEXTFILE_PATH': None, # Path for textfile collector (alternative to HTTP)
# SMART analysis thresholds
'NEW_DRIVE_HOURS_THRESHOLD': 720, # Hours to consider a drive "new" (~30 days)
'SMART_ERROR_RECENT_HOURS': 168, # Hours window for recent SMART errors (~1 week)
# Storage limits
'HISTORY_MAX_BYTES': 52428800, # 50MB max storage for history files
# Health check endpoint
'HEALTH_SERVER_ENABLED': False, # Enable HTTP health check endpoint
'HEALTH_SERVER_PORT': 9102, # Port for health check endpoint
# PBS (Proxmox Backup Server) monitoring
'PBS_ENABLED': False, # Enable PBS health monitoring
'PBS_ZFS_WARNING': 80, # ZFS pool usage warning threshold %
'PBS_ZFS_CRITICAL': 90 # ZFS pool usage critical threshold %
}
@classmethod
@@ -166,15 +188,24 @@ class SystemHealthMonitor:
cls.CONFIG['CEPH_TICKET_NODE'] = value if value else None
logger.info(f"✓ Loaded CEPH_TICKET_NODE: {value}")
elif key == 'CEPH_USAGE_WARNING':
cls.CONFIG['CEPH_USAGE_WARNING'] = int(value)
try:
cls.CONFIG['CEPH_USAGE_WARNING'] = int(value)
except ValueError:
logger.warning(f"Invalid CEPH_USAGE_WARNING value: {value}")
elif key == 'CEPH_USAGE_CRITICAL':
cls.CONFIG['CEPH_USAGE_CRITICAL'] = int(value)
try:
cls.CONFIG['CEPH_USAGE_CRITICAL'] = int(value)
except ValueError:
logger.warning(f"Invalid CEPH_USAGE_CRITICAL value: {value}")
# Prometheus settings
elif key == 'PROMETHEUS_ENABLED':
cls.CONFIG['PROMETHEUS_ENABLED'] = value.lower() in ('true', '1', 'yes')
logger.info(f"✓ Loaded PROMETHEUS_ENABLED: {cls.CONFIG['PROMETHEUS_ENABLED']}")
elif key == 'PROMETHEUS_PORT':
cls.CONFIG['PROMETHEUS_PORT'] = int(value)
try:
cls.CONFIG['PROMETHEUS_PORT'] = int(value)
except ValueError:
logger.warning(f"Invalid PROMETHEUS_PORT value: {value}")
elif key == 'PROMETHEUS_TEXTFILE_PATH':
cls.CONFIG['PROMETHEUS_TEXTFILE_PATH'] = value if value else None
logger.info(f"✓ Loaded PROMETHEUS_TEXTFILE_PATH: {value}")
@@ -182,10 +213,53 @@ class SystemHealthMonitor:
elif key == 'CLUSTER_NAME':
cls.CONFIG['CLUSTER_NAME'] = value if value else 'proxmox-cluster'
logger.info(f"✓ Loaded CLUSTER_NAME: {value}")
elif key == 'NEW_DRIVE_HOURS_THRESHOLD':
try:
cls.CONFIG['NEW_DRIVE_HOURS_THRESHOLD'] = int(value)
except ValueError:
logger.warning(f"Invalid NEW_DRIVE_HOURS_THRESHOLD value: {value}")
elif key == 'SMART_ERROR_RECENT_HOURS':
try:
cls.CONFIG['SMART_ERROR_RECENT_HOURS'] = int(value)
except ValueError:
logger.warning(f"Invalid SMART_ERROR_RECENT_HOURS value: {value}")
elif key == 'HISTORY_MAX_BYTES':
try:
cls.CONFIG['HISTORY_MAX_BYTES'] = int(value)
except ValueError:
logger.warning(f"Invalid HISTORY_MAX_BYTES value: {value}")
# PBS settings
elif key == 'PBS_ENABLED':
cls.CONFIG['PBS_ENABLED'] = value.lower() in ('true', '1', 'yes')
logger.info(f"✓ Loaded PBS_ENABLED: {cls.CONFIG['PBS_ENABLED']}")
elif key == 'PBS_ZFS_WARNING':
try:
cls.CONFIG['PBS_ZFS_WARNING'] = int(value)
except ValueError:
logger.warning(f"Invalid PBS_ZFS_WARNING value: {value}")
elif key == 'PBS_ZFS_CRITICAL':
try:
cls.CONFIG['PBS_ZFS_CRITICAL'] = int(value)
except ValueError:
logger.warning(f"Invalid PBS_ZFS_CRITICAL value: {value}")
# Health server settings
elif key == 'HEALTH_SERVER_ENABLED':
cls.CONFIG['HEALTH_SERVER_ENABLED'] = value.lower() in ('true', '1', 'yes')
logger.info(f"✓ Loaded HEALTH_SERVER_ENABLED: {cls.CONFIG['HEALTH_SERVER_ENABLED']}")
elif key == 'HEALTH_SERVER_PORT':
try:
cls.CONFIG['HEALTH_SERVER_PORT'] = int(value)
except ValueError:
logger.warning(f"Invalid HEALTH_SERVER_PORT value: {value}")
except Exception as e:
logger.error(f"Failed to load .env file: {e}")
# Validate critical configuration
api_key = cls.CONFIG.get('TICKET_API_KEY')
if not api_key or api_key == 'your_api_key_here':
logger.warning("TICKET_API_KEY is not configured - ticket creation will fail (dry-run will still work)")
TICKET_TEMPLATES = {
'ACTION_TYPE': {
'AUTO': '[auto]',
@@ -607,13 +681,21 @@ class SystemHealthMonitor:
# =============================================================================
# INITIALIZATION
# =============================================================================
def __init__(self, ticket_api_url: str = 'http://10.10.10.45/create_ticket_api.php', dry_run: bool = False):
def __init__(self, ticket_api_url: str = 'http://10.10.10.45/create_ticket_api.php', dry_run: bool = False, verbose: bool = False):
"""
Initialize the system health monitor.
:param ticket_api_url: URL for the ticket creation API.
:param dry_run: If True, simulate API calls without sending requests.
:param verbose: If True, enable DEBUG-level logging output.
"""
# Set log verbosity
if verbose:
logger.setLevel(logging.DEBUG)
for handler in logger.handlers:
handler.setLevel(logging.DEBUG)
logger.debug("Verbose logging enabled")
# Load environment configuration first (API keys, etc.)
self.load_env_config()
@@ -623,13 +705,59 @@ class SystemHealthMonitor:
# Ensure history directory exists
os.makedirs(self.CONFIG['HISTORY_DIR'], exist_ok=True)
def _enforce_storage_limit(self, history_dir: str, max_bytes: int = 10485760):
# Drive details cache (per-run, cleared on next execution)
self._drive_details_cache = {}
# Health check tracking
self._last_check_timestamp = None
self._last_check_status = 'unknown'
# Check tool availability at startup
self._available_tools = self._check_tool_availability()
def _check_tool_availability(self) -> Dict[str, bool]:
"""Check which external tools are available on this system.
Returns a dict mapping tool names to availability booleans.
Logs warnings for missing required tools and info for missing optional tools.
"""
Delete oldest history files if directory exceeds size limit (default 10MB).
required_tools = {
'smartctl': 'smartmontools',
'lsblk': 'util-linux',
}
optional_tools = {
'nvme': 'nvme-cli',
'ceph': 'ceph-common',
'pct': 'pve-container',
'dmidecode': 'dmidecode',
'proxmox-backup-manager': 'proxmox-backup-server',
'zpool': 'zfsutils-linux',
}
availability = {}
for tool, package in required_tools.items():
available = shutil.which(tool) is not None
availability[tool] = available
if not available:
logger.warning(f"Required tool '{tool}' not found (install: apt install {package})")
for tool, package in optional_tools.items():
available = shutil.which(tool) is not None
availability[tool] = available
if not available:
logger.debug(f"Optional tool '{tool}' not found (install: apt install {package})")
return availability
def _enforce_storage_limit(self, history_dir: str, max_bytes: int = None):
"""
Delete oldest history files if directory exceeds size limit.
:param history_dir: Directory containing history files
:param max_bytes: Maximum directory size in bytes (default 10MB)
:param max_bytes: Maximum directory size in bytes (default from CONFIG)
"""
if max_bytes is None:
max_bytes = self.CONFIG.get('HISTORY_MAX_BYTES', 52428800)
if not os.path.exists(history_dir):
return
@@ -668,6 +796,45 @@ class SystemHealthMonitor:
except Exception as e:
logger.error(f"Error enforcing storage limit: {e}")
# =============================================================================
# HEALTH CHECK ENDPOINT
# =============================================================================
def _start_health_server(self):
"""Start a lightweight HTTP health check endpoint as a daemon thread."""
from http.server import HTTPServer, BaseHTTPRequestHandler
import threading
monitor = self
class HealthHandler(BaseHTTPRequestHandler):
def do_GET(self):
if self.path == '/health':
response = {
'status': monitor._last_check_status,
'hostname': socket.gethostname(),
'last_check': monitor._last_check_timestamp,
'uptime': datetime.datetime.now().isoformat()
}
self.send_response(200)
self.send_header('Content-Type', 'application/json')
self.end_headers()
self.wfile.write(json.dumps(response).encode())
else:
self.send_response(404)
self.end_headers()
def log_message(self, format, *args):
logger.debug(f"Health server: {format % args}")
port = self.CONFIG.get('HEALTH_SERVER_PORT', 9102)
try:
server = HTTPServer(('', port), HealthHandler)
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()
logger.info(f"Health check endpoint started on port {port}")
except OSError as e:
logger.warning(f"Could not start health server on port {port}: {e}")
# =============================================================================
# MAIN EXECUTION METHODS
# =============================================================================
@@ -677,6 +844,10 @@ class SystemHealthMonitor:
# Perform health checks and gather the report
health_report = self.perform_health_checks()
# Track last check for health endpoint
self._last_check_timestamp = datetime.datetime.now().isoformat()
self._last_check_status = health_report.get('drives_health', {}).get('overall_status', 'unknown')
# Create tickets for any detected critical issues
self._create_tickets_for_issues(health_report)
@@ -699,7 +870,8 @@ class SystemHealthMonitor:
'network_health': self._check_network_status(),
'ceph_health': self._check_ceph_health(),
'lxc_health': self._check_lxc_storage(),
'system_health': self._check_system_drive_indicators()
'system_health': self._check_system_drive_indicators(),
'pbs_health': self._check_pbs_health()
}
if self.dry_run:
@@ -755,6 +927,17 @@ class SystemHealthMonitor:
if health_report['system_health']['issues']:
logger.info(f"\nSystem Issues: {len(health_report['system_health']['issues'])} found")
# PBS status
pbs = health_report.get('pbs_health', {})
if pbs.get('is_pbs_node'):
logger.info("\nPBS Status:")
for pool in pbs.get('zfs_pools', []):
logger.info(f" ZFS Pool '{pool['name']}': {pool['usage_percent']}% used ({pool['used']}/{pool['total']})")
if pbs.get('failed_tasks'):
logger.info(f" Failed tasks: {len(pbs['failed_tasks'])}")
if pbs.get('issues'):
logger.info(f" Issues: {len(pbs['issues'])}")
logger.info("\n=== End Summary ===")
return health_report
@@ -776,7 +959,10 @@ class SystemHealthMonitor:
# Load historical data with file locking
history = []
file_mode = 'r+' if os.path.exists(historical_file) else 'w+'
if os.path.exists(historical_file) and os.path.getsize(historical_file) > 0:
file_mode = 'r+'
else:
file_mode = 'w+'
with open(historical_file, file_mode) as f:
# Acquire exclusive lock
@@ -1025,7 +1211,10 @@ class SystemHealthMonitor:
# DRIVE HEALTH CHECKING METHODS
# =============================================================================
def _get_drive_details(self, device: str) -> Dict[str, str]:
"""Get detailed drive information using smartctl."""
"""Get detailed drive information using smartctl (cached per run)."""
if device in self._drive_details_cache:
return self._drive_details_cache[device]
drive_details = {
'model': None,
'serial': None,
@@ -1041,7 +1230,8 @@ class SystemHealthMonitor:
['smartctl', '-i', device],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True
text=True,
timeout=30
)
# Check if smartctl failed completely
@@ -1079,6 +1269,7 @@ class SystemHealthMonitor:
except Exception as e:
logger.debug(f"Error getting drive details for {device}: {e}")
self._drive_details_cache[device] = drive_details
return drive_details
@@ -1501,7 +1692,9 @@ class SystemHealthMonitor:
'critical reallocated', 'critical current_pending',
'network is unreachable',
'osd is down', 'osd down', # Ceph OSD down
'cluster usage critical' # Ceph usage critical
'cluster usage critical', # Ceph usage critical
'zfs pool', 'backup failed', # PBS critical issues
'usage critical' # PBS ZFS critical usage
]):
return self.PRIORITIES['HIGH'] # P2
@@ -1520,7 +1713,8 @@ class SystemHealthMonitor:
'warning', 'high temperature', 'correctable ecc',
'trend alert', 'critical storage usage',
'low available_spare', 'high wear',
'health_warn', 'cluster usage warning' # Ceph warnings
'health_warn', 'cluster usage warning', # Ceph warnings
'gc failed', 'sync failed', 'usage high' # PBS warnings
]):
return self.PRIORITIES['MEDIUM'] # P3
@@ -1644,6 +1838,27 @@ class SystemHealthMonitor:
self.TICKET_TEMPLATES['TICKET_TYPE']['PROBLEM']
)
# PBS Issues - Backup server issues (categorized as Hardware for storage, Software for tasks)
if any(keyword in issue_lower for keyword in [
'pbs', 'zfs pool', 'backup failed', 'gc failed', 'sync failed'
]):
if any(error in issue_lower for error in [
'degraded', 'critical', 'failed', 'errors'
]):
return (
self.TICKET_CATEGORIES['HARDWARE'],
self.TICKET_TYPES['ISSUE'],
'[pbs]',
self.TICKET_TEMPLATES['TICKET_TYPE']['ISSUE']
)
else:
return (
self.TICKET_CATEGORIES['HARDWARE'],
self.TICKET_TYPES['PROBLEM'],
'[pbs]',
self.TICKET_TEMPLATES['TICKET_TYPE']['PROBLEM']
)
# Default: Hardware Problem (for undefined cases)
return (
self.TICKET_CATEGORIES['HARDWARE'],
@@ -1717,6 +1932,10 @@ class SystemHealthMonitor:
)
description = self._generate_detailed_description(issue, health_report, priority)
# NOTE: The ticket API (create_ticket_api.php) deduplicates using a SHA-256 hash of:
# issue_category + environment_tags + hostname (excluded for [cluster-wide]) + device
# Description content and timestamps are NOT included in the dedup hash.
# The 24-hour dedup window prevents duplicate tickets from multiple nodes or runs.
ticket_payload = {
"title": ticket_title,
"description": description,
@@ -1776,7 +1995,7 @@ class SystemHealthMonitor:
continue
# Only report issues for drives with valid SMART status
if drive.get('smart_issues') and drive.get('smart_status') in ['HEALTHY', 'UNHEALTHY', 'UNKNOWN']:
if drive.get('smart_issues') and drive.get('smart_status') in ['HEALTHY', 'UNHEALTHY', 'UNKNOWN', 'REPLACEMENT_NEEDED']:
# Filter out generic error messages and manufacturer-specific false positives
filtered_issues = []
for issue in drive['smart_issues']:
@@ -1840,7 +2059,8 @@ class SystemHealthMonitor:
designated_node = self.CONFIG.get('CEPH_TICKET_NODE')
# Cluster-wide issues: only create tickets from designated node (or first node if not set)
# The [cluster-wide] tag ensures deduplication in tinker_tickets API
# The [cluster-wide] tag + CLUSTER_NAME in ticket title ensures cross-node deduplication
# in the tinker_tickets API (dedup hash excludes hostname for cluster-wide issues)
if ceph_health.get('cluster_wide_issues'):
# If no designated node, all nodes can report (API deduplicates)
# If designated node is set, only that node creates tickets
@@ -1856,6 +2076,12 @@ class SystemHealthMonitor:
for issue in ceph_health['issues']:
issues.append(f"[ceph] {issue}")
# Check for PBS issues
pbs_health = health_report.get('pbs_health', {})
if pbs_health.get('is_pbs_node') and pbs_health.get('issues'):
for issue in pbs_health['issues']:
issues.append(f"[pbs] {issue.get('issue', str(issue))}")
logger.info("=== Issue Detection Started ===")
logger.info(f"Checking drives: {len(health_report['drives_health']['drives'])} found")
logger.info(f"Memory status: {health_report['memory_health']['status']}")
@@ -1886,42 +2112,26 @@ class SystemHealthMonitor:
# DISK AND STORAGE UTILITY METHODS
# =============================================================================
def _get_all_disks(self) -> List[str]:
"""Get all physical disks using multiple detection methods."""
"""Get all physical disks using lsblk with full device paths."""
disks = set()
# Method 1: Use lsblk to get physical disks, excluding virtual devices
try:
result = subprocess.run(
['lsblk', '-d', '-n', '-o', 'NAME,TYPE'],
['lsblk', '-d', '-n', '-o', 'NAME,TYPE', '-p'],
stdout=subprocess.PIPE,
text=True
text=True,
timeout=10
)
for line in result.stdout.strip().split('\n'):
if line:
parts = line.split()
if len(parts) >= 2:
name, device_type = parts[0], parts[1]
# Only include actual disks, exclude virtual devices
if device_type == 'disk' and not name.startswith('rbd'):
disks.add(f"/dev/{name}")
logger.debug(f"Physical disks found via lsblk: {disks}")
if len(parts) >= 2 and parts[1] == 'disk' and not parts[0].startswith('/dev/rbd'):
disks.add(parts[0])
logger.debug(f"Physical disks found: {disks}")
except subprocess.TimeoutExpired:
logger.error("lsblk timed out during disk detection")
except Exception as e:
logger.debug(f"lsblk detection failed: {e}")
# Method 2: Direct device scanning for physical devices only
for pattern in ['/dev/sd[a-z]', '/dev/nvme[0-9]n[0-9]']:
try:
import glob
matches = glob.glob(pattern)
# Filter out partitions (devices ending in numbers for sd*, already filtered for nvme)
if 'sd' in pattern:
matches = [d for d in matches if not d[-1].isdigit()]
disks.update(matches)
logger.debug(f"Disks found via glob {pattern}: {matches}")
except Exception as e:
logger.debug(f"Glob detection failed for {pattern}: {e}")
return list(disks)
logger.error(f"Failed to detect disks: {e}")
return sorted(disks)
def _is_excluded_mount(self, mountpoint: str) -> bool:
"""Check if a mountpoint should be excluded from monitoring."""
@@ -2055,7 +2265,8 @@ class SystemHealthMonitor:
['smartctl', '-i', device],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True
text=True,
timeout=30
)
model_line = None
@@ -2095,6 +2306,8 @@ class SystemHealthMonitor:
logger.debug(f"Known issues: {firmware_info['known_issues']}")
logger.debug("=== End Firmware Check ===\n")
except subprocess.TimeoutExpired:
logger.warning(f"smartctl -i timed out for {device}")
except Exception as e:
firmware_info['known_issues'].append(f"Error checking firmware: {str(e)}")
@@ -2266,7 +2479,7 @@ class SystemHealthMonitor:
def _is_new_drive(self, power_on_hours: int) -> bool:
"""Determine if a drive is considered "new" based on power-on hours."""
return power_on_hours < 720 # Less than 1 week of runtime
return power_on_hours < self.CONFIG['NEW_DRIVE_HOURS_THRESHOLD']
def _check_smart_health(self, device: str) -> Dict[str, Any]:
"""Enhanced SMART health check with better error handling and predictive analysis."""
@@ -2303,12 +2516,13 @@ class SystemHealthMonitor:
smart_health['issues'].append("Unable to read device information")
return smart_health
# Skip Ridata drives entirely - unreliable and being replaced
# Ridata drives - known unreliable hardware, flag for replacement
manufacturer = self._detect_manufacturer(drive_details.get('model', ''))
if manufacturer == 'Ridata':
smart_health['status'] = 'SKIPPED'
smart_health['issues'].append("Ridata drive - monitoring disabled (unreliable hardware)")
logger.debug(f"Skipping SMART monitoring for Ridata drive {device}")
smart_health['status'] = 'REPLACEMENT_NEEDED'
smart_health['severity'] = 'WARNING'
smart_health['issues'].append("Ridata drive detected - known unreliable hardware, replacement recommended")
logger.info(f"Ridata drive {device} flagged for replacement")
return smart_health
logger.debug(f"Drive details for {device}: {drive_details}")
@@ -2501,7 +2715,7 @@ class SystemHealthMonitor:
for match in error_matches:
error_hour = int(match.group(1))
current_hours = smart_health['attributes'].get('Power_On_Hours', 0)
if current_hours - error_hour < 168: # Errors within last week
if current_hours - error_hour < self.CONFIG['SMART_ERROR_RECENT_HOURS']:
recent_errors.append(match.group(0))
if recent_errors:
@@ -2540,8 +2754,8 @@ class SystemHealthMonitor:
logger.debug(f"Detected Issues: {smart_health['issues']}")
logger.debug("=== End SMART Check ===\n")
# Special handling for NVMe drives
if 'nvme' in device:
# Special handling for NVMe drives (requires nvme-cli)
if 'nvme' in device and self._available_tools.get('nvme'):
try:
nvme_result = subprocess.run(
['nvme', 'smart-log', device],
@@ -2564,10 +2778,10 @@ class SystemHealthMonitor:
temp_str = line.split(':')[1].strip() if ':' in line else line.strip()
logger.debug(f"Raw temperature string: {temp_str}")
# Extract first temperature value more safely
digits = ''.join(c for c in temp_str if c.isdigit())
if len(digits) >= 2:
temp_value = int(digits[:2])
# Extract the first complete number from temperature string
temp_match = re.search(r'(\d+)', temp_str)
if temp_match:
temp_value = int(temp_match.group(1))
logger.debug(f"Parsed temperature value: {temp_value}")
# Set both temperature fields
@@ -2608,6 +2822,10 @@ class SystemHealthMonitor:
'manufacturer_profile': None
}
if not self._available_tools.get('nvme'):
logger.debug(f"nvme-cli not available, skipping NVMe health check for {device}")
return smart_health
try:
# Use nvme-cli for NVMe devices
result = subprocess.run(
@@ -2674,6 +2892,11 @@ class SystemHealthMonitor:
"""Check health of all drives in the system."""
drives_health = {'overall_status': 'NORMAL', 'drives': []}
if not self._available_tools.get('smartctl') or not self._available_tools.get('lsblk'):
logger.warning("Drive health checks skipped: smartctl or lsblk not available")
drives_health['overall_status'] = 'UNKNOWN'
return drives_health
try:
# Get only valid physical disks
physical_disks = self._get_all_disks()
@@ -2698,6 +2921,20 @@ class SystemHealthMonitor:
device_partitions[base_dev] = []
device_partitions[base_dev].append(part)
# Run SMART checks in parallel across all drives
smart_results = {}
max_workers = min(8, len(physical_disks))
with ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = {executor.submit(self._check_smart_health, disk): disk for disk in physical_disks}
for future in as_completed(futures):
disk = futures[future]
try:
smart_results[disk] = future.result()
except Exception as e:
logger.error(f"SMART check failed for {disk}: {e}")
smart_results[disk] = {'status': 'ERROR', 'issues': [str(e)], 'temp': None, 'attributes': {}}
# Build drive reports in original disk order
overall_status = 'NORMAL'
for disk in physical_disks:
drive_report = {
@@ -2733,8 +2970,8 @@ class SystemHealthMonitor:
if total_space > 0:
drive_report['usage_percent'] = (total_used / total_space) * 100
# Check SMART health
smart_health = self._check_smart_health(disk)
# Use pre-fetched SMART results
smart_health = smart_results.get(disk, {'status': 'ERROR', 'issues': [], 'temp': None, 'attributes': {}})
drive_report.update({
'smart_status': smart_health['status'],
'smart_issues': smart_health['issues'],
@@ -2798,17 +3035,19 @@ class SystemHealthMonitor:
}
try:
# First check using dmidecode
result = subprocess.run(
['dmidecode', '--type', 'memory'],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True
)
if 'Error Correction Type: Multi-bit ECC' in result.stdout:
memory_health['has_ecc'] = True
# First check using dmidecode (if available)
if self._available_tools.get('dmidecode'):
result = subprocess.run(
['dmidecode', '--type', 'memory'],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
timeout=30
)
if 'Error Correction Type: Multi-bit ECC' in result.stdout:
memory_health['has_ecc'] = True
# If dmidecode didn't find ECC, try the edac method as backup
# If dmidecode unavailable or didn't find ECC, try the edac method as backup
if not memory_health['has_ecc']:
edac_path = '/sys/devices/system/edac/mc'
if os.path.exists(edac_path) and os.listdir(edac_path):
@@ -2945,8 +3184,6 @@ class SystemHealthMonitor:
Returns health status, cluster info, and any issues detected.
Cluster-wide issues use [cluster-wide] tag for cross-node deduplication.
"""
import shutil
ceph_health = {
'status': 'OK',
'is_ceph_node': False,
@@ -2955,7 +3192,7 @@ class SystemHealthMonitor:
'osd_status': [],
'mon_status': [],
'issues': [],
'cluster_wide_issues': [] # Issues that apply to entire cluster
'cluster_wide_issues': [] # Issues affecting entire cluster; use CLUSTER_NAME for dedup
}
# Check if Ceph monitoring is enabled
@@ -2964,7 +3201,7 @@ class SystemHealthMonitor:
return ceph_health
# Check if ceph CLI is available
if not shutil.which('ceph'):
if not self._available_tools.get('ceph'):
logger.debug("Ceph CLI not found - not a Ceph node")
return ceph_health
@@ -3131,6 +3368,186 @@ class SystemHealthMonitor:
return ceph_health
# =============================================================================
# PBS (PROXMOX BACKUP SERVER) HEALTH CHECKS
# =============================================================================
def _check_pbs_health(self) -> Dict[str, Any]:
"""
Check Proxmox Backup Server health including ZFS pools and task status.
Returns health status for ZFS pools, failed backup/GC/sync jobs.
Only active when PBS_ENABLED=true and relevant tools are available.
"""
pbs_health = {
'status': 'OK',
'is_pbs_node': False,
'zfs_pools': [],
'failed_tasks': [],
'issues': []
}
if not self.CONFIG.get('PBS_ENABLED', False):
logger.debug("PBS monitoring disabled in config")
return pbs_health
if not self._available_tools.get('zpool'):
logger.debug("zpool not available - skipping PBS ZFS checks")
return pbs_health
pbs_health['is_pbs_node'] = True
# Check ZFS pool status
try:
result = subprocess.run(
['zpool', 'status', '-p'],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
timeout=30
)
if result.returncode == 0:
current_pool = None
for line in result.stdout.splitlines():
line_stripped = line.strip()
if line_stripped.startswith('pool:'):
current_pool = line_stripped.split(':', 1)[1].strip()
elif line_stripped.startswith('state:') and current_pool:
state = line_stripped.split(':', 1)[1].strip()
if state != 'ONLINE':
pbs_health['status'] = 'CRITICAL'
pbs_health['issues'].append({
'type': 'PBS_ZFS_DEGRADED',
'severity': 'CRITICAL',
'device': current_pool,
'issue': f"ZFS pool '{current_pool}' state: {state}"
})
elif line_stripped.startswith('errors:') and current_pool:
if 'No known data errors' not in line_stripped:
pbs_health['issues'].append({
'type': 'PBS_ZFS_ERRORS',
'severity': 'WARNING',
'device': current_pool,
'issue': f"ZFS pool '{current_pool}' has errors: {line_stripped}"
})
except subprocess.TimeoutExpired:
logger.warning("zpool status timed out")
except Exception as e:
logger.error(f"Error checking ZFS pool status: {e}")
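The state parsing above keys off the `pool:` and `state:` line prefixes in `zpool status` output. A standalone sketch of the same loop, run against illustrative output (the pool name and status text below are made up, not captured from a real system):

```python
# Sketch of the `zpool status` line parsing above; sample output is
# illustrative only.
sample = """\
  pool: tank
 state: DEGRADED
status: One or more devices could not be used.
errors: No known data errors
"""

def parse_pool_states(text):
    states = {}
    current_pool = None
    for line in text.splitlines():
        s = line.strip()
        if s.startswith('pool:'):
            current_pool = s.split(':', 1)[1].strip()
        elif s.startswith('state:') and current_pool:
            states[current_pool] = s.split(':', 1)[1].strip()
    return states

print(parse_pool_states(sample))  # {'tank': 'DEGRADED'}
```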
# Check ZFS pool usage
try:
result = subprocess.run(
['zpool', 'list', '-Hp'],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
timeout=30
)
if result.returncode == 0:
for line in result.stdout.splitlines():
parts = line.split('\t')
if len(parts) >= 8:
pool_name = parts[0]
try:
total_bytes = int(parts[1])
used_bytes = int(parts[2])
usage_pct = (used_bytes / total_bytes * 100) if total_bytes > 0 else 0
except (ValueError, ZeroDivisionError):
continue
pool_info = {
'name': pool_name,
'total': self._convert_bytes(total_bytes),
'used': self._convert_bytes(used_bytes),
'usage_percent': round(usage_pct, 1),
'health': parts[9] if len(parts) > 9 else 'UNKNOWN'
}
pbs_health['zfs_pools'].append(pool_info)
if usage_pct >= self.CONFIG['PBS_ZFS_CRITICAL']:
pbs_health['status'] = 'CRITICAL'
pbs_health['issues'].append({
'type': 'PBS_ZFS_USAGE_CRITICAL',
'severity': 'CRITICAL',
'device': pool_name,
'issue': f"ZFS pool '{pool_name}' usage critical: {usage_pct:.1f}%"
})
elif usage_pct >= self.CONFIG['PBS_ZFS_WARNING']:
if pbs_health['status'] != 'CRITICAL':
pbs_health['status'] = 'WARNING'
pbs_health['issues'].append({
'type': 'PBS_ZFS_USAGE_WARNING',
'severity': 'WARNING',
'device': pool_name,
'issue': f"ZFS pool '{pool_name}' usage high: {usage_pct:.1f}%"
})
except subprocess.TimeoutExpired:
logger.warning("zpool list timed out")
except Exception as e:
logger.error(f"Error checking ZFS pool usage: {e}")
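The `zpool list -Hp` parsing above relies on the default tab-separated column order (name, size, alloc, free, ..., health). A minimal sketch of that column handling, using an illustrative line with made-up byte counts:

```python
def parse_zpool_list_line(line):
    # Mirrors the column handling above: parts[1] is total size in bytes,
    # parts[2] is allocated bytes, parts[9] is pool health.
    parts = line.split('\t')
    if len(parts) < 8:
        return None
    try:
        total_bytes = int(parts[1])
        used_bytes = int(parts[2])
    except ValueError:
        return None
    usage_pct = (used_bytes / total_bytes * 100) if total_bytes > 0 else 0
    return {
        'name': parts[0],
        'usage_percent': round(usage_pct, 1),
        'health': parts[9] if len(parts) > 9 else 'UNKNOWN',
    }

# Illustrative line (values are made up, not real pool output):
sample = '\t'.join(['backup', '1000000000000', '800000000000',
                    '200000000000', '-', '-', '12', '80',
                    '1.00', 'ONLINE', '-'])
print(parse_zpool_list_line(sample))
# {'name': 'backup', 'usage_percent': 80.0, 'health': 'ONLINE'}
```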
# Check failed PBS tasks (requires proxmox-backup-manager)
if self._available_tools.get('proxmox-backup-manager'):
try:
result = subprocess.run(
['proxmox-backup-manager', 'task', 'list', '--output-format', 'json'],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
timeout=30
)
if result.returncode == 0:
try:
tasks = json.loads(result.stdout)
for task in tasks:
task_status = task.get('status', '')
task_type = task.get('worker_type', '')
task_id = task.get('worker_id', '')
if task_status and task_status != 'OK':
failed_task = {
'type': task_type,
'id': task_id,
'status': task_status,
'starttime': task.get('starttime', ''),
'endtime': task.get('endtime', '')
}
pbs_health['failed_tasks'].append(failed_task)
# Categorize by task type
if 'backup' in task_type.lower():
issue_type = 'PBS_BACKUP_FAILED'
severity = 'CRITICAL'
elif 'gc' in task_type.lower() or 'garbage' in task_type.lower():
issue_type = 'PBS_GC_FAILED'
severity = 'WARNING'
elif 'sync' in task_type.lower():
issue_type = 'PBS_SYNC_FAILED'
severity = 'WARNING'
else:
issue_type = 'PBS_BACKUP_FAILED'
severity = 'WARNING'
pbs_health['issues'].append({
'type': issue_type,
'severity': severity,
'device': f"task-{task_type}",
'issue': f"PBS {task_type} failed: {task_id} - {task_status}"
})
if severity == 'CRITICAL':
pbs_health['status'] = 'CRITICAL'
elif pbs_health['status'] == 'OK':
pbs_health['status'] = 'WARNING'
except json.JSONDecodeError as e:
logger.warning(f"Failed to parse PBS task list JSON: {e}")
except subprocess.TimeoutExpired:
logger.warning("proxmox-backup-manager task list timed out")
except Exception as e:
logger.error(f"Error checking PBS tasks: {e}")
return pbs_health
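The severity mapping for failed tasks can be read off as a pure function. A sketch mirroring the branches above (backup failures are CRITICAL; GC and sync failures are WARNING; unknown worker types fall through to WARNING):

```python
def categorize_pbs_task(task_type):
    # Mirrors the branch order above; 'backup' is checked first.
    t = task_type.lower()
    if 'backup' in t:
        return 'PBS_BACKUP_FAILED', 'CRITICAL'
    if 'gc' in t or 'garbage' in t:
        return 'PBS_GC_FAILED', 'WARNING'
    if 'sync' in t:
        return 'PBS_SYNC_FAILED', 'WARNING'
    return 'PBS_BACKUP_FAILED', 'WARNING'

print(categorize_pbs_task('backup'))              # ('PBS_BACKUP_FAILED', 'CRITICAL')
print(categorize_pbs_task('garbage_collection'))  # ('PBS_GC_FAILED', 'WARNING')
```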
# =============================================================================
# PROMETHEUS METRICS EXPORT
# =============================================================================
hostname = health_report.get('hostname', socket.gethostname())
metrics = []
# Helper to format labels with proper Prometheus escaping
def labels(**kwargs) -> str:
def escape(value):
return str(value).replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')
pairs = [f'{k}="{escape(v)}"' for k, v in kwargs.items() if v is not None]
return '{' + ','.join(pairs) + '}' if pairs else ''
# === System Info ===
usage = fs.get('usage_percent', 0)
metrics.append(f'hwmon_lxc_storage_usage_percent{labels(hostname=hostname, vmid=vmid, mountpoint=mountpoint)} {usage}')
# === PBS Metrics ===
pbs = health_report.get('pbs_health', {})
if pbs.get('is_pbs_node'):
metrics.append(f'# HELP hwmon_pbs_zfs_usage_percent PBS ZFS pool usage percentage')
metrics.append(f'# TYPE hwmon_pbs_zfs_usage_percent gauge')
for pool in pbs.get('zfs_pools', []):
metrics.append(f'hwmon_pbs_zfs_usage_percent{labels(hostname=hostname, pool=pool["name"])} {pool["usage_percent"]}')
metrics.append(f'# HELP hwmon_pbs_failed_tasks_total PBS failed task count')
metrics.append(f'# TYPE hwmon_pbs_failed_tasks_total gauge')
metrics.append(f'hwmon_pbs_failed_tasks_total{labels(hostname=hostname)} {len(pbs.get("failed_tasks", []))}')
# === Issue Summary Metrics ===
metrics.append(f'# HELP hwmon_issues_total Total number of issues detected')
metrics.append(f'# TYPE hwmon_issues_total gauge')
system_issues = len(health_report.get('system_health', {}).get('issues', []))
ceph_issues = len(ceph.get('issues', [])) + len(ceph.get('cluster_wide_issues', []))
lxc_issues = len(lxc.get('issues', []))
pbs_issues = len(pbs.get('issues', []))
total_issues = system_issues + ceph_issues + lxc_issues + pbs_issues
metrics.append(f'hwmon_issues_total{labels(hostname=hostname)} {total_issues}')
return '\n'.join(metrics) + '\n'
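The `escape` helper above follows the Prometheus text exposition format, where backslash, double quote, and newline must be escaped inside label values. A standalone sketch of the helper pair:

```python
def escape(value):
    # Order matters: escape backslashes first so the quote/newline
    # escapes added afterwards are not double-escaped.
    return str(value).replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')

def labels(**kwargs):
    pairs = [f'{k}="{escape(v)}"' for k, v in kwargs.items() if v is not None]
    return '{' + ','.join(pairs) + '}' if pairs else ''

print(labels(hostname='pve1', pool='tank "fast"'))
# {hostname="pve1",pool="tank \"fast\""}
```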
'issues': []
}
if not self._available_tools.get('pct'):
logger.debug("pct not available - not a PVE node or pve-container not installed")
return lxc_health
try:
result = subprocess.run(
['pct', 'list'],
if not fs_line.strip() or 'MP' in fs_line:
continue
# Parse df output using regex for reliable column extraction
match = re.match(
r'(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+\.?\d*)%?\s+(.*)',
fs_line.strip()
)
if not match:
logger.debug(f"Could not parse filesystem line: {fs_line}")
continue
pool, device_col, total_str, used_str, avail_str, percent_str, mountpoint = match.groups()
try:
# Skip excluded mounts
if pool.startswith('appPool:') or '/mnt/pve/mediaf' in device_col:
continue
mountpoint = mountpoint.strip()
# Skip excluded mountpoints
if self._is_excluded_mount(mountpoint):
logger.debug(f"Skipping excluded mount: {mountpoint}")
continue
# Parse size values from named regex groups
total_space = self._parse_size(total_str)
used_space = self._parse_size(used_str)
available_space = self._parse_size(avail_str)
# Parse percentage from regex group
try:
usage_percent = float(percent_str)
except ValueError:
# Calculate percentage if parsing fails
usage_percent = (used_space / total_space * 100) if total_space > 0 else 0
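The regex above expects seven whitespace-separated columns in the container's df output. A sketch applying it to an illustrative line (the volume name, sizes, and mountpoint below are made up; real `pct df` values may differ):

```python
import re

# The same pattern as above: pool, device, total, used, avail, percent,
# mountpoint.
DF_RE = re.compile(
    r'(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+\.?\d*)%?\s+(.*)'
)

# Illustrative line, not captured from a real container:
line = 'rootfs local-lvm:vm-101-disk-0 8.0G 6.1G 1.9G 76.2% /'
m = DF_RE.match(line.strip())
pool, device, total, used, avail, pct, mountpoint = m.groups()
print(pool, pct, mountpoint)  # rootfs 76.2 /
```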
lxc_health['issues'].append(issue)
logger.debug(f"Filesystem details: {filesystem}")
except Exception as e:
logger.debug(f"Error processing line: {str(e)}")
logger.debug(f"Full exception: {repr(e)}")
continue
# Only add container info if we have filesystem data
if container_info['filesystems']:
metavar="FILE",
help="Export health report to JSON file."
)
parser.add_argument(
"-v", "--verbose",
action="store_true",
help="Enable verbose (DEBUG) logging output."
)
parser.add_argument(
"--health-server",
action="store_true",
help="Start HTTP health check endpoint (default port 9102)."
)
args = parser.parse_args()
monitor = SystemHealthMonitor(
ticket_api_url=SystemHealthMonitor.CONFIG['TICKET_API_URL'],
dry_run=args.dry_run,
verbose=args.verbose
)
# Start health server if requested via CLI or .env
if args.health_server or monitor.CONFIG.get('HEALTH_SERVER_ENABLED', False):
monitor._start_health_server()
if args.metrics:
# Just output metrics to stdout
health_report = monitor.perform_health_checks()