Compare commits

17 Commits

Author SHA1 Message Date
d1750ea6cf Add Proxmox Backup Server (PBS) health monitoring support
Monitors ZFS pool status/usage and failed PBS tasks (backup, GC, sync).
Includes configurable thresholds (PBS_ZFS_WARNING/CRITICAL), Prometheus
metrics (hwmon_pbs_*), dry-run summary, issue categorization, and
priority classification. Enabled via PBS_ENABLED=true in .env config.

Fixes: #5

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:18:41 -05:00
07782da7b6 Add HTTP health check endpoint on port 9102
Lightweight /health endpoint returns JSON with status, hostname, and
last check timestamp. Runs as daemon thread, activated via --health-server
flag or HEALTH_SERVER_ENABLED=true in .env config.

Fixes: #21

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:15:15 -05:00
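
A minimal sketch of the kind of endpoint 07782da7b6 describes, using only the standard library. The names STATE, HealthHandler, and start_health_server are hypothetical, and binding to 127.0.0.1 is an assumption; the commit itself only states the port, the /health path, the three JSON fields, and that the server runs in a daemon thread.

```python
import json
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical shared state that the main check loop would update.
STATE = {"status": "ok", "last_check": None}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps({
            "status": STATE["status"],
            "hostname": socket.gethostname(),
            "last_check": STATE["last_check"],
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Suppress per-request logging to stderr.
        pass


def start_health_server(port=9102, host="127.0.0.1"):
    """Start the server in a daemon thread so it never blocks shutdown."""
    server = HTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the thread is a daemon, the process can exit without explicitly stopping the server; calling server.shutdown() is still possible for a clean stop.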
b02e416117 Parallelize SMART health checks across drives with ThreadPoolExecutor
Runs SMART checks concurrently (up to 8 workers) instead of
sequentially, significantly reducing check time on multi-drive systems.
Results are collected and processed in original disk order.

Fixes: #22

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:13:50 -05:00
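
The ordering guarantee mentioned in b02e416117 can be sketched as follows. check_all and its check_one parameter are hypothetical names; only the use of ThreadPoolExecutor, the limit of 8 workers, and the original-order processing come from the commit message.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(disks, check_one, max_workers=8):
    """Run check_one on every disk concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map() yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(zip(disks, pool.map(check_one, disks)))
```

Using map() rather than as_completed() is one simple way to keep results in disk order; the actual commit may collect futures differently.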
7b36255fb4 Add graceful degradation when external tools are missing
Checks availability of required (smartctl, lsblk) and optional (nvme,
ceph, pct, dmidecode) tools at startup. Guards all tool-dependent code
sections to skip gracefully with informative log messages instead of
crashing. Also fixes pre-existing indentation bug in LXC exception handler.

Fixes: #19

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:13:08 -05:00
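
One way to perform the startup check from 7b36255fb4 is shutil.which. This is a sketch under that assumption; the commit does not say how availability is detected, and find_tools is a hypothetical helper. The tool names are the ones listed in the commit message.

```python
import shutil

REQUIRED_TOOLS = ("smartctl", "lsblk")
OPTIONAL_TOOLS = ("nvme", "ceph", "pct", "dmidecode")


def find_tools(names, which=shutil.which):
    """Map each tool name to True if it is found on PATH."""
    return {name: which(name) is not None for name in names}


available = find_tools(REQUIRED_TOOLS + OPTIONAL_TOOLS)
for name in OPTIONAL_TOOLS:
    if not available[name]:
        # Skip the dependent checks instead of crashing later.
        print(f"{name} not found; skipping checks that depend on it")
```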
92bca248ac Add deduplication clarification comments for Ceph ticket handling
Explains that the ticket API deduplicates using SHA-256 hash of
(category + tags + hostname + device), not description/timestamp.
Clarifies the 24-hour dedup window and cluster-wide hostname exclusion.

Fixes: #18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:03:33 -05:00
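
The deduplication key described in 92bca248ac might look roughly like this. The field order, the "|" separator, and the tag sorting are assumptions made for illustration; the commit only states that the hash is SHA-256 over category, tags, hostname, and device, and that description and timestamp are not part of it.

```python
import hashlib


def dedup_key(category, tags, hostname, device):
    """Hypothetical key: SHA-256 over the four identifying fields only."""
    parts = [category, ",".join(sorted(tags)), hostname, device]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


# Two reports for the same drive produce the same key even if their
# descriptions or timestamps differ, because those are not hashed.
key_a = dedup_key("hardware", ["smart", "disk"], "node1", "/dev/sda")
key_b = dedup_key("hardware", ["disk", "smart"], "node1", "/dev/sda")
print(key_a == key_b)  # True
```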
4a186fb6d6 Create replacement tickets for Ridata drives instead of silently skipping
Ridata drives are known unreliable hardware. Instead of skipping them
with no notification, flag as REPLACEMENT_NEEDED and create tickets
recommending replacement.

Resolves #13

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:01:24 -05:00
90346a2da1 Replace fragile column-index LXC storage parsing with regex
Use regex pattern matching instead of split()[N] indexing for parsing
pct df output. This is more robust against variations in column
formatting and whitespace.

Resolves #11

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:00:50 -05:00
308a8d5c5c Cache drive details to eliminate redundant smartctl calls
Add per-run cache for _get_drive_details() results. Each drive is
queried once via smartctl -i and the result is reused across SMART
health checks and ticket creation.

Resolves #15

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:00:25 -05:00
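
A per-run cache like the one in 308a8d5c5c can be sketched with a plain dict. DriveDetailsCache and its fetch callable are hypothetical; in the real code the fetch step would be the smartctl -i call inside _get_drive_details().

```python
class DriveDetailsCache:
    """Remember the result of an expensive per-drive lookup for one run."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable taking a device path
        self._cache = {}

    def get(self, device):
        # Query each device at most once; later callers reuse the result.
        if device not in self._cache:
            self._cache[device] = self._fetch(device)
        return self._cache[device]
```

Creating a fresh cache object at the start of each run keeps the data from going stale between hourly executions.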
9f9cc1b763 Simplify disk detection to single lsblk call with full paths
Replace dual-method detection (lsblk + glob scanning) with single
lsblk -p call that returns full device paths directly. Adds timeout,
returns sorted results for consistency.

Resolves #14

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:59:49 -05:00
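
A sketch of single-call disk detection in the spirit of 9f9cc1b763. Only the -p flag, the timeout, and the sorted result come from the commit message; the additional lsblk options (-d, -n, -o NAME), the timeout value, and both function names are assumptions.

```python
import subprocess


def parse_lsblk_paths(output):
    """Turn one-path-per-line lsblk output into a sorted list."""
    return sorted(line.strip() for line in output.splitlines() if line.strip())


def list_disks(timeout=10):
    """Return full device paths of whole disks, e.g. /dev/sda."""
    result = subprocess.run(
        # -d: no partitions, -n: no header, -p: full paths, -o NAME: one column
        ["lsblk", "-d", "-n", "-p", "-o", "NAME"],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return parse_lsblk_paths(result.stdout)
```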
ab67d786ce Increase history storage limit to 50MB to match retention needs
With 50 drives checked hourly over 30 days, history data can reach ~36MB,
which exceeded the old 10MB limit and caused constant file churn. Increase
to 50MB and make configurable via HISTORY_MAX_BYTES in .env.

Resolves #12

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:59:24 -05:00
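
The ~36MB estimate in ab67d786ce can be reproduced with simple arithmetic. The figure of roughly 1 KB per history record is not stated in the commit; it is what the 36MB total implies and is labeled as an assumption below.

```python
drives = 50
checks_per_day = 24      # hourly checks
retention_days = 30

records = drives * checks_per_day * retention_days
bytes_per_record = 1000  # assumption: implied by the ~36MB figure
total_mb = records * bytes_per_record / 1_000_000

print(records, total_mb)  # 36000 36.0
```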
da2de4375e Add verbosity control with -v/--verbose flag
Change default log level from DEBUG to INFO to reduce noise during
hourly execution. Add --verbose/-v CLI flag to enable DEBUG logging
when needed for troubleshooting.

Resolves #16

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:58:43 -05:00
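
The flag from da2de4375e maps naturally onto argparse. This is a sketch, assuming the script uses argparse and the standard logging module; build_parser and log_level are hypothetical names.

```python
import argparse
import logging


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable DEBUG logging")
    return parser


def log_level(argv):
    """INFO by default, DEBUG when -v/--verbose is given."""
    args = build_parser().parse_args(argv)
    return logging.DEBUG if args.verbose else logging.INFO
```

A caller would typically pass the result to logging.basicConfig(level=...).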
38dd120da2 Add config validation for .env values
Wrap all int() conversions in try/except to handle malformed .env values
gracefully. Validate TICKET_API_KEY is not empty or placeholder value,
logging a warning instead of raising to preserve dry-run compatibility.

Resolves #17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:58:02 -05:00
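
The validation in 38dd120da2 can be sketched as two small helpers. env_int and api_key_is_usable are hypothetical names, and the placeholder strings are assumptions; the commit does not list which placeholder values it rejects.

```python
import logging

logger = logging.getLogger(__name__)


def env_int(values, key, default):
    """Read an integer setting, falling back to default if malformed."""
    raw = values.get(key)
    if raw is None:
        return default
    try:
        return int(raw)
    except (TypeError, ValueError):
        logger.warning("Invalid integer for %s=%r; using %s", key, raw, default)
        return default


def api_key_is_usable(key, placeholders=("", "changeme")):
    """Warn rather than raise so dry-run mode still works without a key."""
    if key is None or key.strip() in placeholders:
        logger.warning("TICKET_API_KEY is empty or a placeholder")
        return False
    return True
```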
7383a0c674 Escape special characters in Prometheus metric labels
Add escape function to sanitize backslashes, double quotes, and newlines
in label values per Prometheus text format spec. Prevents corrupted
metrics output from model names or paths containing these characters.

Resolves #10

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:57:37 -05:00
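
The escaping rules named in 7383a0c674 follow the Prometheus text exposition format: backslash becomes \\, double quote becomes \", and newline becomes \n. A minimal sketch, with escape_label_value as a hypothetical name for the helper:

```python
def escape_label_value(value):
    """Escape a label value for the Prometheus text format."""
    return (
        str(value)
        .replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


model = 'Disk "X"\n'
print(f'hwmon_example{{model="{escape_label_value(model)}"}} 1')
```

The metric name hwmon_example is illustrative only. Replacing backslashes before the other characters matters; otherwise the backslashes introduced by the quote and newline escapes would be doubled.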
a3cf5a698f Add missing timeouts to all subprocess calls
Add timeout=30 to smartctl -i calls in _get_drive_details() and
_check_disk_firmware(), and dmidecode in _check_memory_usage().
Add TimeoutExpired handler in _check_disk_firmware(). Prevents
potential hangs when drives or system tools become unresponsive.

Resolves #9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:57:17 -05:00
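
The pattern applied in a3cf5a698f is subprocess.run with a timeout plus a TimeoutExpired handler. run_with_timeout is a hypothetical wrapper shown for illustration; the commit adds the timeout directly to the existing calls.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout=30):
    """Return stdout, or None if the command does not finish in time."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so nothing hangs.
        return None
    return result.stdout
```

For example, run_with_timeout([sys.executable, "-c", "print('ok')"]) returns the command's output, while a command that sleeps past the timeout yields None.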
c7309663de Fix NVMe temperature parsing bug for values > 99°C
Replace digits[:2] truncation with regex extraction of complete number.
Previously "123°C" would be parsed as 12 instead of 123.

Resolves #8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:37 -05:00
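
The bug fixed in c7309663de can be reproduced in a few lines. Both function names are hypothetical, and the old behaviour is reconstructed from the commit message rather than copied from the code.

```python
import re


def parse_temp_truncating(text):
    """Old behaviour as described: keep only the first two digits."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits[:2]) if digits else None


def parse_temp_regex(text):
    """New behaviour as described: extract the complete first number."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(parse_temp_truncating("123°C"))  # 12
print(parse_temp_regex("123°C"))       # 123
```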
0559f2d668 Fix file locking race condition in SMART trend analysis
Check both file existence AND size > 0 before opening in r+ mode.
Previously, an empty file (0 bytes) would be opened in r+ mode, causing
json.load() to fail on the empty content.

Resolves #7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:21 -05:00
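
The guard added in 0559f2d668 can be sketched as below. load_history is a hypothetical name, returning an empty list for a missing or empty file is an assumption, and the file locking used by the real code is omitted for brevity.

```python
import json
import os


def load_history(path):
    """Parse the history file, treating a missing or 0-byte file as empty."""
    # Checking size as well as existence avoids calling json.load()
    # on empty content, which raises json.JSONDecodeError.
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r+") as fh:
            return json.load(fh)
    return []
```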
d79005eb42 Centralize hardcoded magic numbers into CONFIG dict
Move NEW_DRIVE_HOURS_THRESHOLD (720h) and SMART_ERROR_RECENT_HOURS (168h)
from inline literals to configurable CONFIG entries with .env support.

Resolves #20

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:00 -05:00
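
A sketch of how the two values from d79005eb42 might be centralized. load_config is hypothetical and takes any mapping of settings (for example, parsed .env contents); the defaults of 720 and 168 hours are the ones named in the commit message.

```python
DEFAULTS = {
    "NEW_DRIVE_HOURS_THRESHOLD": 720,  # 30 days
    "SMART_ERROR_RECENT_HOURS": 168,   # 7 days
}


def load_config(env):
    """Build CONFIG from defaults, overridden by values in env."""
    config = dict(DEFAULTS)
    for key in DEFAULTS:
        if key in env:
            config[key] = int(env[key])
    return config


CONFIG = load_config({})
print(CONFIG["NEW_DRIVE_HOURS_THRESHOLD"])  # 720
```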

File diff suppressed because it is too large