Monitors ZFS pool status/usage and failed PBS tasks (backup, GC, sync).
Includes configurable thresholds (PBS_ZFS_WARNING/CRITICAL), Prometheus
metrics (hwmon_pbs_*), dry-run summary, issue categorization, and
priority classification. Enabled via PBS_ENABLED=true in .env config.
Fixes: #5
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Lightweight /health endpoint returns JSON with status, hostname, and
last check timestamp. Runs as daemon thread, activated via --health-server
flag or HEALTH_SERVER_ENABLED=true in .env config.
Fixes: #21
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
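A minimal sketch of the /health endpoint described in the commit for #21, using only the standard library. The names `build_health_payload`, `HealthHandler`, `start_health_server`, and the shared `_state` dict are hypothetical, as are the JSON key names; the commit only states that status, hostname, and last check timestamp are returned.

```python
import json
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical shared state; the monitor would update it after each run.
_state = {"status": "ok", "last_check": None}


def build_health_payload(state):
    """Build the JSON-serializable body for /health (key names are assumed)."""
    return {
        "status": state["status"],
        "hostname": socket.gethostname(),
        "last_check": state["last_check"],
    }


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps(build_health_payload(_state)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep per-request lines out of the monitor's own log


def start_health_server(host="127.0.0.1", port=0):
    """Serve on a daemon thread so the server never blocks process exit."""
    server = HTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Passing port 0 lets the OS pick a free port; a real deployment would presumably read the port from configuration.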
Runs SMART checks concurrently (up to 8 workers) instead of
sequentially, significantly reducing check time on multi-drive systems.
Results are collected and processed in original disk order.
Fixes: #22
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
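The concurrent SMART checks from the commit for #22 could look roughly like the following. `run_smart_checks` and `check_one` are hypothetical names; the only details taken from the commit are the 8-worker cap and the in-order result handling.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # upper bound stated in the commit message


def run_smart_checks(disks, check_one):
    """Run check_one(disk) concurrently and return (disk, result) pairs
    in the same order as the input list."""
    if not disks:
        return []
    workers = min(MAX_WORKERS, len(disks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, regardless of
        # which worker finishes first.
        return list(zip(disks, pool.map(check_one, disks)))
```

Threads suit this workload because each check spends its time waiting on an external smartctl process rather than running Python code.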
Checks availability of required (smartctl, lsblk) and optional (nvme,
ceph, pct, dmidecode) tools at startup. Guards all tool-dependent code
sections to skip gracefully with informative log messages instead of
crashing. Also fixes pre-existing indentation bug in LXC exception handler.
Fixes: #19
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
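One way to implement the startup tool check from the commit for #19 is `shutil.which`. The function name `detect_tools`, the injectable `which` parameter, and the log levels are assumptions; the tool lists come from the commit message.

```python
import logging
import shutil

logger = logging.getLogger(__name__)

REQUIRED_TOOLS = ("smartctl", "lsblk")
OPTIONAL_TOOLS = ("nvme", "ceph", "pct", "dmidecode")


def detect_tools(which=shutil.which):
    """Return {tool_name: bool}; `which` is injectable so tests need no real binaries."""
    available = {}
    for tool in REQUIRED_TOOLS + OPTIONAL_TOOLS:
        available[tool] = which(tool) is not None
        if not available[tool]:
            level = logging.ERROR if tool in REQUIRED_TOOLS else logging.INFO
            logger.log(level, "%s not found; dependent checks will be skipped", tool)
    return available
```

Callers can then guard each section with a lookup such as `if tools["pct"]:` instead of letting a missing binary raise.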
Explains that the ticket API deduplicates using SHA-256 hash of
(category + tags + hostname + device), not description/timestamp.
Clarifies the 24-hour dedup window and cluster-wide hostname exclusion.
Fixes: #18
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
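To illustrate the deduplication rule documented in the commit for #18, here is a sketch of a key built from (category + tags + hostname + device). The function name, the "|" separator, the tag sorting, and the way hostname is dropped for cluster-wide issues are all assumptions; the ticket API's actual byte layout is not given in the commit.

```python
import hashlib


def dedup_key(category, tags, hostname, device, cluster_wide=False):
    """SHA-256 over category, tags, hostname, and device.

    Description and timestamp are deliberately not part of the key.
    """
    # Assumption: cluster-wide issues omit the hostname so every node
    # reporting the same problem maps to one ticket.
    host_part = "" if cluster_wide else hostname
    # Assumption: tags are sorted so their order does not change the key.
    parts = [category, ",".join(sorted(tags)), host_part, device]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```

Under this scheme, two reports that differ only in description text produce the same key and would be treated as duplicates within the 24-hour window.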
Ridata drives are known to be unreliable hardware. Instead of silently
skipping them, flag them as REPLACEMENT_NEEDED and create tickets
recommending replacement.
Resolves #13
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use regex pattern matching instead of split()[N] indexing for parsing
pct df output. This is more robust against variations in column
formatting and whitespace.
Resolves #11
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
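The regex approach from the commit for #11 might be sketched as below. The column layout (mount point, volume, size, used, available, use percentage, path) and the sample rows in the usage note are assumptions about `pct df` output, not captured output; `PCT_DF_ROW` and `parse_pct_df` are hypothetical names.

```python
import re

# Assumed columns: MP, Volume, Size, Used, Avail, Use%, Path.
# The trailing "%" is optional so both "19.2" and "19.2%" are accepted.
PCT_DF_ROW = re.compile(
    r"^(?P<mp>\S+)\s+(?P<volume>\S+)\s+(?P<size>\S+)\s+(?P<used>\S+)\s+"
    r"(?P<avail>\S+)\s+(?P<use_pct>\d+(?:\.\d+)?)%?\s+(?P<path>/.*)$"
)


def parse_pct_df(output):
    """Return one dict per data row; header and malformed lines are skipped."""
    rows = []
    for line in output.splitlines():
        match = PCT_DF_ROW.match(line.strip())
        if match:
            row = match.groupdict()
            row["use_pct"] = float(row["use_pct"])
            rows.append(row)
    return rows
```

Unlike `split()[N]`, a line that does not fit the pattern is skipped rather than raising IndexError or yielding the wrong column.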
Add per-run cache for _get_drive_details() results. Each drive is
queried once via smartctl -i and the result is reused across SMART
health checks and ticket creation.
Resolves #15
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
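A per-run cache like the one in the commit for #15 can be as small as a dict keyed by device path. `DriveDetailsCache` and its `fetch` callback are hypothetical; in the real script the lookup lives around `_get_drive_details()`.

```python
class DriveDetailsCache:
    """Memoize drive details for the lifetime of one monitoring run."""

    def __init__(self, fetch):
        # `fetch` stands in for the function that shells out to `smartctl -i`.
        self._fetch = fetch
        self._cache = {}

    def get(self, device):
        if device not in self._cache:
            self._cache[device] = self._fetch(device)
        return self._cache[device]
```

Creating a fresh instance at the start of each run keeps the cache from serving stale details after a drive swap.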
Replace dual-method detection (lsblk + glob scanning) with single
lsblk -p call that returns full device paths directly. Adds timeout,
returns sorted results for consistency.
Resolves #14
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
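The single-call detection from the commit for #14 could be sketched as follows. The exact lsblk column selection, the 10-second timeout, and the function names are assumptions; only `-p`, the timeout itself, and the sorted output are mentioned in the commit.

```python
import subprocess


def parse_lsblk_disks(output):
    """Keep rows whose TYPE is 'disk' and return their paths, sorted."""
    disks = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == "disk":
            disks.append(fields[0])
    return sorted(disks)


def list_disks(timeout=10):
    """One lsblk call: -d skips partitions, -n drops the header, -p prints full paths."""
    result = subprocess.run(
        ["lsblk", "-d", "-n", "-p", "-o", "NAME,TYPE"],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return parse_lsblk_disks(result.stdout)
```

Splitting parsing from the subprocess call keeps the parser testable on machines without lsblk.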
With 50 drives checked hourly over 30 days, history data can reach ~36MB,
which exceeds the old 10MB limit and causes constant file churn. Increase
the limit to 50MB and make it configurable via HISTORY_MAX_BYTES in .env.
Resolves #12
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
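The ~36MB estimate in the commit for #12 can be reproduced with a short calculation. The per-record size of roughly 1,000 bytes is an assumption back-derived from that figure, and treating the limits as binary megabytes is also an assumption.

```python
DRIVES = 50
CHECKS_PER_DAY = 24        # hourly execution
RETENTION_DAYS = 30
BYTES_PER_RECORD = 1000    # assumption, inferred from the ~36MB total

records = DRIVES * CHECKS_PER_DAY * RETENTION_DAYS   # 36,000 records
estimated_bytes = records * BYTES_PER_RECORD         # 36,000,000 bytes

OLD_LIMIT = 10 * 1024 * 1024   # previous cap
NEW_LIMIT = 50 * 1024 * 1024   # new default for HISTORY_MAX_BYTES

exceeds_old = estimated_bytes > OLD_LIMIT
fits_new = estimated_bytes < NEW_LIMIT
```

With these numbers the estimate overshoots the old cap by more than three times while leaving headroom under the new default.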
Change default log level from DEBUG to INFO to reduce noise during
hourly execution. Add --verbose/-v CLI flag to enable DEBUG logging
when needed for troubleshooting.
Resolves #16
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
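The flag handling from the commit for #16 might look like this with argparse. `build_parser` and `log_level_for` are hypothetical helper names; the `--verbose`/`-v` flag and the INFO default come from the commit.

```python
import argparse
import logging


def build_parser():
    parser = argparse.ArgumentParser(description="hardware monitor (sketch)")
    parser.add_argument(
        "-v", "--verbose", action="store_true",
        help="enable DEBUG logging for troubleshooting",
    )
    return parser


def log_level_for(args):
    """INFO by default; DEBUG only when --verbose/-v is given."""
    return logging.DEBUG if args.verbose else logging.INFO
```

The result would typically be passed to `logging.basicConfig(level=...)` before any checks run.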
Wrap all int() conversions in try/except to handle malformed .env values
gracefully. Validate that TICKET_API_KEY is neither empty nor a placeholder
value, logging a warning instead of raising to preserve dry-run compatibility.
Resolves #17
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
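A sketch of the defensive parsing from the commit for #17. The helper names and the set of placeholder strings are hypothetical, and `env` is assumed to be a plain dict of strings loaded from .env.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical placeholder values; the real list is not given in the commit.
PLACEHOLDER_KEYS = {"", "changeme", "your_api_key_here"}


def env_int(env, name, default):
    """Return env[name] as an int, falling back to `default` on bad input."""
    raw = env.get(name)
    if raw is None:
        return default
    try:
        return int(raw.strip())
    except ValueError:
        logger.warning("Invalid integer for %s: %r; using %d", name, raw, default)
        return default


def api_key_is_usable(env):
    """Warn instead of raising so dry runs still work without a real key."""
    key = env.get("TICKET_API_KEY", "").strip()
    if key.lower() in PLACEHOLDER_KEYS:
        logger.warning("TICKET_API_KEY is empty or a placeholder")
        return False
    return True
```

For example, `env_int(env, "NEW_DRIVE_HOURS_THRESHOLD", 720)` returns 720 when the .env entry is missing or malformed.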
Add escape function to sanitize backslashes, double quotes, and newlines
in label values per Prometheus text format spec. Prevents corrupted
metrics output from model names or paths containing these characters.
Resolves #10
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
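The escaping rule from the commit for #10 follows the Prometheus text exposition format, where label values escape backslash, double quote, and line feed. The function names and the example metric name in the usage note are hypothetical.

```python
def escape_label_value(value):
    """Escape a label value for the Prometheus text format."""
    return (
        str(value)
        .replace("\\", "\\\\")   # backslash first, so later escapes are not doubled
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_metric(name, labels, value):
    """Render one sample line such as: name{key="value"} 1"""
    rendered = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
    )
    return f"{name}{{{rendered}}} {value}"
```

A call such as `format_metric("hwmon_example", {"model": 'WD "Red"'}, 1)` then yields a line whose inner quotes are backslash-escaped instead of terminating the label early.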
Add timeout=30 to the smartctl -i calls in _get_drive_details() and
_check_disk_firmware(), and to the dmidecode call in _check_memory_usage().
Add a TimeoutExpired handler in _check_disk_firmware(). Prevents
potential hangs when drives or system tools become unresponsive.
Resolves #9
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
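The timeout pattern from the commit for #9 can be sketched with `subprocess.run`. `run_with_timeout` is a hypothetical wrapper; the real code adds the timeout inside the functions named above.

```python
import subprocess


def run_with_timeout(cmd, timeout=30):
    """Return the command's stdout, or None if it hangs or cannot be started."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return None
    except OSError:
        # Covers a missing binary (FileNotFoundError) and similar launch failures.
        return None
    return result.stdout
```

A call like `run_with_timeout(["smartctl", "-i", "/dev/sda"])` then returns within about 30 seconds even if the drive stops responding.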
Replace digits[:2] truncation with regex extraction of complete number.
Previously "123°C" would be parsed as 12 instead of 123.
Resolves #8
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
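The parsing fix from the commit for #8 amounts to extracting the first complete run of digits. `parse_temperature` is a hypothetical name for the helper.

```python
import re


def parse_temperature(raw):
    """Return the first whole number in `raw` as an int, or None if there is none."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else None
```

With this version "123°C" parses as 123, where truncating to two digits produced 12.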
Check both file existence AND size > 0 before opening in r+ mode.
Previously, an empty file (0 bytes) would be opened in r+ mode, causing
json.load() to fail on the empty content.
Resolves #7
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
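The guard from the commit for #7 could be written as below. `load_history` is a hypothetical name, and returning an empty list for a missing or empty file is an assumption about how the caller wants to start fresh.

```python
import json
import os


def load_history(path):
    """Load stored history; treat a missing or zero-byte file as no history."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r+") as fh:
            return json.load(fh)
    # Assumption: callers start from an empty list when nothing usable is stored.
    return []
```

Checking the size before opening avoids the JSONDecodeError that `json.load()` raises on empty input.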
Move NEW_DRIVE_HOURS_THRESHOLD (720h) and SMART_ERROR_RECENT_HOURS (168h)
from inline literals to configurable CONFIG entries with .env support.
Resolves #20
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:00 -05:00
1 changed file with 574 additions and 117 deletions