Add a lightweight /health endpoint that returns JSON with status, hostname,
and last-check timestamp. It runs on a daemon thread and is enabled via the
--health-server flag or HEALTH_SERVER_ENABLED=true in the .env config.
Fixes: #21
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
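A minimal sketch of how such a /health endpoint could look using only the standard library. The names `build_health_payload`, `HealthHandler`, and `start_health_server` are hypothetical, as is the loopback bind address; the daemon's actual handler may differ.

```python
import json
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_health_payload(last_check):
    """Return the JSON-serializable body: status, hostname, last check time."""
    return {
        "status": "ok",
        "hostname": socket.gethostname(),
        "last_check": last_check,
    }


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical shared state; the daemon would update it after each run.
    last_check = None

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps(build_health_payload(HealthHandler.last_check)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep per-request lines out of the daemon's own log


def start_health_server(port=0):
    """Serve on a daemon thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the thread is a daemon thread, it does not keep the process alive once the main monitoring run exits.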
Runs SMART checks concurrently (up to 8 workers) instead of
sequentially, significantly reducing check time on multi-drive systems.
Results are collected and processed in original disk order.
Fixes: #22
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
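The concurrent SMART checks described above can be sketched with a thread pool. `check_all_disks` and `check_fn` are illustrative names, not the daemon's own; the point is that `Executor.map` returns results in input order even when checks finish out of order.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all_disks(disks, check_fn, max_workers=8):
    """Run check_fn on every disk concurrently; results keep the input order."""
    if not disks:
        return []
    workers = min(max_workers, len(disks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order of `disks`, not completion order.
        return list(zip(disks, pool.map(check_fn, disks)))
```

Threads suit this workload because each check mostly waits on an external smartctl process rather than on Python code.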
Checks availability of required (smartctl, lsblk) and optional (nvme,
ceph, pct, dmidecode) tools at startup. Guards all tool-dependent code
sections to skip gracefully with informative log messages instead of
crashing. Also fixes pre-existing indentation bug in LXC exception handler.
Fixes: #19
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
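One way the startup tool check could be written, assuming `shutil.which` is used for lookup. The function name `check_tools` and the injectable `which` parameter are hypothetical and exist only to make the sketch testable.

```python
import shutil

REQUIRED_TOOLS = ("smartctl", "lsblk")
OPTIONAL_TOOLS = ("nvme", "ceph", "pct", "dmidecode")


def check_tools(which=shutil.which):
    """Return (availability map, list of missing required tools)."""
    available = {
        name: which(name) is not None
        for name in REQUIRED_TOOLS + OPTIONAL_TOOLS
    }
    missing_required = [name for name in REQUIRED_TOOLS if not available[name]]
    return available, missing_required
```

Tool-dependent sections can then test `available["ceph"]` (for example) and log a skip message instead of raising.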
Explains that the ticket API deduplicates using SHA-256 hash of
(category + tags + hostname + device), not description/timestamp.
Clarifies the 24-hour dedup window and cluster-wide hostname exclusion.
Fixes: #18
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
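To illustrate the dedup rule above: a sketch of a SHA-256 key over (category + tags + hostname + device). The separator, tag ordering, and the `cluster_wide` parameter are assumptions; the ticket API's exact input layout is not given here.

```python
import hashlib


def dedup_key(category, tags, hostname, device, cluster_wide=False):
    """Hash only the fields that identify an issue, never description or time."""
    # Assumed: cluster-wide issues drop the hostname so every node maps to one key.
    host_part = "" if cluster_wide else hostname
    raw = "|".join([category, ",".join(sorted(tags)), host_part, device])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Two reports with different descriptions but the same four fields produce the same key, which is why rewording a ticket body does not bypass deduplication.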
Ridata drives are known unreliable hardware. Instead of skipping them
with no notification, flag as REPLACEMENT_NEEDED and create tickets
recommending replacement.
Resolves #13
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use regex pattern matching instead of split()[N] indexing for parsing
pct df output. This is more robust against variations in column
formatting and whitespace.
Resolves #11
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
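A sketch of regex-based parsing for one pct df row. The column layout (MP, Volume, Size, Used, Avail, Use%, Path) and the sample line in the usage note are assumptions about the output format; `parse_pct_df_line` is a hypothetical name.

```python
import re

# Named groups tolerate any amount of whitespace between columns.
PCT_DF_LINE = re.compile(
    r"^(?P<mp>\S+)\s+(?P<volume>\S+)\s+(?P<size>\S+)\s+(?P<used>\S+)\s+"
    r"(?P<avail>\S+)\s+(?P<use_pct>\d+(?:\.\d+)?)%?\s+(?P<path>/.*)$"
)


def parse_pct_df_line(line):
    """Return a dict for a data row, or None for headers and malformed lines."""
    match = PCT_DF_LINE.match(line.strip())
    if match is None:
        return None
    row = match.groupdict()
    row["use_pct"] = float(row["use_pct"])
    return row
```

Unlike split()[N], a header row or a line with a missing column yields None rather than an IndexError or a silently shifted field.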
Add per-run cache for _get_drive_details() results. Each drive is
queried once via smartctl -i and the result is reused across SMART
health checks and ticket creation.
Resolves #15
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
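The per-run cache can be as small as a dict keyed by device path. `DriveDetailsCache` and its `fetch` callback are illustrative; in the daemon the lookup wraps _get_drive_details().

```python
class DriveDetailsCache:
    """Cache drive details for the lifetime of one monitoring run."""

    def __init__(self, fetch):
        self._fetch = fetch  # e.g. a function that shells out to `smartctl -i`
        self._cache = {}

    def get(self, device):
        # Query each device at most once; later callers reuse the result.
        if device not in self._cache:
            self._cache[device] = self._fetch(device)
        return self._cache[device]
```

Creating a fresh cache object per run avoids serving stale details after a drive swap.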
Replace dual-method detection (lsblk + glob scanning) with single
lsblk -p call that returns full device paths directly. Adds timeout,
returns sorted results for consistency.
Resolves #14
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
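A sketch of single-call disk detection with lsblk -p. The extra flags (-d, -n, -o NAME,TYPE), the 10-second timeout, and the function names are assumptions about how the call might be shaped; only -p and the sorted output come from the description above.

```python
import subprocess


def parse_lsblk(output):
    """Keep rows whose TYPE column is 'disk' and return sorted full paths."""
    disks = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "disk":
            disks.append(parts[0])
    return sorted(disks)


def list_disks(timeout=10):
    # -d: skip partitions, -n: no header, -p: full device paths
    result = subprocess.run(
        ["lsblk", "-d", "-n", "-p", "-o", "NAME,TYPE"],
        capture_output=True, text=True, timeout=timeout,
    )
    return parse_lsblk(result.stdout)
```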
With 50 drives checked hourly over 30 days, history data can reach ~36MB,
which exceeded the old 10MB limit and caused constant file churn. Increase
the limit to 50MB and make it configurable via HISTORY_MAX_BYTES in .env.
Resolves #12
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
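The ~36MB estimate can be reproduced with simple arithmetic: 50 drives x 24 checks/day x 30 days = 36,000 records. The ~1 KB per record used below is inferred from the stated total, not a measured value.

```python
def estimate_history_bytes(drives=50, checks_per_day=24, days=30,
                           bytes_per_record=1000):
    """Rough upper bound on history size; bytes_per_record is an assumption."""
    return drives * checks_per_day * days * bytes_per_record


# 36,000 records at ~1 KB each sits between the old and new limits.
estimate = estimate_history_bytes()
```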
Change default log level from DEBUG to INFO to reduce noise during
hourly execution. Add --verbose/-v CLI flag to enable DEBUG logging
when needed for troubleshooting.
Resolves #16
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
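A minimal argparse sketch of the flag described above. `build_parser` and `log_level_from_args` are hypothetical helper names.

```python
import argparse
import logging


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable DEBUG logging")
    return parser


def log_level_from_args(argv):
    """INFO by default; DEBUG only when -v/--verbose is passed."""
    args = build_parser().parse_args(argv)
    return logging.DEBUG if args.verbose else logging.INFO
```

The result can be passed straight to `logging.basicConfig(level=...)`.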
Wrap all int() conversions in try/except to handle malformed .env values
gracefully. Validate TICKET_API_KEY is not empty or placeholder value,
logging a warning instead of raising to preserve dry-run compatibility.
Resolves #17
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
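A sketch of the defensive parsing described above. The helper names and the placeholder strings in `PLACEHOLDER_KEYS` are assumptions; the daemon's actual placeholder list is not shown here.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical placeholder values that indicate an unconfigured key.
PLACEHOLDER_KEYS = {"", "changeme", "your_api_key_here"}


def env_int(env, key, default):
    """Parse an integer setting, falling back to default on bad input."""
    raw = env.get(key)
    if raw is None:
        return default
    try:
        return int(raw)
    except (TypeError, ValueError):
        logger.warning("Invalid integer for %s: %r; using %s", key, raw, default)
        return default


def api_key_is_usable(env):
    """Warn instead of raising so dry-run mode still works without a key."""
    key = (env.get("TICKET_API_KEY") or "").strip()
    if key.lower() in PLACEHOLDER_KEYS:
        logger.warning("TICKET_API_KEY is empty or a placeholder")
        return False
    return True
```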
Add escape function to sanitize backslashes, double quotes, and newlines
in label values per Prometheus text format spec. Prevents corrupted
metrics output from model names or paths containing these characters.
Resolves #10
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
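The escaping rule can be sketched as three replacements. The function name `escape_label_value` is illustrative; the order matters, since backslashes must be escaped first or the backslashes added for quotes and newlines would be doubled.

```python
def escape_label_value(value):
    """Escape a label value for the Prometheus text exposition format."""
    return (
        str(value)
        .replace("\\", "\\\\")  # backslash first
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
```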
Add timeout=30 to smartctl -i calls in _get_drive_details() and
_check_disk_firmware(), and dmidecode in _check_memory_usage().
Add TimeoutExpired handler in _check_disk_firmware(). Prevents
potential hangs when drives or system tools become unresponsive.
Resolves #9
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
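A sketch of the timeout pattern, assuming `subprocess.run` is the call being guarded. `run_with_timeout` is a hypothetical wrapper, and returning None on failure is one possible convention, not necessarily the daemon's.

```python
import subprocess


def run_with_timeout(cmd, timeout=30):
    """Return stdout, or None if the tool hangs or is not installed."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return None
    except FileNotFoundError:
        return None
    return result.stdout
```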
Replace digits[:2] truncation with regex extraction of complete number.
Previously "123°C" would be parsed as 12 instead of 123.
Resolves #8
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
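The difference between truncation and regex extraction, as a short sketch. `parse_temperature` is an illustrative name, and taking the first run of digits is an assumption about which number in the field is the temperature.

```python
import re


def parse_temperature(raw):
    """Extract the first complete integer from a temperature field."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else None
```

Taking the first two digit characters of "123°C" gives 12; the regex consumes the whole run of digits and gives 123.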
Check both file existence AND size > 0 before opening in r+ mode.
Previously, an empty file (0 bytes) would be opened in r+ mode, causing
json.load() to fail on the empty content.
Resolves #7
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
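A sketch of the existence-plus-size guard. `load_history` and the empty-list default are assumptions about the surrounding code.

```python
import json
import os


def load_history(path):
    """Load existing history, treating a missing or 0-byte file as empty."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r+") as f:
            return json.load(f)
    return []
```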
Move NEW_DRIVE_HOURS_THRESHOLD (720h) and SMART_ERROR_RECENT_HOURS (168h)
from inline literals to configurable CONFIG entries with .env support.
Resolves #20
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Exclude manufacturer operation counters (Seek_Error_Rate,
  Command_Timeout, Raw_Read_Error_Rate) from critical issue
  count to prevent false P1 escalation
- Fix missing space after [ceph] tag in ticket titles
  Before: [hostname][auto][ceph]Ceph HEALTH_WARN
  After:  [hostname][auto][ceph] Ceph HEALTH_WARN
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add comprehensive Ceph cluster health monitoring
  - Check cluster health status (HEALTH_OK/WARN/ERR)
  - Monitor cluster usage with configurable thresholds
  - Track OSD status (up/down) per node
  - Separate cluster-wide vs node-specific issues
- Cluster-wide ticket deduplication
  - Add [cluster-wide] scope tag for Ceph issues
  - Cluster-wide issues deduplicate across all nodes
  - Node-specific issues (OSD down) include hostname
- Add Prometheus metrics export
  - export_prometheus_metrics() method
  - write_prometheus_metrics() for textfile collector
  - --metrics CLI flag to output metrics to stdout
  - --export-json CLI flag to export health report as JSON
- Add Grafana dashboard template (grafana-dashboard.json)
- Add .gitignore
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add P5 (LOW) priority for informational/minimal impact alerts
- Expand ISSUE_PRIORITIES from 7 to 40+ comprehensive mappings
- Fix TICKET_TYPES to match tinker_tickets API (Issue, Problem, Task,
Maintenance, Upgrade, Install, Request)
- Fix TICKET_CATEGORIES to only Hardware and Software
- Add P1 escalation logic via _count_critical_issues() helper
- Rewrite _determine_ticket_priority() with full P1-P5 support
- Add CONFIG options: INCLUDE_INFO_TICKETS, PRIORITY_ESCALATION_THRESHOLD
- Filter INFO-level alerts from ticket creation by default
- Update _categorize_issue() to use valid ticket types
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added intelligent categorization to match tickets with correct category and type
instead of defaulting everything to Hardware/Problem.
Changes:
- Added TICKET_CATEGORIES and TICKET_TYPES mappings for API consistency
- Created _categorize_issue() method to determine proper classification:
  Hardware Issues:
  - SMART/drive/disk errors → Hardware + Incident (critical/failed)
  - SMART warnings → Hardware + Problem (needs investigation)
  Software Issues:
  - LXC/container/storage usage/CPU → Software category
  - Critical levels → Software + Incident (service degradation)
  - Warning levels → Software + Problem (preventive investigation)
  Network Issues:
  - Network failures/unreachable → Network + Incident
  - Network warnings → Network + Problem
- Updated ticket creation to use _categorize_issue() and _determine_ticket_priority()
- Tickets now have correct tags: [incident] vs [problem] instead of always [maintenance]
- Category field in API payload now matches issue type (Hardware/Software/Network)
- Type field in API payload now reflects actual situation (Incident/Problem/Task)
Examples:
- "LXC storage usage >80%" → Software + Problem
- "Critical SMART errors" → Hardware + Incident
- "High CPU usage" → Software + Problem
- "Network unreachable" → Network + Incident
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
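A simplified sketch of the classification rules listed above. The keyword lists and the function name `categorize_issue` are assumptions chosen to reproduce the four examples; the real _categorize_issue() may use different matching.

```python
def categorize_issue(issue):
    """Return (category, type) for an issue description."""
    text = issue.lower()
    # Assumed severity keywords; anything else is treated as a warning.
    critical = any(w in text for w in ("critical", "failed", "failure", "unreachable"))
    if any(w in text for w in ("smart", "drive", "disk")):
        category = "Hardware"
    elif "network" in text:
        category = "Network"
    else:
        category = "Software"
    return category, ("Incident" if critical else "Problem")
```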
Fixed all box alignment issues and improved visual consistency:
- Standardized box width to 78 chars across all sections
- Unified field width calculations (62 chars for values)
- Fixed executive summary box with proper dynamic width
- Fixed drive specifications box alignment
- Fixed drive timeline box with proper field widths
- Fixed SMART status box and improved temperature handling (None check)
- Fixed SMART attributes box with consistent widths
- Improved partition boxes:
  - 50-char usage meter (2% per block) instead of 20-char
  - Added percentage display next to meter
  - Truncate long mountpoints in header to prevent overflow
  - Consistent field widths across all fields
- Fixed firmware alerts box alignment
All boxes now use consistent Unicode box-drawing characters (┏━┓┃┗┛│)
with proper width calculations for perfect alignment.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
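The 50-block meter (2% per block) can be sketched as follows. The fill characters and the function name `usage_meter` are assumptions; only the width and the percentage suffix come from the description above.

```python
def usage_meter(percent, width=50):
    """Render a fixed-width bar followed by the percentage."""
    percent = max(0.0, min(100.0, float(percent)))  # clamp out-of-range input
    filled = int(percent * width / 100)
    return "█" * filled + "░" * (width - filled) + f" {percent:.1f}%"
```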
Problem: Drive capacity was being extracted but never inserted into ticket titles.
The drive_size variable was calculated from drive details but omitted from the
ticket_title string construction.
Solution: Added drive_size to ticket title format between category and issue.
Example ticket titles now show:
- Before: "[hostname][auto][hardware]Drive /dev/sda has SMART issues..."
- After: "[hostname][auto][hardware][16.0 TB] Drive /dev/sda has SMART issues..."
This makes it easier to identify which drives need attention at a glance.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
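A sketch of the title construction after the fix. `build_ticket_title` and its parameters are hypothetical; the tag order follows the "After" example above.

```python
def build_ticket_title(hostname, category, issue, drive_size=None):
    """Insert the drive size tag between the category tag and the issue text."""
    size_tag = f"[{drive_size}]" if drive_size else ""
    return f"[{hostname}][auto][{category}]{size_tag} {issue}"
```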
Added comprehensive manual execution section with both:
- Direct execution from local file (python3 hwmonDaemon.py)
- Remote execution matching systemd service (one-liner download+exec)
Both modes include dry-run and normal execution examples for testing
and production use.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Problem: Seagate drives were triggering tickets for "Critical Seek_Error_Rate"
and "Critical Command_Timeout" even though these are operation counters used by
the manufacturer, not actual errors.
Solution: Added filtering in _detect_issues() method to skip known manufacturer
operation counters:
- Seek_Error_Rate (Seagate/WD operation counter)
- Command_Timeout (OOS/Seagate operation counter)
- Raw_Read_Error_Rate (Seagate/WD operation counter)
These attributes are already correctly excluded from monitoring in manufacturer
profiles, but were still appearing in smart_issues list. This fix prevents them
from creating tickets while still catching legitimate SMART errors.
Changes:
- hwmonDaemon.py:1351-1378 - Added operation counter filtering in _detect_issues()
- Added debug logging when filtering manufacturer counters
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
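The filtering step can be sketched as a substring check against the three counter names. Representing issues as plain strings and the name `filter_smart_issues` are assumptions about the shape of the smart_issues list.

```python
MANUFACTURER_OPERATION_COUNTERS = (
    "Seek_Error_Rate",
    "Command_Timeout",
    "Raw_Read_Error_Rate",
)


def filter_smart_issues(smart_issues):
    """Drop issues that only mention manufacturer operation counters."""
    return [
        issue for issue in smart_issues
        if not any(name in issue for name in MANUFACTURER_OPERATION_COUNTERS)
    ]
```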
Critical fixes implemented:
- Add 10MB storage limit with automatic cleanup of old history files
- Add file locking (fcntl) to prevent race conditions in history writes
- Disable SMART monitoring for unreliable Ridata drives
- Fix bare except clause in _read_ecc_count() to properly catch errors
- Add timeouts to all network and subprocess calls (10s for API, 30s for subprocess)
- Fix unchecked regex in ticket creation to prevent AttributeError
- Add JSON decode error handling for ticket API responses
Service configuration improvements:
- hwmon.timer: Reduce jitter from 300s to 60s, add Persistent=true
- hwmon.service: Add Restart=on-failure, TimeoutStartSec=300, logging to journal
These changes improve reliability, prevent hung processes, eliminate race
conditions, and add proper error handling throughout the daemon.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
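A sketch of a locked history write using fcntl (Unix only). The function name `append_history` and the single-JSON-list file layout are assumptions; the daemon's history format may differ.

```python
import fcntl
import json


def append_history(path, record):
    """Append one record while holding an exclusive lock on the file."""
    with open(path, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            f.seek(0)
            content = f.read()
            history = json.loads(content) if content.strip() else []
            history.append(record)
            f.seek(0)
            f.truncate()
            json.dump(history, f)
            f.flush()  # push data to the OS before releasing the lock
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Holding the lock across the read, modify, and write steps is what prevents two overlapping runs from losing each other's records.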