- Exclude manufacturer operation counters (Seek_Error_Rate,
  Command_Timeout, Raw_Read_Error_Rate) from critical issue
  count to prevent false P1 escalation
- Fix missing space after [ceph] tag in ticket titles
  Before: [hostname][auto][ceph]Ceph HEALTH_WARN
  After:  [hostname][auto][ceph] Ceph HEALTH_WARN
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add comprehensive Ceph cluster health monitoring
  - Check cluster health status (HEALTH_OK/WARN/ERR)
  - Monitor cluster usage with configurable thresholds
  - Track OSD status (up/down) per node
  - Separate cluster-wide vs node-specific issues
- Cluster-wide ticket deduplication
  - Add [cluster-wide] scope tag for Ceph issues
  - Cluster-wide issues deduplicate across all nodes
  - Node-specific issues (OSD down) include hostname
- Add Prometheus metrics export
  - export_prometheus_metrics() method
  - write_prometheus_metrics() for textfile collector
  - --metrics CLI flag to output metrics to stdout
- --export-json CLI flag to export health report as JSON
- Add Grafana dashboard template (grafana-dashboard.json)
- Add .gitignore
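
A minimal sketch of what a textfile-collector export could look like. The metric names, the numeric health encoding, and the helper names `format_metrics()` and `write_metrics()` are assumptions for illustration; they are not taken from export_prometheus_metrics() or write_prometheus_metrics().

```python
import os
import tempfile

# Hypothetical numeric encoding of Ceph health states.
HEALTH_VALUES = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}


def format_metrics(hostname: str, health: str, usage_percent: float) -> str:
    """Render two gauges in the Prometheus text exposition format."""
    lines = [
        "# HELP hwmon_ceph_health_status Ceph health (0=OK, 1=WARN, 2=ERR)",
        "# TYPE hwmon_ceph_health_status gauge",
        f'hwmon_ceph_health_status{{hostname="{hostname}"}} {HEALTH_VALUES[health]}',
        "# HELP hwmon_ceph_usage_percent Ceph cluster usage in percent",
        "# TYPE hwmon_ceph_usage_percent gauge",
        f'hwmon_ceph_usage_percent{{hostname="{hostname}"}} {usage_percent}',
    ]
    return "\n".join(lines) + "\n"


def write_metrics(path: str, text: str) -> None:
    """Write via a temp file and rename so a scrape never sees a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file as 0600; relax it so the exporter can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

The temp-file-plus-rename step is a design choice of this sketch: the rename is atomic on the same filesystem, so the collector reads either the old file or the new one.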
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add P5 (LOW) priority for informational/minimal impact alerts
- Expand ISSUE_PRIORITIES from 7 to 40+ comprehensive mappings
- Fix TICKET_TYPES to match tinker_tickets API (Issue, Problem, Task,
  Maintenance, Upgrade, Install, Request)
- Fix TICKET_CATEGORIES to only Hardware and Software
- Add P1 escalation logic via _count_critical_issues() helper
- Rewrite _determine_ticket_priority() with full P1-P5 support
- Add CONFIG options: INCLUDE_INFO_TICKETS, PRIORITY_ESCALATION_THRESHOLD
- Filter INFO-level alerts from ticket creation by default
- Update _categorize_issue() to use valid ticket types
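
A rough sketch of how the P1 escalation could work, under stated assumptions: the keyword table below is an illustrative subset rather than the real ISSUE_PRIORITIES, the threshold value is a guess, and the function names only mirror _count_critical_issues() and _determine_ticket_priority() without reproducing them.

```python
# Illustrative subset only; the real table has 40+ entries.
ISSUE_PRIORITIES = {
    "smart failure": 2,
    "osd down": 2,
    "high temperature": 3,
    "storage usage": 4,
    "firmware": 5,
}
DEFAULT_PRIORITY = 3               # assumed fallback
PRIORITY_ESCALATION_THRESHOLD = 3  # assumed value of the CONFIG option


def count_critical_issues(issues):
    """Count issues whose text marks them as critical."""
    return sum(1 for issue in issues if "critical" in issue.lower())


def determine_ticket_priority(issue, all_issues):
    """Return 1-5; escalate to P1 when enough critical issues coincide."""
    if count_critical_issues(all_issues) >= PRIORITY_ESCALATION_THRESHOLD:
        return 1
    text = issue.lower()
    for keyword, priority in ISSUE_PRIORITIES.items():
        if keyword in text:
            return priority
    return DEFAULT_PRIORITY
```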
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added issue categorization so each ticket gets the correct category and type
instead of defaulting everything to Hardware/Problem.
Changes:
- Added TICKET_CATEGORIES and TICKET_TYPES mappings for API consistency
- Created _categorize_issue() method to determine proper classification:
  Hardware Issues:
  - SMART/drive/disk errors → Hardware + Incident (critical/failed)
  - SMART warnings → Hardware + Problem (needs investigation)
  Software Issues:
  - LXC/container/storage usage/CPU → Software category
  - Critical levels → Software + Incident (service degradation)
  - Warning levels → Software + Problem (preventive investigation)
  Network Issues:
  - Network failures/unreachable → Network + Incident
  - Network warnings → Network + Problem
- Updated ticket creation to use _categorize_issue() and _determine_ticket_priority()
- Tickets now have correct tags: [incident] vs [problem] instead of always [maintenance]
- Category field in API payload now matches issue type (Hardware/Software/Network)
- Type field in API payload now reflects actual situation (Incident/Problem/Task)
Examples:
- "LXC storage usage >80%" → Software + Problem
- "Critical SMART errors" → Hardware + Incident
- "High CPU usage" → Software + Problem
- "Network unreachable" → Network + Incident
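
The classification rules above can be sketched roughly as follows. This is a simplified keyword match, assuming issues arrive as plain strings; the keyword lists and the function name `categorize_issue()` are illustrative, and the actual _categorize_issue() may weigh things differently.

```python
def categorize_issue(issue: str) -> tuple[str, str]:
    """Return (category, type) for an issue description."""
    text = issue.lower()

    # Severity decides Incident vs Problem (assumed keyword list).
    severe_words = ("critical", "failed", "failure", "unreachable")
    severe = any(word in text for word in severe_words)

    # Subject decides the category; anything unmatched is treated as Software.
    if any(word in text for word in ("smart", "drive", "disk")):
        category = "Hardware"
    elif "network" in text:
        category = "Network"
    else:
        category = "Software"

    ticket_type = "Incident" if severe else "Problem"
    return category, ticket_type
```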
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed all box alignment issues and improved visual consistency:
- Standardized box width to 78 chars across all sections
- Unified field width calculations (62 chars for values)
- Fixed executive summary box with proper dynamic width
- Fixed drive specifications box alignment
- Fixed drive timeline box with proper field widths
- Fixed SMART status box and improved temperature handling (None check)
- Fixed SMART attributes box with consistent widths
- Improved partition boxes:
  - 50-char usage meter (2% per block) instead of 20-char
  - Added percentage display next to meter
  - Truncate long mountpoints in header to prevent overflow
  - Consistent field widths across all fields
- Fixed firmware alerts box alignment
All boxes now use consistent Unicode box-drawing characters (┏━┓┃┗┛│)
and shared width calculations, so borders line up across sections.
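
For illustration, a small sketch of a fixed-width box line and a 50-block usage meter. It assumes the 78-char width includes the borders and uses `█`/`░` as meter glyphs; both choices, and the helper names, are assumptions rather than the daemon's actual rendering code.

```python
BOX_WIDTH = 78    # total line width, borders included (assumed)
METER_WIDTH = 50  # one block per 2% of usage


def usage_meter(percent: float) -> str:
    """Render a 50-block meter followed by the percentage."""
    clamped = max(0.0, min(100.0, percent))
    filled = int(clamped // 2)
    return "█" * filled + "░" * (METER_WIDTH - filled) + f" {clamped:5.1f}%"


def box_line(content: str) -> str:
    """Pad or truncate content so every line is exactly BOX_WIDTH chars."""
    inner = BOX_WIDTH - 4  # border plus one space of padding on each side
    return "┃ " + content[:inner].ljust(inner) + " ┃"
```

Truncating before padding is what keeps a long mountpoint from pushing the right border out of line.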
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Problem: Drive capacity was being extracted but never inserted into ticket titles.
The drive_size variable was calculated from drive details but omitted from the
ticket_title string construction.
Solution: Added drive_size to ticket title format between category and issue.
Example ticket titles now show:
- Before: "[hostname][auto][hardware]Drive /dev/sda has SMART issues..."
- After: "[hostname][auto][hardware][16.0 TB] Drive /dev/sda has SMART issues..."
This makes it easier to identify which drives need attention at a glance.
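
A minimal sketch of the title format, assuming a hypothetical helper `build_ticket_title()`; the real title is built inline in the ticket creation code and may carry additional tags.

```python
def build_ticket_title(hostname, category, issue, drive_size=None):
    """Join bracketed tags, then a space, then the issue text."""
    tags = [hostname, "auto", category]
    if drive_size:
        # Drive capacity sits between the category tag and the issue text.
        tags.append(drive_size)
    return "".join(f"[{tag}]" for tag in tags) + " " + issue
```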
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Problem: Seagate drives were triggering tickets for "Critical Seek_Error_Rate"
and "Critical Command_Timeout" even though these are operation counters used by
the manufacturer, not actual errors.
Solution: Added filtering in _detect_issues() method to skip known manufacturer
operation counters:
- Seek_Error_Rate (Seagate/WD operation counter)
- Command_Timeout (OOS/Seagate operation counter)
- Raw_Read_Error_Rate (Seagate/WD operation counter)
These attributes are already correctly excluded from monitoring in manufacturer
profiles, but were still appearing in the smart_issues list. This fix prevents
them from creating tickets while still catching legitimate SMART errors.
Changes:
- hwmonDaemon.py:1351-1378 - Added operation counter filtering in _detect_issues()
- Added debug logging when filtering manufacturer counters
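
The filtering can be sketched as below, assuming smart_issues is a list of strings that contain the attribute name. The constant and function names are hypothetical; the actual logic lives inside _detect_issues().

```python
import logging

logger = logging.getLogger(__name__)

# Attributes some manufacturers use as operation counters, not error counts.
MANUFACTURER_OPERATION_COUNTERS = (
    "Seek_Error_Rate",
    "Command_Timeout",
    "Raw_Read_Error_Rate",
)


def filter_operation_counters(smart_issues):
    """Drop issues that only report a manufacturer operation counter."""
    kept = []
    for issue in smart_issues:
        if any(name in issue for name in MANUFACTURER_OPERATION_COUNTERS):
            logger.debug("Skipping manufacturer operation counter: %s", issue)
            continue
        kept.append(issue)
    return kept
```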
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Critical fixes implemented:
- Add 10MB storage limit with automatic cleanup of old history files
- Add file locking (fcntl) to prevent race conditions in history writes
- Disable SMART monitoring for unreliable Ridata drives
- Fix bare except clause in _read_ecc_count() to properly catch errors
- Add timeouts to all network and subprocess calls (10s for API, 30s for subprocess)
- Fix unchecked regex in ticket creation to prevent AttributeError
- Add JSON decode error handling for ticket API responses
Service configuration improvements:
- hwmon.timer: Reduce jitter from 300s to 60s, add Persistent=true
- hwmon.service: Add Restart=on-failure, TimeoutStartSec=300, logging to journal
These changes improve reliability, prevent hung processes, eliminate race
conditions, and add proper error handling throughout the daemon.
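
As an illustration of the history-file fixes, a sketch of a locked append and a size-capped cleanup. It assumes history is stored as JSON lines in a dedicated directory; the helper names and the oldest-first deletion policy are assumptions, not a copy of the daemon's code.

```python
import fcntl
import json
import os

MAX_HISTORY_BYTES = 10 * 1024 * 1024  # 10MB storage limit


def append_history(path, record):
    """Append one JSON line while holding an exclusive advisory lock."""
    with open(path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            handle.write(json.dumps(record) + "\n")
            handle.flush()
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def prune_history(directory, limit=MAX_HISTORY_BYTES):
    """Delete the oldest files until the directory fits within limit."""
    with os.scandir(directory) as entries:
        files = sorted(
            (entry for entry in entries if entry.is_file()),
            key=lambda entry: entry.stat().st_mtime,
        )
    total = sum(entry.stat().st_size for entry in files)
    removed = []
    for entry in files:
        if total <= limit:
            break
        total -= entry.stat().st_size
        os.remove(entry.path)
        removed.append(entry.name)
    return removed
```

Note that `fcntl.flock` is advisory: it only protects against other writers that take the same lock.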
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>