Commit Graph

164 Commits

Author SHA1 Message Date
1e84144e29 Fix P1 escalation false positives and Ceph title spacing
- Exclude manufacturer operation counters (Seek_Error_Rate,
  Command_Timeout, Raw_Read_Error_Rate) from critical issue
  count to prevent false P1 escalation

- Fix missing space after [ceph] tag in ticket titles
  Before: [hostname][auto][ceph]Ceph HEALTH_WARN
  After:  [hostname][auto][ceph] Ceph HEALTH_WARN

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:59:58 -05:00
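
The counter exclusion described in 1e84144e29 could be sketched roughly as below. Only the three attribute names come from the commit message; the helper name `count_critical_issues` and the assumption that issues are plain strings containing the word "critical" are illustrative.

```python
# Attribute names are taken from the commit message; the rest is an assumed shape.
OPERATION_COUNTERS = ("Seek_Error_Rate", "Command_Timeout", "Raw_Read_Error_Rate")


def count_critical_issues(issues):
    """Count issues flagged critical, ignoring manufacturer operation counters."""
    count = 0
    for issue in issues:
        if "critical" not in issue.lower():
            continue  # not a critical-level issue
        if any(counter in issue for counter in OPERATION_COUNTERS):
            continue  # operation counter, should not push a ticket towards P1
        count += 1
    return count
```

With this shape, a "Critical Seek_Error_Rate" entry no longer contributes to the count that decides P1 escalation, while other critical SMART attributes still do.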
6d959eff02 Fix duplicate [ceph] tag in ticket titles
Remove [ceph] marker from issue text since _categorize_issue
already adds it as the issue_tag.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:56:25 -05:00
0f8918fb8b Add Ceph cluster monitoring and Prometheus metrics export
- Add comprehensive Ceph cluster health monitoring
  - Check cluster health status (HEALTH_OK/WARN/ERR)
  - Monitor cluster usage with configurable thresholds
  - Track OSD status (up/down) per node
  - Separate cluster-wide vs node-specific issues

- Cluster-wide ticket deduplication
  - Add [cluster-wide] scope tag for Ceph issues
  - Cluster-wide issues deduplicate across all nodes
  - Node-specific issues (OSD down) include hostname

- Add Prometheus metrics export
  - export_prometheus_metrics() method
  - write_prometheus_metrics() for textfile collector
  - --metrics CLI flag to output metrics to stdout
  - --export-json CLI flag to export health report as JSON

- Add Grafana dashboard template (grafana-dashboard.json)
- Add .gitignore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:54:16 -05:00
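
A minimal sketch of what a Prometheus textfile export like the one in 0f8918fb8b might look like. The metric names, label set, and function names here are hypothetical and not taken from the repository; the output follows the standard Prometheus text exposition format.

```python
import os
import tempfile

# Assumed numeric encoding of the Ceph health states named in the commit.
HEALTH_VALUES = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}


def render_prometheus_metrics(hostname, ceph_health, osds_up, osds_down):
    """Render a few gauges in the Prometheus text exposition format."""
    health_value = HEALTH_VALUES.get(ceph_health, -1)  # -1 marks an unknown state
    lines = [
        "# HELP hwmon_ceph_health Ceph cluster health (0=OK, 1=WARN, 2=ERR)",
        "# TYPE hwmon_ceph_health gauge",
        f'hwmon_ceph_health{{hostname="{hostname}"}} {health_value}',
        "# HELP hwmon_ceph_osds Number of OSDs on this node by state",
        "# TYPE hwmon_ceph_osds gauge",
        f'hwmon_ceph_osds{{hostname="{hostname}",state="up"}} {osds_up}',
        f'hwmon_ceph_osds{{hostname="{hostname}",state="down"}} {osds_down}',
    ]
    return "\n".join(lines) + "\n"


def write_textfile(path, text):
    """Write via a temp file and rename so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file with mode 0600; relax it so an exporter
    # running as a different user can read the result.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

The rename step matters for a textfile collector because the collector may scan the directory at any moment; writing in place could expose a half-written file.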
3322c5878a Upgrade priority system and fix ticket type alignment
- Add P5 (LOW) priority for informational/minimal impact alerts
- Expand ISSUE_PRIORITIES from 7 to 40+ comprehensive mappings
- Fix TICKET_TYPES to match tinker_tickets API (Issue, Problem, Task,
  Maintenance, Upgrade, Install, Request)
- Fix TICKET_CATEGORIES to only Hardware and Software
- Add P1 escalation logic via _count_critical_issues() helper
- Rewrite _determine_ticket_priority() with full P1-P5 support
- Add CONFIG options: INCLUDE_INFO_TICKETS, PRIORITY_ESCALATION_THRESHOLD
- Filter INFO-level alerts from ticket creation by default
- Update _categorize_issue() to use valid ticket types

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:24:35 -05:00
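
The P1-P5 scheme with escalation from 3322c5878a might be sketched as follows. The table entries, the default priority, the threshold value, and the `>=` comparison are all assumptions; only the names `ISSUE_PRIORITIES` and `PRIORITY_ESCALATION_THRESHOLD` appear in the commit message.

```python
# Hypothetical subset of the keyword table; 1 is most urgent (P1), 5 least (P5).
ISSUE_PRIORITIES = {
    "smart failure": 1,
    "ceph health_err": 1,
    "osd down": 2,
    "ceph health_warn": 3,
    "high cpu usage": 4,
    "info": 5,
}
DEFAULT_PRIORITY = 3               # assumed fallback for unmatched issues
PRIORITY_ESCALATION_THRESHOLD = 3  # assumed value


def determine_ticket_priority(issue, critical_count):
    """Return 1-5 for an issue, escalating to P1 when many criticals coincide."""
    if critical_count >= PRIORITY_ESCALATION_THRESHOLD:
        return 1
    text = issue.lower()
    matches = [prio for key, prio in ISSUE_PRIORITIES.items() if key in text]
    # When several keywords match, the most urgent one wins.
    return min(matches) if matches else DEFAULT_PRIORITY
```

Taking the minimum across matches is one possible design choice: it errs towards urgency when an issue string mentions more than one condition.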
0f81d015cd Implement proper ticket categorization based on issue type
Added intelligent categorization to match tickets with correct category and type
instead of defaulting everything to Hardware/Problem.

Changes:
- Added TICKET_CATEGORIES and TICKET_TYPES mappings for API consistency
- Created _categorize_issue() method to determine proper classification:

  Hardware Issues:
  - SMART/drive/disk errors → Hardware + Incident (critical/failed)
  - SMART warnings → Hardware + Problem (needs investigation)

  Software Issues:
  - LXC/container/storage usage/CPU → Software category
  - Critical levels → Software + Incident (service degradation)
  - Warning levels → Software + Problem (preventive investigation)

  Network Issues:
  - Network failures/unreachable → Network + Incident
  - Network warnings → Network + Problem

- Updated ticket creation to use _categorize_issue() and _determine_ticket_priority()
- Tickets now have correct tags: [incident] vs [problem] instead of always [maintenance]
- Category field in API payload now matches issue type (Hardware/Software/Network)
- Type field in API payload now reflects actual situation (Incident/Problem/Task)

Examples:
- "LXC storage usage >80%" → Software + Problem
- "Critical SMART errors" → Hardware + Incident
- "High CPU usage" → Software + Problem
- "Network unreachable" → Network + Incident

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-08 13:26:17 -05:00
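
The mapping described in 0f81d015cd can be illustrated with a small keyword classifier. This is a sketch under assumptions: the keyword lists and the function signature are guesses, and it follows the categories as described in this commit (the later commit 3322c5878a narrows the categories to Hardware and Software).

```python
def categorize_issue(issue):
    """Return (category, ticket_type) for an issue description string."""
    text = issue.lower()
    if any(word in text for word in ("network", "unreachable")):
        category = "Network"
    elif any(word in text for word in ("smart", "drive", "disk")):
        category = "Hardware"
    else:
        # LXC, container, storage usage, CPU and anything unrecognised
        category = "Software"
    # Critical or failed conditions are incidents; warnings are problems.
    is_critical = any(word in text for word in ("critical", "failed", "unreachable"))
    ticket_type = "Incident" if is_critical else "Problem"
    return category, ticket_type
```

Run against the four examples listed in the commit message, this sketch gives the same pairs the message states.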
88afc8f03e Improve ASCII art formatting in ticket descriptions
Fixed all box alignment issues and improved visual consistency:

- Standardized box width to 78 chars across all sections
- Unified field width calculations (62 chars for values)
- Fixed executive summary box with proper dynamic width
- Fixed drive specifications box alignment
- Fixed drive timeline box with proper field widths
- Fixed SMART status box and improved temperature handling (None check)
- Fixed SMART attributes box with consistent widths
- Improved partition boxes:
  - 50-char usage meter (2% per block) instead of 20-char
  - Added percentage display next to meter
  - Truncate long mountpoints in header to prevent overflow
  - Consistent field widths across all fields
- Fixed firmware alerts box alignment

All boxes now use consistent Unicode box-drawing characters (┏━┓┃┗┛│)
with proper width calculations for perfect alignment.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-07 19:44:21 -05:00
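
The fixed-width boxes and the 50-block usage meter from 88afc8f03e could be sketched like this. The function names, the fill characters, and the truncation behaviour are assumptions; the 78-character width, the 2%-per-block meter, and the box-drawing characters come from the commit message.

```python
BOX_WIDTH = 78      # total width including both borders
METER_BLOCKS = 50   # each block represents 2% of usage


def usage_meter(percent):
    """Render a 50-block meter followed by the percentage."""
    percent = max(0.0, min(100.0, float(percent)))
    filled = int(percent // 2)
    return "█" * filled + "░" * (METER_BLOCKS - filled) + f" {percent:.1f}%"


def draw_box(title, lines):
    """Draw a box whose rows are all exactly BOX_WIDTH characters long."""
    inner = BOX_WIDTH - 2
    rows = ["┏" + "━" * inner + "┓"]
    for content in [title, *lines]:
        # Truncate long content, then pad, so the right border stays aligned.
        rows.append("┃" + f" {content}"[:inner].ljust(inner) + "┃")
    rows.append("┗" + "━" * inner + "┛")
    return "\n".join(rows)
```

Note that this counts characters with `len`, which only matches the displayed width when every character occupies a single column in the viewer's font.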
63daa57d80 Fix missing drive capacity in ticket titles
Problem: Drive capacity was being extracted but never inserted into ticket titles.
The drive_size variable was calculated from drive details but omitted from the
ticket_title string construction.

Solution: Added drive_size to ticket title format between category and issue.

Example ticket titles now show:
- Before: "[hostname][auto][hardware]Drive /dev/sda has SMART issues..."
- After:  "[hostname][auto][hardware][16.0 TB] Drive /dev/sda has SMART issues..."

This makes it easier to identify which drives need attention at a glance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 17:15:02 -05:00
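
A minimal sketch of the title format shown in the before/after examples of 63daa57d80. The function name and parameters are hypothetical; only the bracketed tag layout and the single space before the issue text are taken from the examples.

```python
def build_ticket_title(hostname, category, issue, drive_size=None):
    """Build '[host][auto][category][size] issue', omitting size when unknown."""
    tags = [hostname, "auto", category]
    if drive_size:
        tags.append(drive_size)
    prefix = "".join(f"[{tag}]" for tag in tags)
    # A single space separates the tag block from the issue text.
    return f"{prefix} {issue}"
```

Building the tag list first and joining it once keeps the separator logic in a single place, so an optional tag cannot drop the space before the issue text.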
841db13459 Fix false positive ticket creation for manufacturer operation counters
Problem: Seagate drives were triggering tickets for "Critical Seek_Error_Rate"
and "Critical Command_Timeout" even though these are operation counters used by
the manufacturer, not actual errors.

Solution: Added filtering in _detect_issues() method to skip known manufacturer
operation counters:
- Seek_Error_Rate (Seagate/WD operation counter)
- Command_Timeout (OOS/Seagate operation counter)
- Raw_Read_Error_Rate (Seagate/WD operation counter)

These attributes are already correctly excluded from monitoring in manufacturer
profiles, but were still appearing in smart_issues list. This fix prevents them
from creating tickets while still catching legitimate SMART errors.

Changes:
- hwmonDaemon.py:1351-1378 - Added operation counter filtering in _detect_issues()
- Added debug logging when filtering manufacturer counters

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 17:00:32 -05:00
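
The filtering step from 841db13459 might look something like the following. It assumes `smart_issues` is a list of strings that embed the attribute name; the function name and the logger name are illustrative and not taken from `hwmonDaemon.py`.

```python
import logging

logger = logging.getLogger("hwmon")  # assumed logger name

# Attribute names listed in the commit message.
MANUFACTURER_COUNTERS = ("Seek_Error_Rate", "Command_Timeout", "Raw_Read_Error_Rate")


def filter_smart_issues(smart_issues):
    """Drop issues that refer to manufacturer operation counters."""
    kept = []
    for issue in smart_issues:
        matched = next((name for name in MANUFACTURER_COUNTERS if name in issue), None)
        if matched is not None:
            logger.debug("Skipping manufacturer counter %s: %s", matched, issue)
            continue
        kept.append(issue)
    return kept
```

Legitimate SMART errors pass through unchanged, so only the known counters are suppressed before ticket creation.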
fe832c42f3 Fix critical reliability and security issues in hwmonDaemon
Critical fixes implemented:
- Add 10MB storage limit with automatic cleanup of old history files
- Add file locking (fcntl) to prevent race conditions in history writes
- Disable SMART monitoring for unreliable Ridata drives
- Fix bare except clause in _read_ecc_count() to properly catch errors
- Add timeouts to all network and subprocess calls (10s for API, 30s for subprocess)
- Fix unchecked regex in ticket creation to prevent AttributeError
- Add JSON decode error handling for ticket API responses

Service configuration improvements:
- hwmon.timer: Reduce jitter from 300s to 60s, add Persistent=true
- hwmon.service: Add Restart=on-failure, TimeoutStartSec=300, logging to journal

These changes improve reliability, prevent hung processes, eliminate race
conditions, and add proper error handling throughout the daemon.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 16:55:48 -05:00
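
Two of the fixes in fe832c42f3, file locking with `fcntl` and subprocess timeouts, can be illustrated with a short sketch. The function names and the JSON-lines history format are assumptions; the 30-second default matches the subprocess timeout mentioned in the commit message.

```python
import fcntl
import json
import subprocess


def append_history(path, record):
    """Append one JSON record while holding an exclusive advisory lock."""
    with open(path, "a", encoding="utf-8") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            handle.write(json.dumps(record) + "\n")
            handle.flush()
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_with_timeout(command, timeout=30):
    """Run a command and return its stdout, or None if it exceeds the timeout."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None  # caller decides how to report the hung command
    return result.stdout
```

`fcntl.flock` locks are advisory, so they only prevent races between writers that also take the lock; the module is available on Unix-like systems only.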
0577c7fc1b add api key support 2026-01-01 16:01:55 -05:00
546ef066f8 API Key Auth 2026-01-01 15:45:29 -05:00
0326c5142e Updated hdd temp thresholds 2025-09-03 21:06:12 -04:00
0ab728da47 Better manufacturer detection and values 2025-09-03 13:14:43 -04:00
4b68b0b525 Added custom config for OOS12000G 2025-09-03 13:02:32 -04:00
2d6626cece Fixed thresholds for thermals and smart 2025-09-03 12:58:30 -04:00
bc73a691df data retention and large refactor of codebase 2025-09-03 12:43:16 -04:00
3d902620b0 Removed unnecessary logging 2025-09-02 17:50:05 -04:00
cae4bf031b Updated priority system 2025-08-17 09:48:25 -04:00
fb1a9f67e1 Updated CPU threshold 2025-07-25 17:36:21 -04:00
0faf7654d6 Huge update to vendor profiles 2025-07-24 19:15:21 -04:00
a74c4c0309 Erase_Fail_Count matched two values 2025-06-24 15:14:35 -04:00
9a700e9853 Attempted fix for lxc storage 2025-05-29 20:23:21 -04:00
1371592b9e Update LXC storage utilization function 2025-05-29 20:16:50 -04:00
6907f71de1 Updated LXC storage checks 2025-05-29 19:50:17 -04:00
20eb1f9a11 firmware pattern matching 2025-05-29 19:30:06 -04:00
5ac12fd6b7 Correction of deleted code 2025-05-29 19:04:45 -04:00
1e6260a899 Better identification of RiData drives 2025-05-29 19:02:27 -04:00
95a5a8227a NoneType fix? 2025-05-29 12:44:55 -04:00
f8784eddd2 Added null safety checks 2025-05-29 11:44:07 -04:00
147947b8ca Testing manufacturer specific smart tests 2025-05-28 14:59:47 -04:00
22bdaa9401 Updated ticket priorities for different drive failures 2025-05-14 21:22:44 -04:00
40b7eb5641 Updated indices 2025-05-14 21:17:52 -04:00
6fb0d89519 lxc storage indices increased by 1 2025-05-14 21:13:09 -04:00
53b9169da2 test single node change 2025-05-14 21:07:59 -04:00
a34b59ad36 Updated drive firmware checks 2025-05-14 21:01:40 -04:00
0384270dfc Software failure not hardware 2025-05-12 16:12:46 -04:00
1f52a6b4f5 Full traceback to see where error is 2025-05-12 16:04:43 -04:00
c807a6309a Updated mountpoint catching 2025-05-12 16:00:24 -04:00
3d2fdac3f3 Attempt fix 1 2025-05-12 15:53:32 -04:00
af1121e3d9 Updated drive ticket creation 2025-05-12 15:47:14 -04:00
20f51e0b25 Updated lxc file system matching 2025-05-12 15:41:01 -04:00
4fe0a8dbfc Different variable for issue type 2025-05-12 15:35:42 -04:00
65ba24e46d if drive not in issue 2025-05-12 15:32:54 -04:00
e5175f53e5 debug description 2025-05-12 15:22:46 -04:00
bd6d89c4e3 Update make_box 2025-05-12 15:14:06 -04:00
2a6025f5f2 updated parsing 2025-03-09 22:25:29 -04:00
adafa796f1 adjusted lxc storage 2025-03-09 22:17:10 -04:00
f8ea49f099 updated parse size 2025-03-09 22:12:20 -04:00
6a6a400320 adjust regex 2025-03-09 22:09:24 -04:00
8f87403d48 idk 2025-03-09 22:00:23 -04:00