hwmonDaemon

Author	SHA1	Message	Date
Jared Vititoe	1e84144e29	Fix P1 escalation false positives and Ceph title spacing - Exclude manufacturer operation counters (Seek_Error_Rate, Command_Timeout, Raw_Read_Error_Rate) from critical issue count to prevent false P1 escalation - Fix missing space after [ceph] tag in ticket titles Before: [hostname][auto][ceph]Ceph HEALTH_WARN After: [hostname][auto][ceph] Ceph HEALTH_WARN Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-17 15:59:58 -05:00
Jared Vititoe	6d959eff02	Fix duplicate [ceph] tag in ticket titles Remove [ceph] marker from issue text since _categorize_issue already adds it as the issue_tag. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-17 15:56:25 -05:00
Jared Vititoe	0f8918fb8b	Add Ceph cluster monitoring and Prometheus metrics export - Add comprehensive Ceph cluster health monitoring - Check cluster health status (HEALTH_OK/WARN/ERR) - Monitor cluster usage with configurable thresholds - Track OSD status (up/down) per node - Separate cluster-wide vs node-specific issues - Cluster-wide ticket deduplication - Add [cluster-wide] scope tag for Ceph issues - Cluster-wide issues deduplicate across all nodes - Node-specific issues (OSD down) include hostname - Add Prometheus metrics export - export_prometheus_metrics() method - write_prometheus_metrics() for textfile collector - --metrics CLI flag to output metrics to stdout - --export-json CLI flag to export health report as JSON - Add Grafana dashboard template (grafana-dashboard.json) - Add .gitignore Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-17 15:54:16 -05:00
Jared Vititoe	3322c5878a	Upgrade priority system and fix ticket type alignment - Add P5 (LOW) priority for informational/minimal impact alerts - Expand ISSUE_PRIORITIES from 7 to 40+ comprehensive mappings - Fix TICKET_TYPES to match tinker_tickets API (Issue, Problem, Task, Maintenance, Upgrade, Install, Request) - Fix TICKET_CATEGORIES to only Hardware and Software - Add P1 escalation logic via _count_critical_issues() helper - Rewrite _determine_ticket_priority() with full P1-P5 support - Add CONFIG options: INCLUDE_INFO_TICKETS, PRIORITY_ESCALATION_THRESHOLD - Filter INFO-level alerts from ticket creation by default - Update _categorize_issue() to use valid ticket types Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-17 15:24:35 -05:00
Jared Vititoe	0f81d015cd	Implement proper ticket categorization based on issue type Added intelligent categorization to match tickets with correct category and type instead of defaulting everything to Hardware/Problem. Changes: - Added TICKET_CATEGORIES and TICKET_TYPES mappings for API consistency - Created _categorize_issue() method to determine proper classification: Hardware Issues: - SMART/drive/disk errors → Hardware + Incident (critical/failed) - SMART warnings → Hardware + Problem (needs investigation) Software Issues: - LXC/container/storage usage/CPU → Software category - Critical levels → Software + Incident (service degradation) - Warning levels → Software + Problem (preventive investigation) Network Issues: - Network failures/unreachable → Network + Incident - Network warnings → Network + Problem - Updated ticket creation to use _categorize_issue() and _determine_ticket_priority() - Tickets now have correct tags: [incident] vs [problem] instead of always [maintenance] - Category field in API payload now matches issue type (Hardware/Software/Network) - Type field in API payload now reflects actual situation (Incident/Problem/Task) Examples: - "LXC storage usage >80%" → Software + Problem - "Critical SMART errors" → Hardware + Incident - "High CPU usage" → Software + Problem - "Network unreachable" → Network + Incident Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-08 13:26:17 -05:00
Jared Vititoe	88afc8f03e	Improve ASCII art formatting in ticket descriptions Fixed all box alignment issues and improved visual consistency: - Standardized box width to 78 chars across all sections - Unified field width calculations (62 chars for values) - Fixed executive summary box with proper dynamic width - Fixed drive specifications box alignment - Fixed drive timeline box with proper field widths - Fixed SMART status box and improved temperature handling (None check) - Fixed SMART attributes box with consistent widths - Improved partition boxes: - 50-char usage meter (2% per block) instead of 20-char - Added percentage display next to meter - Truncate long mountpoints in header to prevent overflow - Consistent field widths across all fields - Fixed firmware alerts box alignment All boxes now use consistent Unicode box-drawing characters (┏━┓┃┗┛│) with proper width calculations for perfect alignment. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-07 19:44:21 -05:00
Jared Vititoe	63daa57d80	Fix missing drive capacity in ticket titles Problem: Drive capacity was being extracted but never inserted into ticket titles. The drive_size variable was calculated from drive details but omitted from the ticket_title string construction. Solution: Added drive_size to ticket title format between category and issue. Example ticket titles now show: - Before: "[hostname][auto][hardware]Drive /dev/sda has SMART issues..." - After: "[hostname][auto][hardware][16.0 TB] Drive /dev/sda has SMART issues..." This makes it easier to identify which drives need attention at a glance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-06 17:15:02 -05:00
Jared Vititoe	841db13459	Fix false positive ticket creation for manufacturer operation counters Problem: Seagate drives were triggering tickets for "Critical Seek_Error_Rate" and "Critical Command_Timeout" even though these are operation counters used by the manufacturer, not actual errors. Solution: Added filtering in _detect_issues() method to skip known manufacturer operation counters: - Seek_Error_Rate (Seagate/WD operation counter) - Command_Timeout (OOS/Seagate operation counter) - Raw_Read_Error_Rate (Seagate/WD operation counter) These attributes are already correctly excluded from monitoring in manufacturer profiles, but were still appearing in smart_issues list. This fix prevents them from creating tickets while still catching legitimate SMART errors. Changes: - hwmonDaemon.py:1351-1378 - Added operation counter filtering in _detect_issues() - Added debug logging when filtering manufacturer counters 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-06 17:00:32 -05:00
Jared Vititoe	fe832c42f3	Fix critical reliability and security issues in hwmonDaemon Critical fixes implemented: - Add 10MB storage limit with automatic cleanup of old history files - Add file locking (fcntl) to prevent race conditions in history writes - Disable SMART monitoring for unreliable Ridata drives - Fix bare except clause in _read_ecc_count() to properly catch errors - Add timeouts to all network and subprocess calls (10s for API, 30s for subprocess) - Fix unchecked regex in ticket creation to prevent AttributeError - Add JSON decode error handling for ticket API responses Service configuration improvements: - hwmon.timer: Reduce jitter from 300s to 60s, add Persistent=true - hwmon.service: Add Restart=on-failure, TimeoutStartSec=300, logging to journal These changes improve reliability, prevent hung processes, eliminate race conditions, and add proper error handling throughout the daemon. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-06 16:55:48 -05:00
Jared Vititoe	0577c7fc1b	add api key support	2026-01-01 16:01:55 -05:00
Jared Vititoe	546ef066f8	API Key Auth	2026-01-01 15:45:29 -05:00
Jared Vititoe	0326c5142e	Updated hdd temp thresholds	2025-09-03 21:06:12 -04:00
Jared Vititoe	0ab728da47	Better manufacturer detection and values	2025-09-03 13:14:43 -04:00
Jared Vititoe	4b68b0b525	Added custom config for OOS12000G	2025-09-03 13:02:32 -04:00
Jared Vititoe	2d6626cece	Fixed thresholds for thermals and smart	2025-09-03 12:58:30 -04:00
Jared Vititoe	bc73a691df	data retention and large refactor of codebase	2025-09-03 12:43:16 -04:00
Jared Vititoe	3d902620b0	Removed unnecessary logging	2025-09-02 17:50:05 -04:00
Jared Vititoe	cae4bf031b	Updated priority system	2025-08-17 09:48:25 -04:00
Jared Vititoe	fb1a9f67e1	Updated CPU threshold	2025-07-25 17:36:21 -04:00
Jared Vititoe	0faf7654d6	Huge update to vendor profiles	2025-07-24 19:15:21 -04:00
Jared Vititoe	a74c4c0309	Erase_Fail_Count matched two values	2025-06-24 15:14:35 -04:00
Jared Vititoe	9a700e9853	Attempted fix for lxc storage	2025-05-29 20:23:21 -04:00
Jared Vititoe	1371592b9e	Update LXC storage utilization function	2025-05-29 20:16:50 -04:00
Jared Vititoe	6907f71de1	Updated LXC storage checks	2025-05-29 19:50:17 -04:00
Jared Vititoe	20eb1f9a11	firmware pattern matching	2025-05-29 19:30:06 -04:00
Jared Vititoe	5ac12fd6b7	Correction of deleted code	2025-05-29 19:04:45 -04:00
Jared Vititoe	1e6260a899	Better identification of RiData drives	2025-05-29 19:02:27 -04:00
Jared Vititoe	95a5a8227a	NoneType fix?	2025-05-29 12:44:55 -04:00
Jared Vititoe	f8784eddd2	Added null safety checks	2025-05-29 11:44:07 -04:00
Jared Vititoe	147947b8ca	Testing manufacturer specific smart tests	2025-05-28 14:59:47 -04:00
Jared Vititoe	22bdaa9401	Updated ticket priorities for different drive failures	2025-05-14 21:22:44 -04:00
Jared Vititoe	40b7eb5641	Updated indices	2025-05-14 21:17:52 -04:00
Jared Vititoe	6fb0d89519	lxc storage indices increased by 1	2025-05-14 21:13:09 -04:00
Jared Vititoe	53b9169da2	test single node change	2025-05-14 21:07:59 -04:00
Jared Vititoe	a34b59ad36	Updated drive firmware checks	2025-05-14 21:01:40 -04:00
Jared Vititoe	0384270dfc	Software failure not hardware	2025-05-12 16:12:46 -04:00
Jared Vititoe	1f52a6b4f5	Full traceback to see where error is	2025-05-12 16:04:43 -04:00
Jared Vititoe	c807a6309a	Updated mountpoint catching	2025-05-12 16:00:24 -04:00
Jared Vititoe	3d2fdac3f3	Attempt fix 1	2025-05-12 15:53:32 -04:00
Jared Vititoe	af1121e3d9	Updated drive ticket creation	2025-05-12 15:47:14 -04:00
Jared Vititoe	20f51e0b25	Updated lxc file system matching	2025-05-12 15:41:01 -04:00
Jared Vititoe	4fe0a8dbfc	Different variable for issue type	2025-05-12 15:35:42 -04:00
Jared Vititoe	65ba24e46d	if drive not in issue	2025-05-12 15:32:54 -04:00
Jared Vititoe	e5175f53e5	debug description	2025-05-12 15:22:46 -04:00
Jared Vititoe	bd6d89c4e3	Update make_box	2025-05-12 15:14:06 -04:00
Jared Vititoe	2a6025f5f2	updated parsing	2025-03-09 22:25:29 -04:00
Jared Vititoe	adafa796f1	adjusted lxc storage	2025-03-09 22:17:10 -04:00
Jared Vititoe	f8ea49f099	updated parse size	2025-03-09 22:12:20 -04:00
Jared Vititoe	6a6a400320	adjust regex	2025-03-09 22:09:24 -04:00
Jared Vititoe	8f87403d48	idk	2025-03-09 22:00:23 -04:00

164 Commits