Commit Graph

167 Commits

Author SHA1 Message Date
0f81d015cd Implement proper ticket categorization based on issue type
Added intelligent categorization to match tickets with correct category and type
instead of defaulting everything to Hardware/Problem.

Changes:
- Added TICKET_CATEGORIES and TICKET_TYPES mappings for API consistency
- Created _categorize_issue() method to determine proper classification:

  Hardware Issues:
  - SMART/drive/disk errors → Hardware + Incident (critical/failed)
  - SMART warnings → Hardware + Problem (needs investigation)

  Software Issues:
  - LXC/container/storage usage/CPU → Software category
  - Critical levels → Software + Incident (service degradation)
  - Warning levels → Software + Problem (preventive investigation)

  Network Issues:
  - Network failures/unreachable → Network + Incident
  - Network warnings → Network + Problem

- Updated ticket creation to use _categorize_issue() and _determine_ticket_priority()
- Tickets now have correct tags: [incident] vs [problem] instead of always [maintenance]
- Category field in API payload now matches issue type (Hardware/Software/Network)
- Type field in API payload now reflects actual situation (Incident/Problem/Task)

Examples:
- "LXC storage usage >80%" → Software + Problem
- "Critical SMART errors" → Hardware + Incident
- "High CPU usage" → Software + Problem
- "Network unreachable" → Network + Incident
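
The mapping above can be sketched as a small keyword classifier. This is an illustrative sketch only: the keyword lists and the standalone `categorize_issue()` function are assumptions, since the actual `_categorize_issue()` method body is not shown in this log.

```python
from typing import Tuple

# Hypothetical keyword lists; the daemon's real matching rules are not shown here.
SOFTWARE_KEYWORDS = ("lxc", "container", "storage usage", "cpu")
NETWORK_KEYWORDS = ("network", "unreachable")
CRITICAL_KEYWORDS = ("critical", "failed", "failure", "unreachable")


def categorize_issue(issue: str) -> Tuple[str, str]:
    """Return a (category, ticket_type) pair for a free-text issue string."""
    text = issue.lower()

    # Critical wording means an Incident; anything else is a Problem to investigate.
    is_critical = any(word in text for word in CRITICAL_KEYWORDS)
    ticket_type = "Incident" if is_critical else "Problem"

    if any(word in text for word in SOFTWARE_KEYWORDS):
        return "Software", ticket_type
    if any(word in text for word in NETWORK_KEYWORDS):
        return "Network", ticket_type
    # SMART/drive/disk issues, and anything unmatched, fall back to Hardware.
    return "Hardware", ticket_type


for example in ("LXC storage usage >80%", "Critical SMART errors",
                "High CPU usage", "Network unreachable"):
    print(example, "->", categorize_issue(example))
```

Checking software and network keywords before the hardware fallback keeps the old default (Hardware) for unrecognized issues while letting the new categories take precedence.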

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-08 13:26:17 -05:00
88afc8f03e Improve ASCII art formatting in ticket descriptions
Fixed all box alignment issues and improved visual consistency:

- Standardized box width to 78 chars across all sections
- Unified field width calculations (62 chars for values)
- Fixed executive summary box with proper dynamic width
- Fixed drive specifications box alignment
- Fixed drive timeline box with proper field widths
- Fixed SMART status box and improved temperature handling (None check)
- Fixed SMART attributes box with consistent widths
- Improved partition boxes:
  - 50-char usage meter (2% per block) instead of 20-char
  - Added percentage display next to meter
  - Truncate long mountpoints in header to prevent overflow
  - Consistent field widths across all fields
- Fixed firmware alerts box alignment

All boxes now use consistent Unicode box-drawing characters (┏━┓┃┗┛│)
with proper width calculations for perfect alignment.
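
A minimal sketch of the width arithmetic described above, assuming a 78-character box and a 50-block usage meter. The `make_box()` signature, the `usage_meter()` helper, and the meter glyphs are hypothetical; only the widths and the box-drawing characters come from the commit message.

```python
from typing import List

BOX_WIDTH = 78     # total width including both borders
METER_BLOCKS = 50  # each block represents 2% of usage


def usage_meter(percent: float) -> str:
    """Render a 50-block meter followed by the percentage."""
    clamped = max(0.0, min(100.0, percent))
    filled = int(clamped // 2)
    return "█" * filled + "░" * (METER_BLOCKS - filled) + f" {clamped:.0f}%"


def make_box(title: str, lines: List[str]) -> str:
    """Draw a fixed-width box; long lines are truncated so borders stay aligned."""
    inner = BOX_WIDTH - 2  # space between the left and right border
    field = inner - 2      # one space of padding on each side
    top = "┏" + "━" * inner + "┓"
    bottom = "┗" + "━" * inner + "┛"
    body = [f"┃ {text[:field]:<{field}} ┃" for text in [title, *lines]]
    return "\n".join([top, *body, bottom])


print(make_box("Partition /", ["Usage: " + usage_meter(80)]))
```

Note that `len()` counts code points, so this stays aligned only while the content uses single-width characters.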

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-07 19:44:21 -05:00
63daa57d80 Fix missing drive capacity in ticket titles
Problem: Drive capacity was being extracted but never inserted into ticket titles.
The drive_size variable was calculated from drive details but omitted from the
ticket_title string construction.

Solution: Added drive_size to ticket title format between category and issue.

Example ticket titles now show:
- Before: "[hostname][auto][hardware]Drive /dev/sda has SMART issues..."
- After:  "[hostname][auto][hardware][16.0 TB] Drive /dev/sda has SMART issues..."

This makes it easier to identify which drives need attention at a glance.
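
The title format can be sketched as follows. The `build_ticket_title()` helper and its parameters are hypothetical; only the bracketed layout is taken from the before/after examples above.

```python
from typing import Optional


def build_ticket_title(hostname: str, category: str, issue: str,
                       drive_size: Optional[str] = None) -> str:
    """Assemble a title such as "[host][auto][hardware][16.0 TB] Drive ..."."""
    prefix = f"[{hostname}][auto][{category}]"
    if drive_size:
        # Capacity goes between the category tag and the issue text.
        return f"{prefix}[{drive_size}] {issue}"
    # Without a known capacity, fall back to the older format.
    return f"{prefix}{issue}"


print(build_ticket_title("hostname", "hardware",
                         "Drive /dev/sda has SMART issues", "16.0 TB"))
```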

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 17:15:02 -05:00
72e61bd94e Add manual execution instructions to README
Added comprehensive manual execution section with both:
- Direct execution from local file (python3 hwmonDaemon.py)
- Remote execution matching systemd service (one-liner download+exec)

Both modes include dry-run and normal execution examples for testing
and production use.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 17:03:27 -05:00
841db13459 Fix false positive ticket creation for manufacturer operation counters
Problem: Seagate drives were triggering tickets for "Critical Seek_Error_Rate"
and "Critical Command_Timeout" even though these are operation counters used by
the manufacturer, not actual errors.

Solution: Added filtering in _detect_issues() method to skip known manufacturer
operation counters:
- Seek_Error_Rate (Seagate/WD operation counter)
- Command_Timeout (OOS/Seagate operation counter)
- Raw_Read_Error_Rate (Seagate/WD operation counter)

These attributes are already correctly excluded from monitoring in manufacturer
profiles, but were still appearing in smart_issues list. This fix prevents them
from creating tickets while still catching legitimate SMART errors.

Changes:
- hwmonDaemon.py:1351-1378 - Added operation counter filtering in _detect_issues()
- Added debug logging when filtering manufacturer counters
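
A rough sketch of this kind of filter, assuming `smart_issues` is a list of strings that contain the attribute name. The `filter_operation_counters()` helper is hypothetical and filters unconditionally; the real `_detect_issues()` logic may apply the skip per manufacturer.

```python
import logging
from typing import List

logger = logging.getLogger("hwmon_sketch")

# Attribute names taken from the commit message above.
MANUFACTURER_COUNTERS = ("Seek_Error_Rate", "Command_Timeout", "Raw_Read_Error_Rate")


def filter_operation_counters(smart_issues: List[str]) -> List[str]:
    """Drop issues that only mention known manufacturer operation counters."""
    kept = []
    for issue in smart_issues:
        if any(name in issue for name in MANUFACTURER_COUNTERS):
            logger.debug("Skipping manufacturer operation counter: %s", issue)
            continue
        kept.append(issue)
    return kept


issues = ["Critical Seek_Error_Rate: 123", "Critical Reallocated_Sector_Ct: 8"]
print(filter_operation_counters(issues))
```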

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 17:00:32 -05:00
10b548cd79 Update README with hourly execution schedule and recent improvements
- Document hourly execution (changed from daily)
- Add version 2.0 improvements section
- Document 10MB storage limit and automatic cleanup
- Clarify service configuration details

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 16:57:16 -05:00
fe832c42f3 Fix critical reliability and security issues in hwmonDaemon
Critical fixes implemented:
- Add 10MB storage limit with automatic cleanup of old history files
- Add file locking (fcntl) to prevent race conditions in history writes
- Disable SMART monitoring for unreliable Ridata drives
- Fix bare except clause in _read_ecc_count() to properly catch errors
- Add timeouts to all network and subprocess calls (10s for API, 30s for subprocess)
- Fix unchecked regex in ticket creation to prevent AttributeError
- Add JSON decode error handling for ticket API responses
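
As an illustration of the file-locking item, the sketch below appends a JSON record under an exclusive `fcntl.flock()` lock. The `append_history()` helper and the one-record-per-line layout are assumptions; the daemon's actual history format is not shown in this log.

```python
import fcntl
import json


def append_history(path: str, record: dict) -> None:
    """Append one JSON line to a history file while holding an exclusive lock."""
    with open(path, "a", encoding="utf-8") as handle:
        # Blocks until no other process holds the lock on this file.
        fcntl.flock(handle.fileno(), fcntl.LOCK_EX)
        try:
            handle.write(json.dumps(record) + "\n")
            handle.flush()  # make sure data is written before releasing the lock
        finally:
            fcntl.flock(handle.fileno(), fcntl.LOCK_UN)
```

`flock` locks are advisory, so they only help if every writer takes the lock in the same way. The `fcntl` module is available on Unix-like systems only.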

Service configuration improvements:
- hwmon.timer: Reduce jitter from 300s to 60s, add Persistent=true
- hwmon.service: Add Restart=on-failure, TimeoutStartSec=300, logging to journal

These changes improve reliability, prevent hung processes, eliminate race
conditions, and add proper error handling throughout the daemon.
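
The 10MB storage limit with automatic cleanup could look roughly like the sketch below, which deletes the oldest files first until the directory fits the budget. The `enforce_storage_limit()` helper and the oldest-first policy are assumptions; the commit message states only the limit and that old history files are cleaned up.

```python
import os
from typing import List

MAX_HISTORY_BYTES = 10 * 1024 * 1024  # 10MB limit from the commit message


def enforce_storage_limit(directory: str, limit: int = MAX_HISTORY_BYTES) -> List[str]:
    """Remove the oldest files in `directory` until the total size is within `limit`."""
    entries = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            stat = os.stat(path)
            entries.append((stat.st_mtime, stat.st_size, path))

    entries.sort()  # oldest modification time first
    total = sum(size for _, size, _ in entries)

    removed = []
    for _, size, path in entries:
        if total <= limit:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed
```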

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 16:55:48 -05:00
0577c7fc1b add api key support 2026-01-01 16:01:55 -05:00
cc62aabfe4 Merge branch 'main' of 10.10.10.63:LotusGuild/hwmonDaemon 2026-01-01 15:50:30 -05:00
546ef066f8 API Key Auth 2026-01-01 15:45:29 -05:00
9dc3b60a73 Update hwmon.service 2025-11-29 16:04:43 -05:00
0239d64ec3 Update hwmon.service 2025-11-25 20:29:52 -05:00
0326c5142e Updated hdd temp thresholds 2025-09-03 21:06:12 -04:00
0ab728da47 Better manufacturer detection and values 2025-09-03 13:14:43 -04:00
4b68b0b525 Added custom config for OOS12000G 2025-09-03 13:02:32 -04:00
2d6626cece Fixed thresholds for thermals and SMART 2025-09-03 12:58:30 -04:00
bc73a691df data retention and large refactor of codebase 2025-09-03 12:43:16 -04:00
3d902620b0 Removed unnecessary logging 2025-09-02 17:50:05 -04:00
cae4bf031b Updated priority system 2025-08-17 09:48:25 -04:00
fb1a9f67e1 Updated CPU threshold 2025-07-25 17:36:21 -04:00
0faf7654d6 Huge update to vendor profiles 2025-07-24 19:15:21 -04:00
a74c4c0309 Erase_Fail_Count matched two values 2025-06-24 15:14:35 -04:00
9a700e9853 Attempted fix for lxc storage 2025-05-29 20:23:21 -04:00
1371592b9e Update LXC storage utilization function 2025-05-29 20:16:50 -04:00
6907f71de1 Updated LXC storage checks 2025-05-29 19:50:17 -04:00
20eb1f9a11 firmware pattern matching 2025-05-29 19:30:06 -04:00
5ac12fd6b7 Correction of deleted code 2025-05-29 19:04:45 -04:00
1e6260a899 Better identification of RiData drives 2025-05-29 19:02:27 -04:00
95a5a8227a NoneType fix? 2025-05-29 12:44:55 -04:00
f8784eddd2 Added null safety checks 2025-05-29 11:44:07 -04:00
147947b8ca Testing manufacturer specific smart tests 2025-05-28 14:59:47 -04:00
22bdaa9401 Updated ticket priorities for different drive failures 2025-05-14 21:22:44 -04:00
40b7eb5641 Updated indices 2025-05-14 21:17:52 -04:00
6fb0d89519 LXC storage indices increased by 1 2025-05-14 21:13:09 -04:00
53b9169da2 test single node change 2025-05-14 21:07:59 -04:00
a34b59ad36 Updated drive firmware checks 2025-05-14 21:01:40 -04:00
0384270dfc Software failure not hardware 2025-05-12 16:12:46 -04:00
1f52a6b4f5 Full traceback to see where error is 2025-05-12 16:04:43 -04:00
c807a6309a Updated mountpoint catching 2025-05-12 16:00:24 -04:00
3d2fdac3f3 Attempt fix 1 2025-05-12 15:53:32 -04:00
af1121e3d9 Updated drive ticket creation 2025-05-12 15:47:14 -04:00
20f51e0b25 Updated lxc file system matching 2025-05-12 15:41:01 -04:00
4fe0a8dbfc Different variable for issue type 2025-05-12 15:35:42 -04:00
65ba24e46d if drive not in issue 2025-05-12 15:32:54 -04:00
e5175f53e5 debug description 2025-05-12 15:22:46 -04:00
bd6d89c4e3 Update make_box 2025-05-12 15:14:06 -04:00
2a6025f5f2 updated parsing 2025-03-09 22:25:29 -04:00
adafa796f1 adjusted lxc storage 2025-03-09 22:17:10 -04:00
f8ea49f099 updated parse size 2025-03-09 22:12:20 -04:00
6a6a400320 adjust regex 2025-03-09 22:09:24 -04:00