hwmonDaemon

Author	SHA1	Message	Date
jared	07782da7b6	Add HTTP health check endpoint on port 9102 Lightweight /health endpoint returns JSON with status, hostname, and last check timestamp. Runs as daemon thread, activated via --health-server flag or HEALTH_SERVER_ENABLED=true in .env config. Fixes: #21 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 13:15:15 -05:00
jared	b02e416117	Parallelize SMART health checks across drives with ThreadPoolExecutor Runs SMART checks concurrently (up to 8 workers) instead of sequentially, significantly reducing check time on multi-drive systems. Results are collected and processed in original disk order. Fixes: #22 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 13:13:50 -05:00
jared	7b36255fb4	Add graceful degradation when external tools are missing Checks availability of required (smartctl, lsblk) and optional (nvme, ceph, pct, dmidecode) tools at startup. Guards all tool-dependent code sections to skip gracefully with informative log messages instead of crashing. Also fixes pre-existing indentation bug in LXC exception handler. Fixes: #19 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 13:13:08 -05:00
jared	92bca248ac	Add deduplication clarification comments for Ceph ticket handling Explains that the ticket API deduplicates using SHA-256 hash of (category + tags + hostname + device), not description/timestamp. Clarifies the 24-hour dedup window and cluster-wide hostname exclusion. Fixes: #18 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 13:03:33 -05:00
jared	4a186fb6d6	Create replacement tickets for Ridata drives instead of silently skipping Ridata drives are known unreliable hardware. Instead of skipping them with no notification, flag as REPLACEMENT_NEEDED and create tickets recommending replacement. Resolves #13 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 13:01:24 -05:00
jared	90346a2da1	Replace fragile column-index LXC storage parsing with regex Use regex pattern matching instead of split()[N] indexing for parsing pct df output. This is more robust against variations in column formatting and whitespace. Resolves #11 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 13:00:50 -05:00
jared	308a8d5c5c	Cache drive details to eliminate redundant smartctl calls Add per-run cache for _get_drive_details() results. Each drive is queried once via smartctl -i and the result is reused across SMART health checks and ticket creation. Resolves #15 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 13:00:25 -05:00
jared	9f9cc1b763	Simplify disk detection to single lsblk call with full paths Replace dual-method detection (lsblk + glob scanning) with single lsblk -p call that returns full device paths directly. Adds timeout, returns sorted results for consistency. Resolves #14 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 12:59:49 -05:00
jared	ab67d786ce	Increase history storage limit to 50MB to match retention needs With 50 drives checked hourly over 30 days, history data can reach ~36MB which exceeded the old 10MB limit causing constant file churn. Increase to 50MB and make configurable via HISTORY_MAX_BYTES in .env. Resolves #12 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 12:59:24 -05:00
jared	da2de4375e	Add verbosity control with -v/--verbose flag Change default log level from DEBUG to INFO to reduce noise during hourly execution. Add --verbose/-v CLI flag to enable DEBUG logging when needed for troubleshooting. Resolves #16 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 12:58:43 -05:00
jared	38dd120da2	Add config validation for .env values Wrap all int() conversions in try/except to handle malformed .env values gracefully. Validate TICKET_API_KEY is not empty or placeholder value, logging a warning instead of raising to preserve dry-run compatibility. Resolves #17 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 12:58:02 -05:00
jared	7383a0c674	Escape special characters in Prometheus metric labels Add escape function to sanitize backslashes, double quotes, and newlines in label values per Prometheus text format spec. Prevents corrupted metrics output from model names or paths containing these characters. Resolves #10 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 12:57:37 -05:00
jared	a3cf5a698f	Add missing timeouts to all subprocess calls Add timeout=30 to smartctl -i calls in _get_drive_details() and _check_disk_firmware(), and dmidecode in _check_memory_usage(). Add TimeoutExpired handler in _check_disk_firmware(). Prevents potential hangs when drives or system tools become unresponsive. Resolves #9 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 12:57:17 -05:00
jared	c7309663de	Fix NVMe temperature parsing bug for values > 99°C Replace digits[:2] truncation with regex extraction of complete number. Previously "123°C" would be parsed as 12 instead of 123. Resolves #8 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 12:56:37 -05:00
jared	0559f2d668	Fix file locking race condition in SMART trend analysis Check both file existence AND size > 0 before opening in r+ mode. Previously, an empty file (0 bytes) would be opened in r+ mode, causing json.load() to fail on the empty content. Resolves #7 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 12:56:21 -05:00
jared	d79005eb42	Centralize hardcoded magic numbers into CONFIG dict Move NEW_DRIVE_HOURS_THRESHOLD (720h) and SMART_ERROR_RECENT_HOURS (168h) from inline literals to configurable CONFIG entries with .env support. Resolves #20 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 12:56:00 -05:00
jared	f44fce2ba7	deleted python cache	2026-02-06 21:31:35 -05:00
jared	058ea5ad06	Fix ticket overview: priority display, impact indicators, non-drive detail boxes - Replace emoji severity indicators (🔴🟡🟢⚪) with ASCII ([CRIT]/[WARN]/[LOW]/[??]) - Fix banner priority to show actual P1-P5 level instead of hardcoded HIGH/MEDIUM - Add LXC/container keyword detection to _get_issue_type() - Rewrite _get_impact_level() with storage/CPU awareness to avoid false Critical - Fix SMART description indentation with textwrap.dedent() - Fix drive age showing "0 years" for drives < 1 year old (now shows months) - Remove unused perf_metrics block - Add structured boxed sections for CPU, Network, Container, and Ceph tickets - Add _format_bytes_human() helper for LXC storage display Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-06 19:52:51 -05:00
jared	70b02de104	Fix ticket ASCII art alignment, Ceph classification, and cleanup - Fix all 9 box sections to produce exactly 80-char lines - Add Ceph/OSD keyword detection to _get_issue_type() - Make _get_impact_level() recognize HEALTH_WARN/HEALTH_ERR/DOWN - Remove unused make_box() method - Fix wrong Gitea URL (10.10.10.110 → 10.10.10.63) in README Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-06 19:08:56 -05:00
jared	509603843b	changed osd down events to be cluster-wide and deduplicated	2026-01-26 11:03:55 -05:00
jared	1e84144e29	Fix P1 escalation false positives and Ceph title spacing - Exclude manufacturer operation counters (Seek_Error_Rate, Command_Timeout, Raw_Read_Error_Rate) from critical issue count to prevent false P1 escalation - Fix missing space after [ceph] tag in ticket titles Before: [hostname][auto][ceph]Ceph HEALTH_WARN After: [hostname][auto][ceph] Ceph HEALTH_WARN Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-17 15:59:58 -05:00
jared	6d959eff02	Fix duplicate [ceph] tag in ticket titles Remove [ceph] marker from issue text since _categorize_issue already adds it as the issue_tag. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-17 15:56:25 -05:00
jared	0f8918fb8b	Add Ceph cluster monitoring and Prometheus metrics export - Add comprehensive Ceph cluster health monitoring - Check cluster health status (HEALTH_OK/WARN/ERR) - Monitor cluster usage with configurable thresholds - Track OSD status (up/down) per node - Separate cluster-wide vs node-specific issues - Cluster-wide ticket deduplication - Add [cluster-wide] scope tag for Ceph issues - Cluster-wide issues deduplicate across all nodes - Node-specific issues (OSD down) include hostname - Add Prometheus metrics export - export_prometheus_metrics() method - write_prometheus_metrics() for textfile collector - --metrics CLI flag to output metrics to stdout - --export-json CLI flag to export health report as JSON - Add Grafana dashboard template (grafana-dashboard.json) - Add .gitignore Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-17 15:54:16 -05:00
jared	3322c5878a	Upgrade priority system and fix ticket type alignment - Add P5 (LOW) priority for informational/minimal impact alerts - Expand ISSUE_PRIORITIES from 7 to 40+ comprehensive mappings - Fix TICKET_TYPES to match tinker_tickets API (Issue, Problem, Task, Maintenance, Upgrade, Install, Request) - Fix TICKET_CATEGORIES to only Hardware and Software - Add P1 escalation logic via _count_critical_issues() helper - Rewrite _determine_ticket_priority() with full P1-P5 support - Add CONFIG options: INCLUDE_INFO_TICKETS, PRIORITY_ESCALATION_THRESHOLD - Filter INFO-level alerts from ticket creation by default - Update _categorize_issue() to use valid ticket types Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-17 15:24:35 -05:00
jared	87b16ca822	Update README.md	2026-01-12 16:38:36 -05:00
jared	0f81d015cd	Implement proper ticket categorization based on issue type Added intelligent categorization to match tickets with correct category and type instead of defaulting everything to Hardware/Problem. Changes: - Added TICKET_CATEGORIES and TICKET_TYPES mappings for API consistency - Created _categorize_issue() method to determine proper classification: Hardware Issues: - SMART/drive/disk errors → Hardware + Incident (critical/failed) - SMART warnings → Hardware + Problem (needs investigation) Software Issues: - LXC/container/storage usage/CPU → Software category - Critical levels → Software + Incident (service degradation) - Warning levels → Software + Problem (preventive investigation) Network Issues: - Network failures/unreachable → Network + Incident - Network warnings → Network + Problem - Updated ticket creation to use _categorize_issue() and _determine_ticket_priority() - Tickets now have correct tags: [incident] vs [problem] instead of always [maintenance] - Category field in API payload now matches issue type (Hardware/Software/Network) - Type field in API payload now reflects actual situation (Incident/Problem/Task) Examples: - "LXC storage usage >80%" → Software + Problem - "Critical SMART errors" → Hardware + Incident - "High CPU usage" → Software + Problem - "Network unreachable" → Network + Incident Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-08 13:26:17 -05:00
jared	88afc8f03e	Improve ASCII art formatting in ticket descriptions Fixed all box alignment issues and improved visual consistency: - Standardized box width to 78 chars across all sections - Unified field width calculations (62 chars for values) - Fixed executive summary box with proper dynamic width - Fixed drive specifications box alignment - Fixed drive timeline box with proper field widths - Fixed SMART status box and improved temperature handling (None check) - Fixed SMART attributes box with consistent widths - Improved partition boxes: - 50-char usage meter (2% per block) instead of 20-char - Added percentage display next to meter - Truncate long mountpoints in header to prevent overflow - Consistent field widths across all fields - Fixed firmware alerts box alignment All boxes now use consistent Unicode box-drawing characters (┏━┓┃┗┛│) with proper width calculations for perfect alignment. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-07 19:44:21 -05:00
jared	63daa57d80	Fix missing drive capacity in ticket titles Problem: Drive capacity was being extracted but never inserted into ticket titles. The drive_size variable was calculated from drive details but omitted from the ticket_title string construction. Solution: Added drive_size to ticket title format between category and issue. Example ticket titles now show: - Before: "[hostname][auto][hardware]Drive /dev/sda has SMART issues..." - After: "[hostname][auto][hardware][16.0 TB] Drive /dev/sda has SMART issues..." This makes it easier to identify which drives need attention at a glance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-06 17:15:02 -05:00
jared	72e61bd94e	Add manual execution instructions to README Added comprehensive manual execution section with both: - Direct execution from local file (python3 hwmonDaemon.py) - Remote execution matching systemd service (one-liner download+exec) Both modes include dry-run and normal execution examples for testing and production use. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-06 17:03:27 -05:00
jared	841db13459	Fix false positive ticket creation for manufacturer operation counters Problem: Seagate drives were triggering tickets for "Critical Seek_Error_Rate" and "Critical Command_Timeout" even though these are operation counters used by the manufacturer, not actual errors. Solution: Added filtering in _detect_issues() method to skip known manufacturer operation counters: - Seek_Error_Rate (Seagate/WD operation counter) - Command_Timeout (OOS/Seagate operation counter) - Raw_Read_Error_Rate (Seagate/WD operation counter) These attributes are already correctly excluded from monitoring in manufacturer profiles, but were still appearing in smart_issues list. This fix prevents them from creating tickets while still catching legitimate SMART errors. Changes: - hwmonDaemon.py:1351-1378 - Added operation counter filtering in _detect_issues() - Added debug logging when filtering manufacturer counters 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-06 17:00:32 -05:00
jared	10b548cd79	Update README with hourly execution schedule and recent improvements - Document hourly execution (changed from daily) - Add version 2.0 improvements section - Document 10MB storage limit and automatic cleanup - Clarify service configuration details 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-06 16:57:16 -05:00
jared	fe832c42f3	Fix critical reliability and security issues in hwmonDaemon Critical fixes implemented: - Add 10MB storage limit with automatic cleanup of old history files - Add file locking (fcntl) to prevent race conditions in history writes - Disable SMART monitoring for unreliable Ridata drives - Fix bare except clause in _read_ecc_count() to properly catch errors - Add timeouts to all network and subprocess calls (10s for API, 30s for subprocess) - Fix unchecked regex in ticket creation to prevent AttributeError - Add JSON decode error handling for ticket API responses Service configuration improvements: - hwmon.timer: Reduce jitter from 300s to 60s, add Persistent=true - hwmon.service: Add Restart=on-failure, TimeoutStartSec=300, logging to journal These changes improve reliability, prevent hung processes, eliminate race conditions, and add proper error handling throughout the daemon. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-06 16:55:48 -05:00
jared	0577c7fc1b	add api key support	2026-01-01 16:01:55 -05:00
jared	cc62aabfe4	Merge branch 'main' of 10.10.10.63:LotusGuild/hwmonDaemon	2026-01-01 15:50:30 -05:00
jared	546ef066f8	API Key Auth	2026-01-01 15:45:29 -05:00
jared	9dc3b60a73	Update hwmon.service	2025-11-29 16:04:43 -05:00
jared	0239d64ec3	Update hwmon.service	2025-11-25 20:29:52 -05:00
jared	0326c5142e	Updated hdd temp thresholds	2025-09-03 21:06:12 -04:00
jared	0ab728da47	Better manufacturer detection and values	2025-09-03 13:14:43 -04:00
jared	4b68b0b525	Added custom config for OOS12000G	2025-09-03 13:02:32 -04:00
jared	2d6626cece	Fixed thresholds for thermals and SMART	2025-09-03 12:58:30 -04:00
jared	bc73a691df	data retention and large refactor of codebase	2025-09-03 12:43:16 -04:00
jared	3d902620b0	Removed unnecessary logging	2025-09-02 17:50:05 -04:00
jared	cae4bf031b	Updated priority system	2025-08-17 09:48:25 -04:00
jared	fb1a9f67e1	Updated CPU threshold	2025-07-25 17:36:21 -04:00
jared	0faf7654d6	Huge update to vendor profiles	2025-07-24 19:15:21 -04:00
jared	a74c4c0309	Erase_Fail_Count matched two values	2025-06-24 15:14:35 -04:00
jared	9a700e9853	Attempted fix for lxc storage	2025-05-29 20:23:21 -04:00
jared	1371592b9e	Update LXC storage utilization function	2025-05-29 20:16:50 -04:00
jared	6907f71de1	Updated LXC storage checks	2025-05-29 19:50:17 -04:00
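
Commit 7383a0c674 describes escaping backslashes, double quotes, and newlines in Prometheus label values. A minimal sketch of that kind of escaping follows. The function names `escape_label_value` and `format_metric`, and the metric name in the usage note, are hypothetical and are not taken from hwmonDaemon.py.

```python
def escape_label_value(value):
    """Escape a value for use inside a Prometheus text-format label.

    Backslashes are handled first so that the backslashes introduced
    by the later replacements are not escaped a second time.
    """
    return (
        str(value)
        .replace("\\", "\\\\")   # backslash    -> \\
        .replace('"', '\\"')     # double quote -> \"
        .replace("\n", "\\n")    # newline      -> \n
    )


def format_metric(name, labels, value):
    """Render one metric sample line (hypothetical helper for illustration)."""
    label_text = ",".join(
        f'{key}="{escape_label_value(val)}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_text}}} {value}"
```

For example, `format_metric("hwmon_drive_temp", {"device": "/dev/sda", "model": 'X "Pro"'}, 35)` would emit the model label with its inner quotes escaped, so a model string containing quotes no longer breaks the sample line.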


192 Commits
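
Commit c7309663de replaces a `digits[:2]` truncation with a regex that extracts the complete number, so that "123°C" parses as 123 rather than 12. The sketch below contrasts the two approaches. Both function names are hypothetical, and the "old" version is only a guess at the shape of the previous code based on the commit message.

```python
import re


def parse_temperature_truncated(text):
    """Assumed shape of the old behaviour: keep only the first two digits."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits[:2]) if digits else None


def parse_temperature(text):
    """Extract the first complete run of digits as an integer.

    Returns None when the text contains no digits at all.
    """
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None
```

With the input "123°C", the truncating version yields 12 while the regex version yields 123. Two-digit readings such as "38°C" give the same result with either approach, which is likely why the bug only appeared above 99°C.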