Commit Graph

190 Commits

Author SHA1 Message Date
jared 90dd8f3390 fix: calibrate SMART thresholds per manufacturer to eliminate false positives
Investigated all 7 pending drive tickets in the ticketing DB. Identified
3 confirmed false positives and 1 parsing bug. Implemented manufacturer-
specific SMART profiles and a systemic substring-match fix.

Changes:
- Seagate: disable Seek_Error_Rate (packed counter), add High_Fly_Writes
  profile threshold (100/500 vs the old 1/5), disable Command_Timeout
  (packed 3-part 48-bit format on Exos series)
- Western Digital: disable Command_Timeout (same packed format)
- Toshiba: new profile covering MG04-MG10 enterprise and MQ01-MQ04
  consumer series; disable Raw/Seek counters, keep Command_Timeout with
  raised thresholds (1000/5000) since MG-series uses a real simple count;
  add model-prefix detection so MG08ACP16TE etc. match without "TOSHIBA"
  in the model string
- OOS: add OOS14000G alias (fleet has both 12TB and 14TB variants);
  replace billion-scale Command_Timeout threshold with monitor:False
- Samsung: disable Program_Fail_Cnt_Total (attr 181, vendor-encoded),
  Erase_Fail_Count_Chip (attrs 172/176, chip-level internal counter),
  Program_Fail_Count_Chip (attr 171); disable generic Erase_Fail_Count
  and Program_Fail_Count to prevent bleed-through from _Chip lines

Bug fixes:
- Fix substring match: 'Erase_Fail_Count' was matching
  'Erase_Fail_Count_Chip' lines in both the first-pass and main attribute
  loops. Changed to token-boundary check (attr + ' ') in both places.
- Add 32-bit overflow guard: raw SMART values > 0xFFFFFFFF are skipped
  at threshold comparison. Catches 0xFFFFFFFFFFFF sentinel values from
  unrecognized drives (was generating Critical Program_Fail_Cnt_Total
  tickets with value 281474976710655).
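
A minimal sketch of the two fixes above. The helper names and the sample
smartctl-style lines are hypothetical and not taken from hwmonDaemon.py;
only the (attr + ' ') check and the 0xFFFFFFFF bound come from this commit.

```python
from typing import Optional

# Raw values above this bound are treated as packed or sentinel data.
MAX_RAW_VALUE = 0xFFFFFFFF


def line_matches_attr(line: str, attr: str) -> bool:
    """Match `attr` only when it is followed by a space (token boundary)."""
    return (attr + " ") in line


def usable_raw_value(raw: int) -> Optional[int]:
    """Return the raw value, or None when it exceeds the 32-bit guard."""
    return None if raw > MAX_RAW_VALUE else raw


# Hypothetical attribute lines, for illustration only.
chip_line = "176 Erase_Fail_Count_Chip 0x0032 100 100 000 Old_age Always - 12"
plain_line = "172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0"

print(line_matches_attr(chip_line, "Erase_Fail_Count"))   # _Chip line is not matched
print(line_matches_attr(plain_line, "Erase_Fail_Count"))  # exact attribute is matched
print(usable_raw_value(0xFFFFFFFFFFFF))                   # sentinel is skipped
```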

BASE_SMART_THRESHOLDS:
- High_Fly_Writes: 1/5 -> 100/500
- Program_Fail_Cnt_Total: 1/5 -> 50/200
- Erase_Fail_Count_Total: 1/5 -> 50/200

Global filtered_issues: removed Seek_Error_Rate and Command_Timeout
(now handled per-profile); Raw_Read_Error_Rate kept as catch-all.

Verified with --dry-run on all 4 servers: compute-storage-01, large1,
compute-storage-gpu-01, pbs. Only legitimate issues surface.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 10:09:54 -04:00
jared 607dea3186 Strip volatile values from ticket titles (usage %, OSD counts)
- LXC/ZFS storage usage percentages: "usage high: 80.1%" → "usage high"
- LXC "high storage usage: 80.1% on /mnt" → "high storage usage on /mnt"
- Ceph BlueStore OSD counts: "2 OSD(s) experiencing slow" →
  "OSD(s) experiencing slow" (count fluctuates every run)

These changing values embedded in titles were triggering a "Title updated"
comment on every hourly run even though nothing meaningfully changed.
Values are fully retained in the ticket description.
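
One way the title normalization could look, as a sketch only. The function
name and the regular expressions are assumptions; the before/after strings
are the ones listed in this commit.

```python
import re


def stable_title(title: str) -> str:
    """Remove values that change between runs from a ticket title."""
    # Drop ": 80.1%" style usage percentages.
    title = re.sub(r":\s*\d+(?:\.\d+)?%", "", title)
    # Drop the leading count in "2 OSD(s) experiencing slow".
    title = re.sub(r"\b\d+\s+(?=OSD\(s\))", "", title)
    return title


print(stable_title("high storage usage: 80.1% on /mnt"))
print(stable_title("2 OSD(s) experiencing slow"))
```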

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 08:29:42 -04:00
jared 26e2d1cec8 Strip volatile SMART counters from ticket title to stop comment spam
Power_On_Hours and other SMART counters embedded in the issue string
were included verbatim in the ticket title. Since the count increments
every hour, the title was "new" on every run, triggering a title-update
comment every single cycle (307 spam comments on two tickets).

Strip ': Warning <attr>: <N>' / ': Critical <attr>: <N>' suffixes from
the title before building the ticket payload. Counter values are still
fully captured in the ticket description.
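
A sketch of the suffix stripping, assuming the issue string has the shape
"<prefix>: Warning <attr>: <N>". The pattern and function name are
illustrative; the actual expression in hwmonDaemon.py may differ.

```python
import re

# Everything from the first ": Warning <attr>: <N>" or
# ": Critical <attr>: <N>" to the end of the string is treated as volatile.
_COUNTER_SUFFIX = re.compile(r":\s*(?:Warning|Critical)\s+\w+:\s*\d+.*$")


def strip_counter_suffix(title: str) -> str:
    """Return the title without trailing SMART counter values."""
    return _COUNTER_SUFFIX.sub("", title)


print(strip_counter_suffix(
    "Drive Z4ZC4B6R has SMART issues: Warning Power_On_Hours: 43210"
))
```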

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 08:16:35 -04:00
jared c0c96bf003 Fix description template strings not rendering (missing f-string prefix)
All multi-line ASCII art blocks in _generate_ticket_description were
regular strings, not f-strings — so {hostname}, {'━' * box_width},
{cluster_health}, etc. were sent as literal template text instead of
rendered values. Added f prefix to all affected triple-quoted strings:
banner, executive_summary, DRIVE SPECIFICATIONS, DRIVE TIMELINE,
SMART STATUS, PARTITION, CEPH CLUSTER STATUS, CPU STATUS,
NETWORK STATUS, CONTAINER STORAGE.
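
The difference is easy to reproduce in isolation. In this small
illustration the hostname and box_width values are made up:

```python
hostname = "compute-storage-01"  # sample value
box_width = 10                   # sample value

# Without the f prefix the braces are sent as literal text.
plain = """Host: {hostname}
{'━' * box_width}"""

# With the f prefix the expressions are evaluated.
rendered = f"""Host: {hostname}
{'━' * box_width}"""

print(plain)
print(rendered)
```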

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 08:11:09 -04:00
jared cbbafa05c2 ci: add flake8 lint workflow; fix unused imports and f-string issues
Adds .gitea/workflows/lint.yml running flake8 with .flake8 config.
Removes unused sys/urllib.request imports (F401).
Removes f prefix from 52 f-strings that had no placeholders (F541).
Auto-fixes trailing whitespace in blank lines (W293) via autopep8.
Fixes over-indentation in LXC storage check try block (E117).
Config ignores F841 (unused locals) and E501 (long lines).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 22:27:15 -04:00
jared 03320c0ece Use drive serial numbers instead of device paths for ticket dedup
Device paths like /dev/sdg are assigned by the kernel at boot and can
change after hot-swaps or reboots, causing duplicate tickets for the
same physical drive under a new letter.

Changes:
- _detect_issues(): issue strings now use serial number (e.g.
  "Drive Z4ZC4B6R has SMART issues: ...") falling back to device
  path only if smartctl cannot return a serial
- _create_tickets_for_issues(): capacity lookup resolves serial →
  device via the details cache instead of regex on the issue string;
  serial is included in the API payload as a dedicated field
- _generate_detailed_description(): drive lookup uses serial match
  instead of /dev/ regex

The tinker_tickets API uses the serial field in the dedup hash so
the same physical drive always maps to the same ticket regardless
of which /dev/sdX letter it occupies.
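
A sketch of the serial-first fallback described above. Both helper names
are hypothetical; only the issue string format comes from this commit.

```python
from typing import Optional


def drive_identifier(serial: Optional[str], device_path: str) -> str:
    """Prefer the serial number; fall back to the device path."""
    serial = (serial or "").strip()
    return serial if serial else device_path


def smart_issue_text(serial: Optional[str], device_path: str, detail: str) -> str:
    """Build an issue string keyed on the stable identifier."""
    return f"Drive {drive_identifier(serial, device_path)} has SMART issues: {detail}"


# The same physical drive yields the same text under a new device letter.
print(smart_issue_text("Z4ZC4B6R", "/dev/sdg", "sample detail"))
print(smart_issue_text("Z4ZC4B6R", "/dev/sdh", "sample detail"))
```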

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 18:54:26 -04:00
jared d1750ea6cf Add Proxmox Backup Server (PBS) health monitoring support
Monitors ZFS pool status/usage and failed PBS tasks (backup, GC, sync).
Includes configurable thresholds (PBS_ZFS_WARNING/CRITICAL), Prometheus
metrics (hwmon_pbs_*), dry-run summary, issue categorization, and
priority classification. Enabled via PBS_ENABLED=true in .env config.

Fixes: #5

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:18:41 -05:00
jared 07782da7b6 Add HTTP health check endpoint on port 9102
Lightweight /health endpoint returns JSON with status, hostname, and
last check timestamp. Runs as daemon thread, activated via --health-server
flag or HEALTH_SERVER_ENABLED=true in .env config.
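
A minimal sketch of such an endpoint using only the standard library. The
JSON field names, the loopback bind address, and the module-level
LAST_CHECK variable are assumptions, not the daemon's actual code.

```python
import json
import socket
import threading
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed to be updated by the main loop after each health check run.
LAST_CHECK = datetime.now(timezone.utc).isoformat()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps({
            "status": "ok",
            "hostname": socket.gethostname(),
            "last_check": LAST_CHECK,
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        """Silence per-request logging."""


def start_health_server(port: int = 9102) -> HTTPServer:
    """Serve /health from a daemon thread and return the server object."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the thread is a daemon thread, it does not keep the process alive
once the main check loop exits.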

Fixes: #21

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:15:15 -05:00
jared b02e416117 Parallelize SMART health checks across drives with ThreadPoolExecutor
Runs SMART checks concurrently (up to 8 workers) instead of
sequentially, significantly reducing check time on multi-drive systems.
Results are collected and processed in original disk order.
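
A sketch of the pattern, with check_drive as a hypothetical stand-in for
the real per-drive SMART check. Executor.map yields results in input
order, which is one way to keep the original disk order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def check_drive(device: str) -> Dict[str, object]:
    """Stand-in for the per-drive SMART check (hypothetical)."""
    return {"device": device, "healthy": True}


def check_all_drives(devices: List[str], max_workers: int = 8) -> List[Dict[str, object]]:
    """Run checks concurrently; results come back in the order of `devices`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(check_drive, devices))


results = check_all_drives(["/dev/sda", "/dev/sdb", "/dev/sdc"])
print([r["device"] for r in results])
```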

Fixes: #22

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:13:50 -05:00
jared 7b36255fb4 Add graceful degradation when external tools are missing
Checks availability of required (smartctl, lsblk) and optional (nvme,
ceph, pct, dmidecode) tools at startup. Guards all tool-dependent code
sections to skip gracefully with informative log messages instead of
crashing. Also fixes pre-existing indentation bug in LXC exception handler.
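
A sketch of a startup probe using shutil.which. The tool lists match this
commit message; the function name and return shape are assumptions.

```python
import shutil
from typing import Dict, Iterable

REQUIRED_TOOLS = ("smartctl", "lsblk")
OPTIONAL_TOOLS = ("nvme", "ceph", "pct", "dmidecode")


def probe_tools(names: Iterable[str]) -> Dict[str, bool]:
    """Map each tool name to whether it is found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


available = probe_tools(REQUIRED_TOOLS + OPTIONAL_TOOLS)
for name, found in available.items():
    if not found:
        print(f"{name} not found; dependent checks would be skipped")
```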

Fixes: #19

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:13:08 -05:00
jared 92bca248ac Add deduplication clarification comments for Ceph ticket handling
Explains that the ticket API deduplicates using SHA-256 hash of
(category + tags + hostname + device), not description/timestamp.
Clarifies the 24-hour dedup window and cluster-wide hostname exclusion.
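
A sketch of how such a key could be derived. The hash is computed by the
ticket API, not by this daemon, so the field order, separator, and tag
sorting shown here are assumptions for illustration only.

```python
import hashlib
from typing import Iterable


def dedup_key(category: str, tags: Iterable[str], hostname: str, device: str) -> str:
    """SHA-256 over category, tags, hostname, and device (assumed layout)."""
    parts = [category, ",".join(sorted(tags)), hostname, device]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


# Cluster-wide issues leave the hostname empty, so every node reporting
# the same issue produces the same key.
node_a = dedup_key("Hardware", ["ceph", "cluster-wide"], "", "")
node_b = dedup_key("Hardware", ["cluster-wide", "ceph"], "", "")
print(node_a == node_b)
```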

Fixes: #18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:03:33 -05:00
jared 4a186fb6d6 Create replacement tickets for Ridata drives instead of silently skipping
Ridata drives are known unreliable hardware. Instead of skipping them
with no notification, flag as REPLACEMENT_NEEDED and create tickets
recommending replacement.

Resolves #13

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:01:24 -05:00
jared 90346a2da1 Replace fragile column-index LXC storage parsing with regex
Use regex pattern matching instead of split()[N] indexing for parsing
pct df output. This is more robust against variations in column
formatting and whitespace.

Resolves #11

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:00:50 -05:00
jared 308a8d5c5c Cache drive details to eliminate redundant smartctl calls
Add per-run cache for _get_drive_details() results. Each drive is
queried once via smartctl -i and the result is reused across SMART
health checks and ticket creation.

Resolves #15

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 13:00:25 -05:00
jared 9f9cc1b763 Simplify disk detection to single lsblk call with full paths
Replace dual-method detection (lsblk + glob scanning) with single
lsblk -p call that returns full device paths directly. Adds timeout,
returns sorted results for consistency.
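
A sketch of single-call detection. Only the -p flag is stated in this
commit; the other lsblk options (-d, -n, -o NAME,TYPE), the 30-second
timeout, and the function names are assumptions.

```python
import subprocess
from typing import List


def parse_lsblk(output: str) -> List[str]:
    """Keep whole-disk entries from 'NAME TYPE' rows, sorted by path."""
    disks = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "disk":
            disks.append(parts[0])
    return sorted(disks)


def list_disks(timeout: int = 30) -> List[str]:
    """Query lsblk once; -p makes NAME a full path such as /dev/sda."""
    result = subprocess.run(
        ["lsblk", "-d", "-n", "-p", "-o", "NAME,TYPE"],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return parse_lsblk(result.stdout)


# Hypothetical output, for illustration only.
sample = "/dev/sdb disk\n/dev/sda disk\n/dev/sr0 rom\n"
print(parse_lsblk(sample))
```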

Resolves #14

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:59:49 -05:00
jared ab67d786ce Increase history storage limit to 50MB to match retention needs
With 50 drives checked hourly over 30 days, history data can reach ~36MB
which exceeded the old 10MB limit causing constant file churn. Increase
to 50MB and make configurable via HISTORY_MAX_BYTES in .env.
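
The estimate can be checked with quick arithmetic. The per-record size is
not stated in this commit; roughly 1 KB per record is what the ~36MB
figure implies. Whether the limits use 1024-based or 1000-based megabytes
is also an assumption here.

```python
DRIVES = 50
CHECKS_PER_DAY = 24
RETENTION_DAYS = 30

records = DRIVES * CHECKS_PER_DAY * RETENTION_DAYS

# ~36MB from the commit message, taken as 36 * 10^6 bytes.
estimated_total_bytes = 36 * 1000 * 1000
implied_bytes_per_record = estimated_total_bytes / records

OLD_LIMIT_BYTES = 10 * 1024 * 1024
NEW_LIMIT_BYTES = 50 * 1024 * 1024

print(records)                                   # 36000
print(implied_bytes_per_record)                  # 1000.0
print(estimated_total_bytes > OLD_LIMIT_BYTES)   # True
print(estimated_total_bytes < NEW_LIMIT_BYTES)   # True
```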

Resolves #12

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:59:24 -05:00
jared da2de4375e Add verbosity control with -v/--verbose flag
Change default log level from DEBUG to INFO to reduce noise during
hourly execution. Add --verbose/-v CLI flag to enable DEBUG logging
when needed for troubleshooting.

Resolves #16

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:58:43 -05:00
jared 38dd120da2 Add config validation for .env values
Wrap all int() conversions in try/except to handle malformed .env values
gracefully. Validate TICKET_API_KEY is not empty or placeholder value,
logging a warning instead of raising to preserve dry-run compatibility.
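
A sketch of the guarded conversion. The helper name and the dict-based
config source are assumptions; HISTORY_MAX_BYTES is used only as a sample
key.

```python
import logging
from typing import Mapping, Optional

logger = logging.getLogger("hwmon-config-sketch")


def env_int(values: Mapping[str, Optional[str]], key: str, default: int) -> int:
    """Return int(values[key]), or `default` when missing or malformed."""
    raw = values.get(key)
    if raw is None:
        return default
    try:
        return int(raw)
    except (TypeError, ValueError):
        logger.warning("Invalid integer for %s: %r; using %d", key, raw, default)
        return default


env = {"HISTORY_MAX_BYTES": "not-a-number"}
print(env_int(env, "HISTORY_MAX_BYTES", 52428800))
```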

Resolves #17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:58:02 -05:00
jared 7383a0c674 Escape special characters in Prometheus metric labels
Add escape function to sanitize backslashes, double quotes, and newlines
in label values per Prometheus text format spec. Prevents corrupted
metrics output from model names or paths containing these characters.
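
A sketch of the escaping. The function name, metric name, and label are
illustrative. Backslashes are replaced first so the backslashes added for
quotes and newlines are not escaped a second time.

```python
def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


model = 'Model "X" rev\\2'
print(f'hwmon_example_metric{{model="{escape_label_value(model)}"}} 1')
```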

Resolves #10

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:57:37 -05:00
jared a3cf5a698f Add missing timeouts to all subprocess calls
Add timeout=30 to smartctl -i calls in _get_drive_details() and
_check_disk_firmware(), and dmidecode in _check_memory_usage().
Add TimeoutExpired handler in _check_disk_firmware(). Prevents
potential hangs when drives or system tools become unresponsive.
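
A sketch of the timeout pattern. The helper is hypothetical, and the child
process is a Python one-liner standing in for smartctl or dmidecode.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(cmd: List[str], timeout: float = 30) -> Optional[str]:
    """Return stdout, or None if the command does not finish in time."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


# A command that sleeps longer than the timeout is abandoned.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(slow, timeout=0.5))
```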

Resolves #9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:57:17 -05:00
jared c7309663de Fix NVMe temperature parsing bug for values > 99°C
Replace digits[:2] truncation with regex extraction of complete number.
Previously "123°C" would be parsed as 12 instead of 123.
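
The old and new behavior side by side, as a sketch. The function names
are illustrative, and the old implementation is reconstructed from the
description above.

```python
import re
from typing import Optional


def parse_temp_truncating(text: str) -> int:
    """Old behavior: keep only the first two digits."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits[:2])


def parse_temp(text: str) -> Optional[int]:
    """New behavior: extract the first complete run of digits."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(parse_temp_truncating("123°C"))
print(parse_temp("123°C"))
```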

Resolves #8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:37 -05:00
jared 0559f2d668 Fix file locking race condition in SMART trend analysis
Check both file existence AND size > 0 before opening in r+ mode.
Previously, an empty file (0 bytes) would be opened in r+ mode, causing
json.load() to fail on the empty content.
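
A sketch of the guard. The function name and the empty-list default are
assumptions, and file locking is omitted to keep the example short.

```python
import json
import os
import tempfile


def load_history(path: str) -> list:
    """Read JSON history only when the file exists and is non-empty."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r+") as fh:
            return json.load(fh)
    return []


# An empty file no longer reaches json.load().
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp:
    empty_path = tmp.name
print(load_history(empty_path))
os.remove(empty_path)
```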

Resolves #7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:21 -05:00
jared d79005eb42 Centralize hardcoded magic numbers into CONFIG dict
Move NEW_DRIVE_HOURS_THRESHOLD (720h) and SMART_ERROR_RECENT_HOURS (168h)
from inline literals to configurable CONFIG entries with .env support.

Resolves #20

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 12:56:00 -05:00
jared 058ea5ad06 Fix ticket overview: priority display, impact indicators, non-drive detail boxes
- Replace emoji severity indicators (🔴🟡🟢) with ASCII ([CRIT]/[WARN]/[LOW]/[??])
- Fix banner priority to show actual P1-P5 level instead of hardcoded HIGH/MEDIUM
- Add LXC/container keyword detection to _get_issue_type()
- Rewrite _get_impact_level() with storage/CPU awareness to avoid false Critical
- Fix SMART description indentation with textwrap.dedent()
- Fix drive age showing "0 years" for drives < 1 year old (now shows months)
- Remove unused perf_metrics block
- Add structured boxed sections for CPU, Network, Container, and Ceph tickets
- Add _format_bytes_human() helper for LXC storage display

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 19:52:51 -05:00
jared 70b02de104 Fix ticket ASCII art alignment, Ceph classification, and cleanup
- Fix all 9 box sections to produce exactly 80-char lines
- Add Ceph/OSD keyword detection to _get_issue_type()
- Make _get_impact_level() recognize HEALTH_WARN/HEALTH_ERR/DOWN
- Remove unused make_box() method
- Fix wrong Gitea URL (10.10.10.110 → 10.10.10.63) in README

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 19:08:56 -05:00
jared 509603843b changed osd down events to be cluster wide and deduplicated 2026-01-26 11:03:55 -05:00
jared 1e84144e29 Fix P1 escalation false positives and Ceph title spacing
- Exclude manufacturer operation counters (Seek_Error_Rate,
  Command_Timeout, Raw_Read_Error_Rate) from critical issue
  count to prevent false P1 escalation

- Fix missing space after [ceph] tag in ticket titles
  Before: [hostname][auto][ceph]Ceph HEALTH_WARN
  After:  [hostname][auto][ceph] Ceph HEALTH_WARN

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:59:58 -05:00
jared 6d959eff02 Fix duplicate [ceph] tag in ticket titles
Remove [ceph] marker from issue text since _categorize_issue
already adds it as the issue_tag.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:56:25 -05:00
jared 0f8918fb8b Add Ceph cluster monitoring and Prometheus metrics export
- Add comprehensive Ceph cluster health monitoring
  - Check cluster health status (HEALTH_OK/WARN/ERR)
  - Monitor cluster usage with configurable thresholds
  - Track OSD status (up/down) per node
  - Separate cluster-wide vs node-specific issues

- Cluster-wide ticket deduplication
  - Add [cluster-wide] scope tag for Ceph issues
  - Cluster-wide issues deduplicate across all nodes
  - Node-specific issues (OSD down) include hostname

- Add Prometheus metrics export
  - export_prometheus_metrics() method
  - write_prometheus_metrics() for textfile collector
  - --metrics CLI flag to output metrics to stdout
  - --export-json CLI flag to export health report as JSON

- Add Grafana dashboard template (grafana-dashboard.json)
- Add .gitignore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:54:16 -05:00
jared 3322c5878a Upgrade priority system and fix ticket type alignment
- Add P5 (LOW) priority for informational/minimal impact alerts
- Expand ISSUE_PRIORITIES from 7 to 40+ comprehensive mappings
- Fix TICKET_TYPES to match tinker_tickets API (Issue, Problem, Task,
  Maintenance, Upgrade, Install, Request)
- Fix TICKET_CATEGORIES to only Hardware and Software
- Add P1 escalation logic via _count_critical_issues() helper
- Rewrite _determine_ticket_priority() with full P1-P5 support
- Add CONFIG options: INCLUDE_INFO_TICKETS, PRIORITY_ESCALATION_THRESHOLD
- Filter INFO-level alerts from ticket creation by default
- Update _categorize_issue() to use valid ticket types

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:24:35 -05:00
jared 0f81d015cd Implement proper ticket categorization based on issue type
Added intelligent categorization to match tickets with correct category and type
instead of defaulting everything to Hardware/Problem.

Changes:
- Added TICKET_CATEGORIES and TICKET_TYPES mappings for API consistency
- Created _categorize_issue() method to determine proper classification:

  Hardware Issues:
  - SMART/drive/disk errors → Hardware + Incident (critical/failed)
  - SMART warnings → Hardware + Problem (needs investigation)

  Software Issues:
  - LXC/container/storage usage/CPU → Software category
  - Critical levels → Software + Incident (service degradation)
  - Warning levels → Software + Problem (preventive investigation)

  Network Issues:
  - Network failures/unreachable → Network + Incident
  - Network warnings → Network + Problem

- Updated ticket creation to use _categorize_issue() and _determine_ticket_priority()
- Tickets now have correct tags: [incident] vs [problem] instead of always [maintenance]
- Category field in API payload now matches issue type (Hardware/Software/Network)
- Type field in API payload now reflects actual situation (Incident/Problem/Task)

Examples:
- "LXC storage usage >80%" → Software + Problem
- "Critical SMART errors" → Hardware + Incident
- "High CPU usage" → Software + Problem
- "Network unreachable" → Network + Incident

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-08 13:26:17 -05:00
jared 88afc8f03e Improve ASCII art formatting in ticket descriptions
Fixed all box alignment issues and improved visual consistency:

- Standardized box width to 78 chars across all sections
- Unified field width calculations (62 chars for values)
- Fixed executive summary box with proper dynamic width
- Fixed drive specifications box alignment
- Fixed drive timeline box with proper field widths
- Fixed SMART status box and improved temperature handling (None check)
- Fixed SMART attributes box with consistent widths
- Improved partition boxes:
  - 50-char usage meter (2% per block) instead of 20-char
  - Added percentage display next to meter
  - Truncate long mountpoints in header to prevent overflow
  - Consistent field widths across all fields
- Fixed firmware alerts box alignment

All boxes now use consistent Unicode box-drawing characters (┏━┓┃┗┛│)
with proper width calculations for perfect alignment.
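
A sketch of the 50-character usage meter at 2% per block. The block glyphs
and function name are assumptions; only the width and the trailing
percentage come from this commit.

```python
def usage_meter(percent: float, width: int = 50) -> str:
    """Render a fixed-width bar followed by the percentage."""
    percent = max(0.0, min(100.0, percent))
    filled = int(percent * width / 100)  # 2% per block when width is 50
    return "█" * filled + "░" * (width - filled) + f" {percent:.1f}%"


print(usage_meter(80.1))
```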

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-07 19:44:21 -05:00
jared 63daa57d80 Fix missing drive capacity in ticket titles
Problem: Drive capacity was being extracted but never inserted into ticket titles.
The drive_size variable was calculated from drive details but omitted from the
ticket_title string construction.

Solution: Added drive_size to ticket title format between category and issue.

Example ticket titles now show:
- Before: "[hostname][auto][hardware]Drive /dev/sda has SMART issues..."
- After:  "[hostname][auto][hardware][16.0 TB] Drive /dev/sda has SMART issues..."

This makes it easier to identify which drives need attention at a glance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 17:15:02 -05:00
jared 841db13459 Fix false positive ticket creation for manufacturer operation counters
Problem: Seagate drives were triggering tickets for "Critical Seek_Error_Rate"
and "Critical Command_Timeout" even though these are operation counters used by
the manufacturer, not actual errors.

Solution: Added filtering in _detect_issues() method to skip known manufacturer
operation counters:
- Seek_Error_Rate (Seagate/WD operation counter)
- Command_Timeout (OOS/Seagate operation counter)
- Raw_Read_Error_Rate (Seagate/WD operation counter)

These attributes are already correctly excluded from monitoring in manufacturer
profiles, but were still appearing in smart_issues list. This fix prevents them
from creating tickets while still catching legitimate SMART errors.

Changes:
- hwmonDaemon.py:1351-1378 - Added operation counter filtering in _detect_issues()
- Added debug logging when filtering manufacturer counters

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 17:00:32 -05:00
jared fe832c42f3 Fix critical reliability and security issues in hwmonDaemon
Critical fixes implemented:
- Add 10MB storage limit with automatic cleanup of old history files
- Add file locking (fcntl) to prevent race conditions in history writes
- Disable SMART monitoring for unreliable Ridata drives
- Fix bare except clause in _read_ecc_count() to properly catch errors
- Add timeouts to all network and subprocess calls (10s for API, 30s for subprocess)
- Fix unchecked regex in ticket creation to prevent AttributeError
- Add JSON decode error handling for ticket API responses

Service configuration improvements:
- hwmon.timer: Reduce jitter from 300s to 60s, add Persistent=true
- hwmon.service: Add Restart=on-failure, TimeoutStartSec=300, logging to journal

These changes improve reliability, prevent hung processes, eliminate race
conditions, and add proper error handling throughout the daemon.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 16:55:48 -05:00
jared 0577c7fc1b add api key support 2026-01-01 16:01:55 -05:00
jared 546ef066f8 API Key Auth 2026-01-01 15:45:29 -05:00
jared 0326c5142e Updated hdd temp thresholds 2025-09-03 21:06:12 -04:00
jared 0ab728da47 Better manufacturer detection and values 2025-09-03 13:14:43 -04:00
jared 4b68b0b525 Added custom config for OOS12000G 2025-09-03 13:02:32 -04:00
jared 2d6626cece Fixed thresholds for thermals and smart 2025-09-03 12:58:30 -04:00
jared bc73a691df data retention and large refactor of codebase 2025-09-03 12:43:16 -04:00
jared 3d902620b0 Removed unnecessary logging 2025-09-02 17:50:05 -04:00
jared cae4bf031b Updated priority system 2025-08-17 09:48:25 -04:00
jared fb1a9f67e1 Updated CPU threshold 2025-07-25 17:36:21 -04:00
jared 0faf7654d6 Huge update to vendor profiles 2025-07-24 19:15:21 -04:00
jared a74c4c0309 Erase_Fail_Count matched two values 2025-06-24 15:14:35 -04:00
jared 9a700e9853 Attempted fix for lxc storage 2025-05-29 20:23:21 -04:00
jared 1371592b9e Update LXC storage utilization function 2025-05-29 20:16:50 -04:00
jared 6907f71de1 Updated LXC storage checks 2025-05-29 19:50:17 -04:00