Compare commits

...

16 Commits

Author SHA1 Message Date
509603843b changed osd down events to be cluster wide and deduplcated 2026-01-26 11:03:55 -05:00
1e84144e29 Fix P1 escalation false positives and Ceph title spacing
- Exclude manufacturer operation counters (Seek_Error_Rate,
  Command_Timeout, Raw_Read_Error_Rate) from critical issue
  count to prevent false P1 escalation

- Fix missing space after [ceph] tag in ticket titles
  Before: [hostname][auto][ceph]Ceph HEALTH_WARN
  After:  [hostname][auto][ceph] Ceph HEALTH_WARN

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:59:58 -05:00
6d959eff02 Fix duplicate [ceph] tag in ticket titles
Remove [ceph] marker from issue text since _categorize_issue
already adds it as the issue_tag.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:56:25 -05:00
0f8918fb8b Add Ceph cluster monitoring and Prometheus metrics export
- Add comprehensive Ceph cluster health monitoring
  - Check cluster health status (HEALTH_OK/WARN/ERR)
  - Monitor cluster usage with configurable thresholds
  - Track OSD status (up/down) per node
  - Separate cluster-wide vs node-specific issues

- Cluster-wide ticket deduplication
  - Add [cluster-wide] scope tag for Ceph issues
  - Cluster-wide issues deduplicate across all nodes
  - Node-specific issues (OSD down) include hostname

- Add Prometheus metrics export
  - export_prometheus_metrics() method
  - write_prometheus_metrics() for textfile collector
  - --metrics CLI flag to output metrics to stdout
  - --export-json CLI flag to export health report as JSON

- Add Grafana dashboard template (grafana-dashboard.json)
- Add .gitignore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:54:16 -05:00
3322c5878a Upgrade priority system and fix ticket type alignment
- Add P5 (LOW) priority for informational/minimal impact alerts
- Expand ISSUE_PRIORITIES from 7 to 40+ comprehensive mappings
- Fix TICKET_TYPES to match tinker_tickets API (Issue, Problem, Task,
  Maintenance, Upgrade, Install, Request)
- Fix TICKET_CATEGORIES to only Hardware and Software
- Add P1 escalation logic via _count_critical_issues() helper
- Rewrite _determine_ticket_priority() with full P1-P5 support
- Add CONFIG options: INCLUDE_INFO_TICKETS, PRIORITY_ESCALATION_THRESHOLD
- Filter INFO-level alerts from ticket creation by default
- Update _categorize_issue() to use valid ticket types

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:24:35 -05:00
87b16ca822 Update README.md 2026-01-12 16:38:36 -05:00
0f81d015cd Implement proper ticket categorization based on issue type
Added intelligent categorization to match tickets with correct category and type
instead of defaulting everything to Hardware/Problem.

Changes:
- Added TICKET_CATEGORIES and TICKET_TYPES mappings for API consistency
- Created _categorize_issue() method to determine proper classification:

  Hardware Issues:
  - SMART/drive/disk errors → Hardware + Incident (critical/failed)
  - SMART warnings → Hardware + Problem (needs investigation)

  Software Issues:
  - LXC/container/storage usage/CPU → Software category
  - Critical levels → Software + Incident (service degradation)
  - Warning levels → Software + Problem (preventive investigation)

  Network Issues:
  - Network failures/unreachable → Network + Incident
  - Network warnings → Network + Problem

- Updated ticket creation to use _categorize_issue() and _determine_ticket_priority()
- Tickets now have correct tags: [incident] vs [problem] instead of always [maintenance]
- Category field in API payload now matches issue type (Hardware/Software/Network)
- Type field in API payload now reflects actual situation (Incident/Problem/Task)

Examples:
- "LXC storage usage >80%" → Software + Problem
- "Critical SMART errors" → Hardware + Incident
- "High CPU usage" → Software + Problem
- "Network unreachable" → Network + Incident

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-08 13:26:17 -05:00
88afc8f03e Improve ASCII art formatting in ticket descriptions
Fixed all box alignment issues and improved visual consistency:

- Standardized box width to 78 chars across all sections
- Unified field width calculations (62 chars for values)
- Fixed executive summary box with proper dynamic width
- Fixed drive specifications box alignment
- Fixed drive timeline box with proper field widths
- Fixed SMART status box and improved temperature handling (None check)
- Fixed SMART attributes box with consistent widths
- Improved partition boxes:
  - 50-char usage meter (2% per block) instead of 20-char
  - Added percentage display next to meter
  - Truncate long mountpoints in header to prevent overflow
  - Consistent field widths across all fields
- Fixed firmware alerts box alignment

All boxes now use consistent Unicode box-drawing characters (┏━┓┃┗┛│)
with proper width calculations for perfect alignment.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-07 19:44:21 -05:00
63daa57d80 Fix missing drive capacity in ticket titles
Problem: Drive capacity was being extracted but never inserted into ticket titles.
The drive_size variable was calculated from drive details but omitted from the
ticket_title string construction.

Solution: Added drive_size to ticket title format between category and issue.

Example ticket titles now show:
- Before: "[hostname][auto][hardware]Drive /dev/sda has SMART issues..."
- After:  "[hostname][auto][hardware][16.0 TB] Drive /dev/sda has SMART issues..."

This makes it easier to identify which drives need attention at a glance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 17:15:02 -05:00
72e61bd94e Add manual execution instructions to README
Added comprehensive manual execution section with both:
- Direct execution from local file (python3 hwmonDaemon.py)
- Remote execution matching systemd service (one-liner download+exec)

Both modes include dry-run and normal execution examples for testing
and production use.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 17:03:27 -05:00
841db13459 Fix false positive ticket creation for manufacturer operation counters
Problem: Seagate drives were triggering tickets for "Critical Seek_Error_Rate"
and "Critical Command_Timeout" even though these are operation counters used by
the manufacturer, not actual errors.

Solution: Added filtering in _detect_issues() method to skip known manufacturer
operation counters:
- Seek_Error_Rate (Seagate/WD operation counter)
- Command_Timeout (OOS/Seagate operation counter)
- Raw_Read_Error_Rate (Seagate/WD operation counter)

These attributes are already correctly excluded from monitoring in manufacturer
profiles, but were still appearing in smart_issues list. This fix prevents them
from creating tickets while still catching legitimate SMART errors.

Changes:
- hwmonDaemon.py:1351-1378 - Added operation counter filtering in _detect_issues()
- Added debug logging when filtering manufacturer counters

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 17:00:32 -05:00
10b548cd79 Update README with hourly execution schedule and recent improvements
- Document hourly execution (changed from daily)
- Add version 2.0 improvements section
- Document 10MB storage limit and automatic cleanup
- Clarify service configuration details

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 16:57:16 -05:00
fe832c42f3 Fix critical reliability and security issues in hwmonDaemon
Critical fixes implemented:
- Add 10MB storage limit with automatic cleanup of old history files
- Add file locking (fcntl) to prevent race conditions in history writes
- Disable SMART monitoring for unreliable Ridata drives
- Fix bare except clause in _read_ecc_count() to properly catch errors
- Add timeouts to all network and subprocess calls (10s for API, 30s for subprocess)
- Fix unchecked regex in ticket creation to prevent AttributeError
- Add JSON decode error handling for ticket API responses

Service configuration improvements:
- hwmon.timer: Reduce jitter from 300s to 60s, add Persistent=true
- hwmon.service: Add Restart=on-failure, TimeoutStartSec=300, logging to journal

These changes improve reliability, prevent hung processes, eliminate race
conditions, and add proper error handling throughout the daemon.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 16:55:48 -05:00
0577c7fc1b add api key support 2026-01-01 16:01:55 -05:00
cc62aabfe4 Merge branch 'main' of 10.10.10.63:LotusGuild/hwmonDaemon 2026-01-01 15:50:30 -05:00
546ef066f8 API Key Auth 2026-01-01 15:45:29 -05:00
7 changed files with 1564 additions and 209 deletions

2
.gitignore vendored Normal file
View File

@@ -0,0 +1,2 @@
.claude
settings.local.json

View File

@@ -37,11 +37,13 @@ sudo systemctl start hwmon.timer
### One liner (run as root)
```bash
curl -o /etc/systemd/system/hwmon.service http://10.10.10.110:3000/JWS/hwmonDaemon/raw/branch/main/hwmon.service && curl -o /etc/systemd/system/hwmon.timer http://10.10.10.110:3000/JWS/hwmonDaemon/raw/branch/main/hwmon.timer && systemctl daemon-reload && systemctl enable hwmon.timer && systemctl start hwmon.timer
curl -o /etc/systemd/system/hwmon.service http://10.10.10.110:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmon.service && curl -o /etc/systemd/system/hwmon.timer http://10.10.10.110:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmon.timer && systemctl daemon-reload && systemctl enable hwmon.timer && systemctl start hwmon.timer
```
## Manual Execution
### Direct Execution (from local file)
1. Run the daemon with dry-run mode to test:
```bash
python3 hwmonDaemon.py --dry-run
@@ -51,6 +53,20 @@ python3 hwmonDaemon.py --dry-run
python3 hwmonDaemon.py
```
### Remote Execution (same as systemd service)
Execute directly from repository without downloading:
1. Run with dry-run mode to test:
```bash
/usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))" --dry-run
```
2. Run normally (creates actual tickets):
```bash
/usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))"
```
## Configuration
@@ -73,6 +89,8 @@ The daemon creates and maintains:
- **Log Directory**: `/var/log/hwmonDaemon/`
- **Historical SMART Data**: JSON files for trend analysis
- **Data Retention**: 30 days of historical monitoring data
- **Storage Limit**: Automatically enforced 10MB maximum
- **Cleanup**: Oldest files deleted first when limit exceeded
## Ticket Creation
@@ -107,9 +125,22 @@ The following paths are automatically excluded from monitoring:
The daemon runs:
- Daily via systemd timer
- Hourly via systemd timer (with 60-second randomized delay)
- As root user for hardware access
- With automatic restart on failure
- 5-minute timeout for execution
- Logs to systemd journal
## Recent Improvements
**Version 2.0** (January 2026):
- ✅ Added 10MB storage limit with automatic cleanup
- ✅ File locking to prevent race conditions
- ✅ Disabled monitoring for unreliable Ridata drives
- ✅ Added timeouts to all network/subprocess calls (10s API, 30s subprocess)
- ✅ Fixed unchecked regex patterns
- ✅ Improved error handling throughout
- ✅ Enhanced systemd service configuration with restart policies
## Troubleshooting

Binary file not shown.

375
grafana-dashboard.json Normal file
View File

@@ -0,0 +1,375 @@
{
"__inputs": [
{
"name": "DS_PROMETHEUS",
"label": "Prometheus",
"description": "Prometheus data source for hwmonDaemon metrics",
"type": "datasource",
"pluginId": "prometheus",
"pluginName": "Prometheus"
}
],
"__elements": {},
"__requires": [
{
"type": "grafana",
"id": "grafana",
"name": "Grafana",
"version": "10.0.0"
},
{
"type": "datasource",
"id": "prometheus",
"name": "Prometheus",
"version": "1.0.0"
},
{
"type": "panel",
"id": "gauge",
"name": "Gauge",
"version": ""
},
{
"type": "panel",
"id": "stat",
"name": "Stat",
"version": ""
},
{
"type": "panel",
"id": "table",
"name": "Table",
"version": ""
},
{
"type": "panel",
"id": "timeseries",
"name": "Time series",
"version": ""
}
],
"annotations": {
"list": []
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [],
"liveNow": false,
"panels": [
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 0},
"id": 1,
"panels": [],
"title": "Overview",
"type": "row"
},
{
"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
"fieldConfig": {
"defaults": {
"color": {"mode": "thresholds"},
"mappings": [
{"options": {"0": {"color": "red", "index": 0, "text": "Issues Detected"}}, "type": "value"},
{"options": {"from": 1, "result": {"color": "green", "index": 1, "text": "Healthy"}, "to": 999999}, "type": "range"}
],
"thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}, {"color": "green", "value": 1}]}
},
"overrides": []
},
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 1},
"id": 2,
"options": {"colorMode": "value", "graphMode": "none", "justifyMode": "auto", "orientation": "auto", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "auto"},
"pluginVersion": "10.0.0",
"targets": [{"expr": "count(hwmon_info)", "legendFormat": "Hosts", "refId": "A"}],
"title": "Monitored Hosts",
"type": "stat"
},
{
"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
"fieldConfig": {
"defaults": {
"color": {"mode": "thresholds"},
"mappings": [],
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 1}, {"color": "red", "value": 3}]}
},
"overrides": []
},
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 1},
"id": 3,
"options": {"colorMode": "value", "graphMode": "none", "justifyMode": "auto", "orientation": "auto", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "auto"},
"pluginVersion": "10.0.0",
"targets": [{"expr": "sum(hwmon_issues_total)", "legendFormat": "Issues", "refId": "A"}],
"title": "Total Issues",
"type": "stat"
},
{
"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
"fieldConfig": {
"defaults": {
"color": {"mode": "thresholds"},
"mappings": [],
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 1}, {"color": "red", "value": 2}]}
},
"overrides": []
},
"gridPos": {"h": 4, "w": 4, "x": 8, "y": 1},
"id": 4,
"options": {"colorMode": "value", "graphMode": "none", "justifyMode": "auto", "orientation": "auto", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "auto"},
"pluginVersion": "10.0.0",
"targets": [{"expr": "count(hwmon_drive_smart_healthy == 0)", "legendFormat": "Unhealthy", "refId": "A"}],
"title": "Unhealthy Drives",
"type": "stat"
},
{
"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
"fieldConfig": {
"defaults": {
"color": {"mode": "thresholds"},
"mappings": [
{"options": {"0": {"color": "red", "index": 0, "text": "Unhealthy"}}, "type": "value"},
{"options": {"1": {"color": "green", "index": 1, "text": "Healthy"}}, "type": "value"}
],
"thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}, {"color": "green", "value": 1}]}
},
"overrides": []
},
"gridPos": {"h": 4, "w": 4, "x": 12, "y": 1},
"id": 5,
"options": {"colorMode": "value", "graphMode": "none", "justifyMode": "auto", "orientation": "auto", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "auto"},
"pluginVersion": "10.0.0",
"targets": [{"expr": "min(hwmon_ceph_cluster_healthy)", "legendFormat": "Ceph", "refId": "A"}],
"title": "Ceph Cluster Health",
"type": "stat"
},
{
"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
"fieldConfig": {
"defaults": {
"color": {"mode": "thresholds"},
"mappings": [],
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 1}, {"color": "red", "value": 2}]}
},
"overrides": []
},
"gridPos": {"h": 4, "w": 4, "x": 16, "y": 1},
"id": 6,
"options": {"colorMode": "value", "graphMode": "none", "justifyMode": "auto", "orientation": "auto", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "textMode": "auto"},
"pluginVersion": "10.0.0",
"targets": [{"expr": "sum(hwmon_ceph_osd_down)", "legendFormat": "Down OSDs", "refId": "A"}],
"title": "Ceph OSDs Down",
"type": "stat"
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 5},
"id": 10,
"panels": [],
"title": "Drive Health",
"type": "row"
},
{
"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
"fieldConfig": {
"defaults": {
"color": {"mode": "palette-classic"},
"custom": {"axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": {"legend": false, "tooltip": false, "viz": false}, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": {"type": "linear"}, "showPoints": "never", "spanNulls": false, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "line"}},
"mappings": [],
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 45}, {"color": "red", "value": 55}]},
"unit": "celsius"
},
"overrides": []
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 6},
"id": 11,
"options": {"legend": {"calcs": ["lastNotNull", "max"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
"pluginVersion": "10.0.0",
"targets": [{"expr": "hwmon_drive_temperature_celsius", "legendFormat": "{{hostname}} {{device}}", "refId": "A"}],
"title": "Drive Temperatures",
"type": "timeseries"
},
{
"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
"fieldConfig": {
"defaults": {
"color": {"mode": "thresholds"},
"custom": {"align": "auto", "cellOptions": {"type": "color-text"}, "inspect": false},
"mappings": [
{"options": {"0": {"color": "red", "index": 0, "text": "UNHEALTHY"}}, "type": "value"},
{"options": {"1": {"color": "green", "index": 1, "text": "HEALTHY"}}, "type": "value"}
],
"thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": null}, {"color": "green", "value": 1}]}
},
"overrides": [
{"matcher": {"id": "byName", "options": "hostname"}, "properties": [{"id": "custom.width", "value": 120}]},
{"matcher": {"id": "byName", "options": "device"}, "properties": [{"id": "custom.width", "value": 100}]},
{"matcher": {"id": "byName", "options": "Status"}, "properties": [{"id": "custom.width", "value": 100}]},
{"matcher": {"id": "byName", "options": "Issues"}, "properties": [{"id": "custom.cellOptions", "value": {"mode": "gradient", "type": "gauge"}}, {"id": "thresholds", "value": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 1}, {"color": "red", "value": 3}]}}]}
]
},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 6},
"id": 12,
"options": {"cellHeight": "sm", "footer": {"countRows": false, "fields": "", "reducer": ["sum"], "show": false}, "showHeader": true, "sortBy": [{"desc": true, "displayName": "Issues"}]},
"pluginVersion": "10.0.0",
"targets": [
{"expr": "hwmon_drive_smart_healthy", "format": "table", "instant": true, "legendFormat": "", "refId": "A"},
{"expr": "hwmon_drive_smart_issues_total", "format": "table", "instant": true, "legendFormat": "", "refId": "B"}
],
"title": "Drive Status",
"transformations": [
{"id": "merge", "options": {}},
{"id": "organize", "options": {"excludeByName": {"Time": true, "__name__": true}, "indexByName": {"Value #A": 2, "Value #B": 3, "device": 1, "hostname": 0}, "renameByName": {"Value #A": "Status", "Value #B": "Issues", "device": "Device", "hostname": "Host"}}}
],
"type": "table"
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 14},
"id": 20,
"panels": [],
"title": "System Resources",
"type": "row"
},
{
"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
"fieldConfig": {
"defaults": {
"color": {"mode": "palette-classic"},
"custom": {"axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": {"legend": false, "tooltip": false, "viz": false}, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": {"type": "linear"}, "showPoints": "never", "spanNulls": false, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "line"}},
"mappings": [],
"max": 100,
"min": 0,
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 80}, {"color": "red", "value": 95}]},
"unit": "percent"
},
"overrides": []
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 15},
"id": 21,
"options": {"legend": {"calcs": ["lastNotNull", "max"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
"pluginVersion": "10.0.0",
"targets": [{"expr": "hwmon_cpu_usage_percent", "legendFormat": "{{hostname}}", "refId": "A"}],
"title": "CPU Usage",
"type": "timeseries"
},
{
"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
"fieldConfig": {
"defaults": {
"color": {"mode": "palette-classic"},
"custom": {"axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": {"legend": false, "tooltip": false, "viz": false}, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": {"type": "linear"}, "showPoints": "never", "spanNulls": false, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "line"}},
"mappings": [],
"max": 100,
"min": 0,
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 80}, {"color": "red", "value": 95}]},
"unit": "percent"
},
"overrides": []
},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 15},
"id": 22,
"options": {"legend": {"calcs": ["lastNotNull", "max"], "displayMode": "table", "placement": "right", "showLegend": true}, "tooltip": {"mode": "multi", "sort": "desc"}},
"pluginVersion": "10.0.0",
"targets": [{"expr": "hwmon_memory_usage_percent", "legendFormat": "{{hostname}}", "refId": "A"}],
"title": "Memory Usage",
"type": "timeseries"
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 23},
"id": 30,
"panels": [],
"title": "Ceph Cluster",
"type": "row"
},
{
"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
"fieldConfig": {
"defaults": {
"color": {"mode": "thresholds"},
"mappings": [],
"max": 100,
"min": 0,
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 70}, {"color": "red", "value": 85}]},
"unit": "percent"
},
"overrides": []
},
"gridPos": {"h": 6, "w": 6, "x": 0, "y": 24},
"id": 31,
"options": {"orientation": "auto", "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}, "showThresholdLabels": false, "showThresholdMarkers": true},
"pluginVersion": "10.0.0",
"targets": [{"expr": "max(hwmon_ceph_cluster_usage_percent)", "legendFormat": "Usage", "refId": "A"}],
"title": "Ceph Cluster Usage",
"type": "gauge"
},
{
"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
"fieldConfig": {
"defaults": {
"color": {"mode": "palette-classic"},
"custom": {"axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": {"legend": false, "tooltip": false, "viz": false}, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": {"type": "linear"}, "showPoints": "never", "spanNulls": false, "stacking": {"group": "A", "mode": "none"}, "thresholdsStyle": {"mode": "line"}},
"mappings": [],
"max": 100,
"min": 0,
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 70}, {"color": "red", "value": 85}]},
"unit": "percent"
},
"overrides": []
},
"gridPos": {"h": 6, "w": 18, "x": 6, "y": 24},
"id": 32,
"options": {"legend": {"calcs": ["lastNotNull"], "displayMode": "list", "placement": "right", "showLegend": true}, "tooltip": {"mode": "single", "sort": "none"}},
"pluginVersion": "10.0.0",
"targets": [{"expr": "hwmon_ceph_cluster_usage_percent", "legendFormat": "{{hostname}}", "refId": "A"}],
"title": "Ceph Usage Over Time",
"type": "timeseries"
},
{
"collapsed": false,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 30},
"id": 40,
"panels": [],
"title": "LXC Containers",
"type": "row"
},
{
"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
"fieldConfig": {
"defaults": {
"color": {"mode": "thresholds"},
"custom": {"align": "auto", "cellOptions": {"type": "auto"}, "inspect": false},
"mappings": [],
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}, {"color": "yellow", "value": 80}, {"color": "red", "value": 95}]}
},
"overrides": [
{"matcher": {"id": "byName", "options": "Usage"}, "properties": [{"id": "custom.cellOptions", "value": {"mode": "gradient", "type": "gauge"}}, {"id": "unit", "value": "percent"}, {"id": "max", "value": 100}, {"id": "min", "value": 0}]}
]
},
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 31},
"id": 41,
"options": {"cellHeight": "sm", "footer": {"countRows": false, "fields": "", "reducer": ["sum"], "show": false}, "showHeader": true, "sortBy": [{"desc": true, "displayName": "Usage"}]},
"pluginVersion": "10.0.0",
"targets": [{"expr": "hwmon_lxc_storage_usage_percent", "format": "table", "instant": true, "legendFormat": "", "refId": "A"}],
"title": "LXC Storage Usage",
"transformations": [
{"id": "organize", "options": {"excludeByName": {"Time": true, "__name__": true}, "indexByName": {}, "renameByName": {"Value": "Usage", "hostname": "Host", "mountpoint": "Mountpoint", "vmid": "Container ID"}}}
],
"type": "table"
}
],
"refresh": "1m",
"schemaVersion": 38,
"style": "dark",
"tags": ["hwmon", "hardware", "monitoring", "proxmox"],
"templating": {"list": []},
"time": {"from": "now-6h", "to": "now"},
"timepicker": {},
"timezone": "browser",
"title": "hwmonDaemon - Hardware Monitor",
"uid": "hwmondaemon",
"version": 1,
"weekStart": ""
}

View File

@@ -7,6 +7,11 @@ Type=oneshot
ExecStart=/usr/bin/env python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/hwmonDaemon/raw/branch/main/hwmonDaemon.py').read().decode('utf-8'))"
User=root
Group=root
Restart=on-failure
RestartSec=60
TimeoutStartSec=300
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target

View File

@@ -1,9 +1,10 @@
[Unit]
Description=Run System Health Monitoring Daemon Daily
Description=Run System Health Monitoring Daemon Hourly
[Timer]
OnCalendar=hourly
RandomizedDelaySec=300
RandomizedDelaySec=60
Persistent=true
[Install]
WantedBy=timers.target

File diff suppressed because it is too large Load Diff