2 Commits

Author SHA1 Message Date
jared 78691e6235 ci: add notify-failure, pytest with coverage, and 49 unit tests
- lint.yml: add notify-failure Matrix alert job
- test.yml: new workflow running pytest with pytest-cov for coverage
- .coveragerc: omit tests and site-packages from coverage
- .gitignore: ignore __pycache__ and .pyc files
- tests/test_hwmon.py: 49 unit tests covering SystemHealthMonitor
  (temperature parsing, service monitoring, disk usage, metric collection,
  dry run behaviour); uses unittest.mock to isolate from env/filesystem
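
The isolation approach named above (unittest.mock in place of the real filesystem) can be sketched as follows. This is a minimal illustration, not code from tests/test_hwmon.py: the helper `read_temp_celsius` and the sysfs path are hypothetical, and the only assumption about the data is that the file holds millidegrees Celsius.

```python
from unittest import mock

# Hypothetical path, used only for illustration.
DEFAULT_SENSOR = "/sys/class/hwmon/hwmon0/temp1_input"


def read_temp_celsius(path=DEFAULT_SENSOR):
    """Hypothetical helper: read a millidegree value and return degrees Celsius."""
    with open(path) as fh:
        return int(fh.read().strip()) / 1000.0


def test_read_temp_celsius():
    # mock_open stands in for the real file, so the test never touches /sys.
    fake_open = mock.mock_open(read_data="45000\n")
    with mock.patch("builtins.open", fake_open):
        assert read_temp_celsius() == 45.0
    # Confirm the code under test asked for the expected path.
    fake_open.assert_called_once_with(DEFAULT_SENSOR)
```

Patching `builtins.open` keeps the test independent of the host's hardware, which is what lets a suite like this run in CI.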

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 16:25:23 -04:00
jared 0f8918fb8b Add Ceph cluster monitoring and Prometheus metrics export
- Add comprehensive Ceph cluster health monitoring
  - Check cluster health status (HEALTH_OK/WARN/ERR)
  - Monitor cluster usage with configurable thresholds
  - Track OSD status (up/down) per node
  - Separate cluster-wide vs node-specific issues
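
A rough sketch of how the health status and per-node OSD checks above could be derived from Ceph's JSON output. The function names are hypothetical, and the input shapes are assumptions modelled on `ceph status --format json` (a `health.status` field) and `ceph osd tree --format json` (a `nodes` list with `host` and `osd` entries); the actual monitor may gather this differently.

```python
import json

# Map Ceph health strings to a numeric severity (assumed ordering).
SEVERITY = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}


def cluster_severity(status_json):
    """Return 0/1/2 for OK/WARN/ERR; unknown states are treated as errors."""
    status = json.loads(status_json)["health"]["status"]
    return SEVERITY.get(status, 2)


def down_osds_by_host(tree_json):
    """Group OSDs reported as 'down' under the host that owns them."""
    nodes = {n["id"]: n for n in json.loads(tree_json)["nodes"]}
    result = {}
    for node in nodes.values():
        if node.get("type") != "host":
            continue
        down = [
            nodes[child]["name"]
            for child in node.get("children", [])
            if child in nodes and nodes[child].get("status") == "down"
        ]
        if down:
            result[node["name"]] = down
    return result
```

Keeping the two functions separate mirrors the split described above: the health status is a cluster-wide signal, while a down OSD belongs to one node.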

- Cluster-wide ticket deduplication
  - Add [cluster-wide] scope tag for Ceph issues
  - Cluster-wide issues deduplicate across all nodes
  - Node-specific issues (OSD down) include hostname
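
One way the deduplication rule above might work, shown as a sketch only. The `dedup_key` function and the exact title layout are hypothetical; the commit states only that cluster-wide issues carry a `[cluster-wide]` tag and that node-specific issues include the hostname.

```python
def dedup_key(issue, hostname, cluster_wide):
    """Build a ticket key: shared across nodes for cluster-wide issues,
    unique per host for node-specific ones."""
    if cluster_wide:
        return f"[cluster-wide] {issue}"
    return f"[{hostname}] {issue}"


# Three nodes report the same cluster-wide warning; one also has a local fault.
reports = [
    ("Ceph HEALTH_WARN", "node1", True),
    ("Ceph HEALTH_WARN", "node2", True),
    ("Ceph HEALTH_WARN", "node3", True),
    ("OSD osd.1 down", "node1", False),
]

# A set collapses identical keys, so only two tickets would be opened.
tickets = {dedup_key(issue, host, wide) for issue, host, wide in reports}
```

Because the hostname is left out of the cluster-wide key, every node produces the same key for the same cluster-level problem.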

- Add Prometheus metrics export
  - export_prometheus_metrics() method
  - write_prometheus_metrics() for textfile collector
  - --metrics CLI flag to output metrics to stdout
  - --export-json CLI flag to export health report as JSON
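
A minimal sketch of writing metrics for the node_exporter textfile collector, which reads `*.prom` files in the Prometheus text exposition format. The functions `render_metrics` and `write_textfile` and the metric name are hypothetical stand-ins, not the signatures of `export_prometheus_metrics()` or `write_prometheus_metrics()`.

```python
import os
import tempfile


def render_metrics(metrics):
    """Render {name: (help_text, value)} as gauge lines in exposition format."""
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def write_textfile(text, path):
    """Write to a temp file in the same directory, then rename atomically,
    so the collector never reads a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(text)
        # mkstemp creates the file owner-only; loosen it so the exporter can read it.
        os.chmod(tmp, 0o644)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


# Hypothetical metric name and value, for illustration only.
sample = {"hwmon_ceph_health_status": ("Ceph health (0=OK, 1=WARN, 2=ERR)", 1)}
```

The temp file is created in the target directory so that `os.replace` is a rename within one filesystem.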

- Add Grafana dashboard template (grafana-dashboard.json)
- Add .gitignore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 15:54:16 -05:00