3d498a4092
CRITICAL FIX: Parse nested Ceph device health metrics for NVMe drives
**Root Cause Found**: All 6 NVMe SMART failures were caused by a parsing bug.
Ceph's `device query-daemon-health-metrics` returns data in nested format:
```json
{
"DEVICE_ID": {
"nvme_smart_health_information_log": { ... }
}
}
```
The script checked for `nvme_smart_health_information_log` at the top level,
so the lookup always failed and it fell back to SSH smartctl (which also failed).
**Fix**:
- Extract first device entry from nested dict structure
- Maintain backward compatibility for direct format
- Now correctly parses NVMe SMART from Ceph's built-in metrics
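A minimal sketch of the extraction described above, assuming the metrics have already been parsed from JSON into a Python dict. The names `NVME_KEY` and `extract_nvme_health` are hypothetical and not taken from the script:

```python
NVME_KEY = "nvme_smart_health_information_log"


def extract_nvme_health(metrics):
    """Return the NVMe SMART log from Ceph health metrics, or None if absent."""
    if not isinstance(metrics, dict) or not metrics:
        return None
    # Direct format (kept for backward compatibility): key at the top level.
    if NVME_KEY in metrics:
        return metrics[NVME_KEY]
    # Nested format: {"DEVICE_ID": {...}}, so look inside the first device entry.
    first_entry = next(iter(metrics.values()))
    if isinstance(first_entry, dict):
        return first_entry.get(NVME_KEY)
    return None
```

Checking the direct format first means older callers that pass a flat dict keep working, while nested output is unwrapped one level.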
**Expected Impact**:
- All 6 NVMe drives will now successfully read SMART data
- Should drop from "CRITICAL: No SMART data" to proper health scores
- Only truly healthy NVMe drives will show 100/100 health
- Failing NVMe drives will be properly detected and ranked
**Testing**:
Verified `ceph device query-daemon-health-metrics osd.0` returns full
NVMe SMART data including:
- available_spare: 100%
- percentage_used: 12%
- media_errors: 0
- temperature: 38°C
This data was always available but wasn't being parsed!
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 15:11:53 -05:00
35a16a1793
Fix reallocated sector scoring - drives with bad sectors now rank correctly
**Problem**: osd.28 with 16 reallocated sectors only ranked #7 with score 40.8
This is a CRITICAL failing drive that should rank just below failed SMART reads.
**Changes**:
- Reallocated sectors now use tiered penalties:
* 10+ sectors: -95 points (health = 5/100) - DRIVE FAILING
* 5-9 sectors: -85 points (health = 15/100) - CRITICAL
* 1-4 sectors: -70 points (health = 30/100) - SERIOUS
- Added critical_issues detection for sector problems
- Critical issues get a scoring bonus of +20 (large drives) or +25 (small drives)
- Updated issue text to "DRIVE FAILING" for clarity
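The tiered penalties above could be sketched as follows. This assumes health starts at 100 and that the reallocated-sector penalty is the only deduction applied; the function names are hypothetical:

```python
def reallocated_sector_penalty(sectors):
    """Map a reallocated sector count to a penalty, per the tiers listed above."""
    if sectors >= 10:
        return 95  # DRIVE FAILING
    if sectors >= 5:
        return 85  # CRITICAL
    if sectors >= 1:
        return 70  # SERIOUS
    return 0


def health_after_reallocated(sectors, base_health=100):
    """Health out of 100 after the penalty (assumes no other deductions)."""
    return max(0, base_health - reallocated_sector_penalty(sectors))
```

Under these assumptions a drive with 16 reallocated sectors, such as osd.28, lands at 5/100 health.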
**Expected Result**:
- osd.28 will now score ~96/100 and rank #7 (right after 6 failed SMART)
- Any drive with reallocated/pending/uncorrectable sectors gets top priority
- Matches priority: Failed SMART > Critical sectors > Small failing > Rest
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 15:08:46 -05:00
1848b71c2a
Optimize OSD analyzer: prioritize failing drives and improve SMART collection
Major improvements to scoring and data collection:
**Scoring Changes:**
- Failed SMART reads now return 0/100 health (was 50/100)
- Critical health issues get much higher penalties:
* Reallocated sectors: -50 pts, 5x multiplier (was -20, 2x)
* Pending sectors: -60 pts, 10x multiplier (was -25, 5x)
* Uncorrectable sectors: -70 pts, 15x multiplier (was -30, 5x)
* NVMe media errors: -60 pts, 10x multiplier (was -25, 5x)
- Revised weights: 80% health, 15% capacity, 5% resilience (was 60/30/10)
- Added priority bonuses:
* Failed SMART + small drive (<5TB): +30 points
* Failed SMART alone: +20 points
* Health issues + small drive: +15 points
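One way the revised weights and bonuses might combine is sketched below. The commit does not give the exact formula, so several things here are assumptions: that the priority score rises as health falls (`100 - health`), that the capacity and resilience sub-scores are already on a 0-100 scale, that "health issues" means any health below 100, and that the result is capped at 100. `priority_score` is a hypothetical name:

```python
WEIGHTS = {"health": 0.80, "capacity": 0.15, "resilience": 0.05}
SMALL_DRIVE_TB = 5  # "small drive" threshold from the list above


def priority_score(health, capacity_score, resilience_score, smart_failed, size_tb):
    """Replacement priority on a 0-100 scale (higher means replace sooner)."""
    base = (
        WEIGHTS["health"] * (100 - health)  # assumption: poor health raises priority
        + WEIGHTS["capacity"] * capacity_score
        + WEIGHTS["resilience"] * resilience_score
    )
    small = size_tb < SMALL_DRIVE_TB
    if smart_failed and small:
        bonus = 30
    elif smart_failed:
        bonus = 20
    elif health < 100 and small:  # assumption: any deduction counts as an issue
        bonus = 15
    else:
        bonus = 0
    return min(100.0, base + bonus)  # assumption: scores are capped at 100
```

With a failed SMART read scored as 0/100 health, the health term alone contributes 80 points, so either failed-SMART bonus pushes the drive to the top of the range.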
**Priority Order Now Enforced:**
1. Failed SMART drives (score 90-100)
2. Small drives beginning to fail (70-85)
3. Small healthy drives (40-60)
4. Large failing drives (60-75)
**Enhanced SMART Collection:**
- Added metadata.devices field parsing
- Enhanced dm-device and /dev/mapper/ resolution
- Added ceph-volume lvm list fallback
- Retry logic with 3 command variations per device
- Try with/without sudo, different device flags
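The retry logic could look roughly like the sketch below. It runs locally rather than over SSH, and the three command variations are illustrative guesses, not the script's actual list. It also assumes smartctl 7.0+ JSON output (`-j`) and that the JSON carries a `smartctl.exit_status` bitmask whose low two bits signal a command-line or device-open failure. All function names are hypothetical:

```python
import json
import subprocess


def smartctl_variants(device):
    """Three illustrative command variations, tried in order."""
    return [
        ["smartctl", "-a", "-j", device],
        ["sudo", "smartctl", "-a", "-j", device],
        ["sudo", "smartctl", "-a", "-j", "-d", "auto", device],
    ]


def run_command(cmd, timeout=30):
    """Run a command and return its stdout, or None on error or empty output."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return proc.stdout if proc.stdout.strip() else None


def collect_smart(device, runner=run_command):
    """Return parsed SMART data from the first variation that succeeds."""
    for cmd in smartctl_variants(device):
        output = runner(cmd)
        if not output:
            continue
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            continue
        if not isinstance(data, dict):
            continue
        # Assumed layout: low two bits set means smartctl could not parse the
        # command line or open the device, so try the next variation.
        status = data.get("smartctl", {}).get("exit_status", 0)
        if status & 0b11:
            continue
        return data
    return None
```

Passing the runner as a parameter keeps the retry loop testable without real hardware, and the same loop could wrap an SSH-based runner.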
**Expected Impact:**
- osd.28 with reallocated sectors jumps from #14 to top 3
- SMART collection failures should drop from 6 to 0-2
- All failing drives rank above healthy drives regardless of size
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 15:05:25 -05:00
3b15377821
separate smartctl depending on device class
2025-12-22 18:23:06 -05:00
c315fa3efc
Updated readme again
2025-12-22 18:15:46 -05:00
89037ed93f
Merge branch 'main' of code.lotusguild.org:LotusGuild/analyzeOSDs
2025-12-22 18:12:42 -05:00
c87c13eb1f
revert 9793f8bcbe
revert Updated quick execute commands
2025-12-22 18:10:00 -05:00
c252dbcdc4
resolves /dev/dm-* now
2025-12-22 18:05:32 -05:00
9793f8bcbe
Updated quick execute commands
2025-12-22 17:51:33 -05:00
1610aa2606
removed pg and latency counters
2025-12-22 17:14:02 -05:00
db757345fb
Better patterns and error handling
2025-12-22 17:08:13 -05:00
e12b53238e
Pushed malformed code, whoops
2025-12-22 17:00:45 -05:00
559ed9fc94
adds /dev in front of block devices
2025-12-22 16:57:53 -05:00
43d35feb46
Enables ssh to all hosts to gather smart data
2025-12-22 16:50:04 -05:00
a861276013
Created README
2025-12-22 16:46:02 -05:00
7dab2591b1
First test
2025-12-22 16:40:19 -05:00
983b1f1c29
first commit
2025-12-22 16:39:49 -05:00