13 Commits

03374fa784 Add USB drive SMART support with multiple bridge chipset attempts
**Issue**: osd.2 is a USB-connected 1TB drive whose SMART data couldn't be read

Error was: "Read Device Identity failed: scsi error unsupported field"
This is typical for USB-attached drives that need bridge-specific flags.

**Solution**: Added USB transport detection and multiple fallback methods:
- SAT (SCSI-ATA Translation) - most common USB bridges
- usbjmicron - JMicron USB bridge chipsets
- usbcypress - Cypress USB bridge chipsets
- Generic USB fallback
- SCSI passthrough

Also added a USB/SAT attempt for unknown transport types as a fallback.
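The fallback chain above can be sketched as a loop over smartctl's standard bridge device types. This is a minimal illustration, not the script's actual code; the try order, the `read_usb_smart` name, and the "usable reply" heuristic are assumptions:

```python
import json
import subprocess

# Bridge/device-type flags tried in order. These are standard smartmontools
# device types; the exact order here is an assumption.
USB_DEVICE_TYPES = ["sat", "usbjmicron", "usbcypress", "scsi"]

def read_usb_smart(device, runner=None):
    """Try each smartctl device-type flag until one yields usable SMART JSON.

    `runner` is injectable for testing; by default it invokes smartctl.
    """
    if runner is None:
        def runner(dev_type):
            result = subprocess.run(
                ["smartctl", "-a", "--json", "-d", dev_type, device],
                capture_output=True, text=True)
            return result.stdout
    for dev_type in USB_DEVICE_TYPES:
        try:
            data = json.loads(runner(dev_type) or "{}")
        except json.JSONDecodeError:
            continue
        # Treat a reply with a device section and no error messages as usable.
        if data.get("device") and not data.get("smartctl", {}).get("messages"):
            return dev_type, data
    return None, None  # USB bridge incompatible with smartmontools
```

The injectable `runner` keeps the retry logic testable without real hardware.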

**Debug Enhancement**:
- Now shows detected transport type in debug output
- Helps diagnose why SMART fails

**Note**: USB drives in Ceph clusters are unconventional but functional.
This OSD appears to be temporary/supplemental storage capacity.

If SMART still fails after this update, the USB bridge may be incompatible
with smartmontools, which is acceptable for temporary storage.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 15:16:35 -05:00
3d498a4092 CRITICAL FIX: Parse nested Ceph device health metrics for NVMe drives
**Root Cause Found**: All 6 NVMe SMART failures were due to a parsing bug!

Ceph's `device query-daemon-health-metrics` returns data in nested format:
```json
{
  "DEVICE_ID": {
    "nvme_smart_health_information_log": { ... }
  }
}
```

The script was checking for `nvme_smart_health_information_log` at the top
level, so it always failed and fell back to SSH smartctl (which also failed).

**Fix**:
- Extract first device entry from nested dict structure
- Maintain backward compatibility for direct format
- Now correctly parses NVMe SMART from Ceph's built-in metrics
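The fix amounts to one level of unwrapping. A minimal sketch (the function name and return convention are illustrative, not the script's actual code):

```python
import json

def extract_nvme_smart(metrics_json):
    """Return the NVMe SMART log from `ceph device query-daemon-health-metrics`
    output, handling both the nested per-device format and the flat format."""
    data = json.loads(metrics_json)
    # Backward-compatible direct format: log already at the top level.
    if "nvme_smart_health_information_log" in data:
        return data["nvme_smart_health_information_log"]
    # Nested format: the log sits one level down, keyed by device ID.
    for entry in data.values():
        if isinstance(entry, dict) and "nvme_smart_health_information_log" in entry:
            return entry["nvme_smart_health_information_log"]
    return None
```

Checking the flat format first preserves backward compatibility; the nested branch just takes the first device entry that carries the log.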

**Expected Impact**:
- All 6 NVMe drives will now successfully read SMART data
- Should drop from "CRITICAL: No SMART data" to proper health scores
- Only truly healthy NVMe drives will show 100/100 health
- Failing NVMe drives will be properly detected and ranked

**Testing**:
Verified `ceph device query-daemon-health-metrics osd.0` returns full
NVMe SMART data including:
- available_spare: 100%
- percentage_used: 12%
- media_errors: 0
- temperature: 38°C

This data was always available but wasn't being parsed!

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 15:11:53 -05:00
35a16a1793 Fix reallocated sector scoring - drives with bad sectors now rank correctly
**Problem**: osd.28, with 16 reallocated sectors, ranked only #7 with a score of 40.8.
This is a CRITICAL failing drive that should rank just below failed SMART reads.

**Changes**:
- Reallocated sectors now use tiered penalties:
  * 10+ sectors: -95 points (health = 5/100) - DRIVE FAILING
  * 5-9 sectors: -85 points (health = 15/100) - CRITICAL
  * 1-4 sectors: -70 points (health = 30/100) - SERIOUS
- Added critical_issues detection for sector problems
- Critical issues get a +20 scoring bonus (large drives) or +25 (small drives)
- Updated issue text to "DRIVE FAILING" for clarity
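The tiered penalties translate directly into a threshold ladder. A sketch under the thresholds listed above (the function name and `(health, issue)` return shape are assumptions):

```python
def reallocated_sector_health(sectors):
    """Tiered health score for reallocated sectors: health starts at 100 and
    the penalty grows with how many sectors have been remapped."""
    if sectors >= 10:
        return 5, "DRIVE FAILING"   # -95 points
    if sectors >= 5:
        return 15, "CRITICAL"       # -85 points
    if sectors >= 1:
        return 30, "SERIOUS"        # -70 points
    return 100, None                # no reallocations
```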

**Expected Result**:
- osd.28 will now score ~96/100 and rank #7 (right after 6 failed SMART)
- Any drive with reallocated/pending/uncorrectable sectors gets top priority
- Matches priority: Failed SMART > Critical sectors > Small failing > Rest

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 15:08:46 -05:00
1848b71c2a Optimize OSD analyzer: prioritize failing drives and improve SMART collection
Major improvements to scoring and data collection:

**Scoring Changes:**
- Failed SMART reads now return 0/100 health (was 50/100)
- Critical health issues get much higher penalties:
  * Reallocated sectors: -50 pts, 5x multiplier (was -20, 2x)
  * Pending sectors: -60 pts, 10x multiplier (was -25, 5x)
  * Uncorrectable sectors: -70 pts, 15x multiplier (was -30, 5x)
  * NVMe media errors: -60 pts, 10x multiplier (was -25, 5x)
- Revised weights: 80% health, 15% capacity, 5% resilience (was 60/30/10)
- Added priority bonuses:
  * Failed SMART + small drive (<5TB): +30 points
  * Failed SMART alone: +20 points
  * Health issues + small drive: +15 points

**Priority Order Now Enforced:**
1. Failed SMART drives (score 90-100)
2. Small drives beginning to fail (70-85)
3. Small healthy drives (40-60)
4. Large failing drives (60-75)

**Enhanced SMART Collection:**
- Added metadata.devices field parsing
- Enhanced dm-device and /dev/mapper/ resolution
- Added ceph-volume lvm list fallback
- Retry logic with 3 command variations per device
- Try with/without sudo, different device flags
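The retry logic described above (multiple command variations per device, with and without sudo) can be sketched as a small generator plus a collector. The specific flag variations and function names here are illustrative assumptions:

```python
import subprocess

def smartctl_variants(device):
    """Yield smartctl command variations for one device: each device flag,
    each tried with and without sudo. Ordering is illustrative."""
    flag_sets = [[], ["-d", "sat"], ["-d", "auto"]]
    for prefix in ([], ["sudo"]):
        for extra in flag_sets:
            yield prefix + ["smartctl", "-a", "--json"] + extra + [device]

def collect_smart(device, run=None):
    """Return the first successful smartctl output, or None if every
    variation fails. `run` is injectable for testing."""
    if run is None:
        def run(cmd):
            r = subprocess.run(cmd, capture_output=True, text=True)
            return r.stdout if r.returncode == 0 else None
    for cmd in smartctl_variants(device):
        out = run(cmd)
        if out:
            return out
    return None
```

Stopping at the first success keeps the per-device cost low while still covering drives that only answer with elevated privileges or a specific device flag.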

**Expected Impact:**
- osd.28 with reallocated sectors jumps from #14 to top 3
- SMART collection failures should drop from 6 to 0-2
- All failing drives rank above healthy drives regardless of size

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 15:05:25 -05:00
3b15377821 separate smartctl depending on device class 2025-12-22 18:23:06 -05:00
c315fa3efc Updated readme again 2025-12-22 18:15:46 -05:00
c252dbcdc4 resolves /dev/dm-* now 2025-12-22 18:05:32 -05:00
1610aa2606 removed pg and latency counters 2025-12-22 17:14:02 -05:00
db757345fb Better patterns and error handling 2025-12-22 17:08:13 -05:00
e12b53238e Pushed malformed code, whoops 2025-12-22 17:00:45 -05:00
559ed9fc94 adds /dev in front of block devices 2025-12-22 16:57:53 -05:00
43d35feb46 Enables ssh to all hosts to gather smart data 2025-12-22 16:50:04 -05:00
7dab2591b1 First test 2025-12-22 16:40:19 -05:00