analyzeOSDs

LotusGuild/analyzeOSDs

Commit Graph

Author	SHA1	Message	Date

Jared Vititoe

3d498a4092

CRITICAL FIX: Parse nested Ceph device health metrics for NVMe drives

**Root Cause Found**: All 6 NVMe SMART failures were due to a parsing bug!

Ceph's `device query-daemon-health-metrics` returns data in a nested format, keyed by device ID:
```json
{
  "DEVICE_ID": {
    "nvme_smart_health_information_log": { ... }
  }
}
```

The script was checking for `nvme_smart_health_information_log` at the top level,
so the check always failed and the script fell back to SSH smartctl (which also failed).

**Fix**:
- Extract first device entry from nested dict structure
- Maintain backward compatibility for direct format
- Now correctly parses NVMe SMART from Ceph's built-in metrics
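
The extraction described in the fix could look roughly like the sketch below. It is an illustration only, not the code from this commit: the function name `extract_nvme_log` and the constant `NVME_LOG_KEY` are hypothetical, and it assumes the metrics have already been parsed from JSON into a Python dict.

```python
from typing import Any, Optional

# Key name taken from the commit message; the constant itself is hypothetical.
NVME_LOG_KEY = "nvme_smart_health_information_log"


def extract_nvme_log(metrics: Any) -> Optional[dict]:
    """Return the NVMe SMART log from direct or nested metrics, else None."""
    if not isinstance(metrics, dict) or not metrics:
        return None

    # Direct format (backward compatibility): the log sits at the top level.
    if NVME_LOG_KEY in metrics:
        log = metrics[NVME_LOG_KEY]
        return log if isinstance(log, dict) else None

    # Nested format: {"DEVICE_ID": {...}}. Take the first device entry.
    first_entry = next(iter(metrics.values()))
    if isinstance(first_entry, dict):
        log = first_entry.get(NVME_LOG_KEY)
        if isinstance(log, dict):
            return log

    # No NVMe log found; the caller decides on any fallback.
    return None
```

Returning `None` rather than raising keeps the decision about a fallback (such as the SSH smartctl path mentioned above) with the caller.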

**Expected Impact**:
- All 6 NVMe drives will now successfully read SMART data
- Status should change from "CRITICAL: No SMART data" to proper health scores
- Only truly healthy NVMe drives will show 100/100 health
- Failing NVMe drives will be properly detected and ranked

**Testing**:
Verified `ceph device query-daemon-health-metrics osd.0` returns full
NVMe SMART data including:
- available_spare: 100%
- percentage_used: 12%
- media_errors: 0
- temperature: 38°C
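
As a rough illustration of reading those four fields, the sketch below parses a JSON payload shaped like the nested format above. The `SAMPLE` string is built by hand from the values quoted in this message, not captured command output, and `summarize_nvme` and its result keys are hypothetical names. It also assumes each field is a plain number, which may not hold for every drive or tool version.

```python
import json

# Hand-built sample using the values listed above; not real command output.
SAMPLE = """
{
  "DEVICE_ID": {
    "nvme_smart_health_information_log": {
      "available_spare": 100,
      "percentage_used": 12,
      "media_errors": 0,
      "temperature": 38
    }
  }
}
"""


def summarize_nvme(raw_json: str) -> dict:
    """Pull the four health fields out of a nested metrics payload."""
    data = json.loads(raw_json)
    # The payload is keyed by device ID; use the first (assumed only) entry.
    device_id, payload = next(iter(data.items()))
    log = payload.get("nvme_smart_health_information_log", {})
    return {
        "device": device_id,
        "available_spare_pct": log.get("available_spare"),
        "percentage_used": log.get("percentage_used"),
        "media_errors": log.get("media_errors"),
        "temperature_c": log.get("temperature"),
    }


if __name__ == "__main__":
    print(summarize_nvme(SAMPLE))
```

Missing fields come back as `None` rather than raising, so a caller can tell "no data" apart from a real reading of zero.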

This data was always available but wasn't being parsed!

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-06 15:11:53 -05:00

1 Commit