CRITICAL FIX: Parse nested Ceph device health metrics for NVMe drives
**Root Cause Found**: All 6 NVMe SMART failures were due to a parsing bug!
Ceph's `device query-daemon-health-metrics` returns data in a nested format, keyed by device ID:
```json
{
  "DEVICE_ID": {
    "nvme_smart_health_information_log": { ... }
  }
}
```
The script checked for `nvme_smart_health_information_log` at the top level,
so the check always failed and the script fell back to SSH smartctl (which also failed).
**Fix**:
- Extract first device entry from nested dict structure
- Maintain backward compatibility for direct format
- Now correctly parses NVMe SMART from Ceph's built-in metrics
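The fix can be sketched as a small standalone helper. This is a minimal sketch rather than the script's actual code: `has_smart`, `extract_smart_payload`, and `SMART_KEYS` are hypothetical names, and the sample payloads are invented to mirror the nested and direct shapes described above.

```python
# Hypothetical names; the two key strings are the ones the script checks for.
SMART_KEYS = ("ata_smart_attributes", "nvme_smart_health_information_log")


def has_smart(payload):
    """Return True if payload is a dict carrying one of the SMART sections."""
    return isinstance(payload, dict) and any(key in payload for key in SMART_KEYS)


def extract_smart_payload(data):
    """Return the SMART dict from a nested or direct payload, else None."""
    if not isinstance(data, dict) or not data:
        return None
    # Nested format: take the first (and usually only) device entry.
    first_entry = next(iter(data.values()))
    if has_smart(first_entry):
        return first_entry
    # Direct format: the SMART sections already sit at the top level.
    if has_smart(data):
        return data
    return None


# Made-up sample payloads for illustration only.
nested = {"DEVICE_ID": {"nvme_smart_health_information_log": {"media_errors": 0}}}
direct = {"nvme_smart_health_information_log": {"media_errors": 0}}

print(extract_smart_payload(nested))
print(extract_smart_payload(direct))
```

Returning `None` for anything unrecognized leaves the caller free to fall through to the SSH path, as the script already does.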
**Expected Impact**:
- All 6 NVMe drives will now successfully read SMART data
- Drives should move from "CRITICAL: No SMART data" to proper health scores
- Only truly healthy NVMe drives will show 100/100 health
- Failing NVMe drives will be properly detected and ranked
**Testing**:
Verified `ceph device query-daemon-health-metrics osd.0` returns full
NVMe SMART data including:
- available_spare: 100%
- percentage_used: 12%
- media_errors: 0
- temperature: 38°C
This data was always available but wasn't being parsed!
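Once the device entry is extracted, the fields listed above can be read straight from the NVMe log. The sketch below assumes the key names inside `nvme_smart_health_information_log` match the field names in the list above; `summarize_nvme` is a hypothetical helper, and the sample values are copied from the testing notes, not from a captured payload.

```python
def summarize_nvme(device_data):
    """Pick out the NVMe health fields; missing fields come back as None."""
    log = device_data.get("nvme_smart_health_information_log", {})
    fields = ("available_spare", "percentage_used", "media_errors", "temperature")
    return {name: log.get(name) for name in fields}


# Sample values taken from the testing notes above (assumed key names).
sample = {
    "nvme_smart_health_information_log": {
        "available_spare": 100,
        "percentage_used": 12,
        "media_errors": 0,
        "temperature": 38,
    }
}

summary = summarize_nvme(sample)
print(summary)
```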
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@@ -193,14 +193,25 @@ def get_device_health(osd_id, hostname):
     """Get device SMART health metrics from the appropriate host"""
     if DEBUG:
         print(f"{Colors.CYAN}DEBUG: Getting health for osd.{osd_id} on {hostname}{Colors.END}")

     # First try ceph's built-in health metrics
     data = run_command(f"ceph device query-daemon-health-metrics osd.{osd_id} -f json 2>/dev/null", parse_json=True)

-    if data and ('ata_smart_attributes' in data or 'nvme_smart_health_information_log' in data):
-        if DEBUG:
-            print(f"{Colors.GREEN}DEBUG: Got SMART data from ceph device query{Colors.END}")
-        return data
+    if data:
+        # Ceph returns data nested under device ID, extract it
+        if isinstance(data, dict) and len(data) > 0:
+            # Get the first (and usually only) device entry
+            device_data = next(iter(data.values())) if data else None
+            if device_data and ('ata_smart_attributes' in device_data or 'nvme_smart_health_information_log' in device_data):
+                if DEBUG:
+                    print(f"{Colors.GREEN}DEBUG: Got SMART data from ceph device query (nested format){Colors.END}")
+                return device_data
+
+        # Also check if data is already in the right format (backward compatibility)
+        if 'ata_smart_attributes' in data or 'nvme_smart_health_information_log' in data:
+            if DEBUG:
+                print(f"{Colors.GREEN}DEBUG: Got SMART data from ceph device query (direct format){Colors.END}")
+            return data

     # If that fails, get device path and query via SSH
     device_path = get_device_path_for_osd(osd_id, hostname)
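The change reads only the first device entry, which its own comment describes as "usually" the only one. For the case where a daemon reports several devices, a variant that keeps every entry carrying SMART data could look like the sketch below. This is not what the commit does: `find_smart_entries` and `SMART_KEYS` are hypothetical names and the sample payload is invented.

```python
# Hypothetical names; the two key strings are the ones the script checks for.
SMART_KEYS = ("ata_smart_attributes", "nvme_smart_health_information_log")


def find_smart_entries(data):
    """Return {device_id: payload} for each nested entry that has SMART data."""
    if not isinstance(data, dict):
        return {}
    return {
        device_id: payload
        for device_id, payload in data.items()
        if isinstance(payload, dict) and any(key in payload for key in SMART_KEYS)
    }


# Invented payload: one entry with SMART data, one without.
multi = {
    "DEVICE_A": {"nvme_smart_health_information_log": {"media_errors": 0}},
    "DEVICE_B": {"error": "no data"},
}

entries = find_smart_entries(multi)
print(sorted(entries))
```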