
Jared Vititoe 3d498a4092 CRITICAL FIX: Parse nested Ceph device health metrics for NVMe drives

**Root Cause Found**: All 6 NVMe SMART failures were due to parsing bug!

Ceph's `device query-daemon-health-metrics` returns data in nested format:
```json
{
  "DEVICE_ID": {
    "nvme_smart_health_information_log": { ... }
  }
}
```

Script was checking for `nvme_smart_health_information_log` at top level,
so it always failed and fell back to SSH smartctl (which also failed).

**Fix**:
- Extract first device entry from nested dict structure
- Maintain backward compatibility for direct format
- Now correctly parses NVMe SMART from Ceph's built-in metrics

**Expected Impact**:
- All 6 NVMe drives will now successfully read SMART data
- Should drop from "CRITICAL: No SMART data" to proper health scores
- Only truly healthy NVMe drives will show 100/100 health
- Failing NVMe drives will be properly detected and ranked

**Testing**:
Verified `ceph device query-daemon-health-metrics osd.0` returns full
NVMe SMART data including:
- available_spare: 100%
- percentage_used: 12%
- media_errors: 0
- temperature: 38°C

This data was always available but wasn't being parsed!

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-06 15:11:53 -05:00

3.5 KiB


NVMe SMART Data Collection Troubleshooting

Issue Observed

SMART data collection fails for every NVMe drive (osd.0, osd.10, osd.22, osd.23) with this error:

DEBUG: All SMART methods failed for /dev/nvme0n1 on <hostname>

Commands Attempted (All Failed)

sudo smartctl -a -j /dev/nvme0n1 -d nvme
smartctl -a -j /dev/nvme0n1 -d nvme (without sudo)
sudo smartctl -a -j /dev/nvme0n1 (without -d flag)

Possible Causes

1. Smartctl Version Too Old

NVMe JSON output requires smartctl 7.0+. Check version:

ssh large1 "smartctl --version | head -1"

If the version is below 7.0, the -j (JSON) flag is not available at all, so every command above that uses it will fail.
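
The version check can also be automated. The following is a minimal sketch, assuming the first line of `smartctl --version` starts with `smartctl <major>.<minor>`; the helper names `parse_smartctl_version` and `supports_json` are hypothetical and not part of the existing script.

```python
import re


def parse_smartctl_version(first_line):
    """Return (major, minor) from the first line of `smartctl --version`, or None."""
    match = re.match(r"smartctl\s+(\d+)\.(\d+)", first_line.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def supports_json(first_line):
    """True when the reported version is at least 7.0, the -j threshold noted above."""
    version = parse_smartctl_version(first_line)
    return version is not None and version >= (7, 0)


# Illustrative banner line; build details vary by distribution.
sample = "smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0] (local build)"
print(parse_smartctl_version(sample), supports_json(sample))
```

Feeding it the output of the `head -1` command above would let the script skip the JSON attempts on hosts where they cannot work.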

2. NVMe Admin Passthrough Permission

Reading the SMART log uses NVMe admin passthrough commands, which require the CAP_SYS_ADMIN capability. A sudo session started over SSH might not retain that capability.

3. NVMe Device Naming

Some systems use /dev/nvme0 instead of /dev/nvme0n1 for SMART queries.
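
If the naming difference turns out to matter, the script could try both paths. The sketch below derives the controller path from a namespace path with a regular expression. The helper names are hypothetical, and the pattern assumes the usual Linux `/dev/nvme<ctrl>n<ns>[p<part>]` naming.

```python
import re

# Matches namespace (and optional partition) nodes such as /dev/nvme0n1 or /dev/nvme0n1p2.
_NVME_NAMESPACE = re.compile(r"^(/dev/nvme\d+)n\d+(?:p\d+)?$")


def nvme_controller_path(device_path):
    """Map /dev/nvme0n1 to /dev/nvme0; return other paths unchanged."""
    match = _NVME_NAMESPACE.match(device_path)
    return match.group(1) if match else device_path


def candidate_paths(device_path):
    """Paths to try in order: the given one first, then the controller node if different."""
    controller = nvme_controller_path(device_path)
    if controller == device_path:
        return [device_path]
    return [device_path, controller]


print(candidate_paths("/dev/nvme0n1"))
```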

Recommended Fixes

Option 1: Try Without JSON Flag for NVMe

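Modify the script so that, for NVMe devices, it falls back to plain-text output and parses that. The snippet below gets the text from nvme smart-log (nvme-cli) rather than from smartctl: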

# For NVMe, if JSON fails, try text output
if "nvme" in device_path:
    result = run_command(f"sudo nvme smart-log {device_path}", host=hostname)
    # Parse text output

Option 2: Use nvme-cli Tool

The nvme command often works better than smartctl for NVMe:

ssh large1 "sudo nvme smart-log /dev/nvme0 -o json"
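
If the JSON form of this command works, its output could be normalized roughly as follows. This is a sketch under assumptions: the key names (`avail_spare`, `percent_used`, `media_errors`, `temperature`) and the Kelvin temperature unit reflect typical nvme-cli output and may differ between versions, and the sample string is hand-written for illustration, not captured from a host.

```python
import json


def parse_nvme_smart_log(raw):
    """Normalize `nvme smart-log -o json` output; return None if it cannot be parsed."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    kelvin = data.get("temperature")
    return {
        "available_spare": data.get("avail_spare"),
        "percentage_used": data.get("percent_used"),
        "media_errors": data.get("media_errors"),
        # Assumption: nvme-cli reports Kelvin in JSON mode; convert to Celsius.
        "temperature_c": kelvin - 273 if isinstance(kelvin, (int, float)) else None,
    }


# Hand-written sample, not real device output.
sample = '{"temperature": 311, "avail_spare": 100, "percent_used": 12, "media_errors": 0}'
print(parse_nvme_smart_log(sample))
```

Returning the same field names that smartctl uses would let the rest of the script stay agnostic about which tool produced the data.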

Option 3: Check Ceph's Built-in Metrics First

The script tries ceph device query-daemon-health-metrics first, which should work for NVMe if the OSD daemon has access to the device. Verify by running it directly:

ceph device query-daemon-health-metrics osd.0 -f json

If this works locally but not via the script, there may be a permission issue.
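
The commit message at the top of this page reports that the command nests its output under a device ID, so a parser that looks for `nvme_smart_health_information_log` only at the top level will miss it. The sketch below accepts both shapes. The function name `extract_nvme_log` is hypothetical, and unlike the fix described in the commit (which takes the first device entry) it scans every entry for the key.

```python
import json

NVME_LOG_KEY = "nvme_smart_health_information_log"


def extract_nvme_log(metrics):
    """Return the NVMe SMART log from direct or device-ID-nested metrics, else None."""
    if not isinstance(metrics, dict):
        return None
    # Direct format: the log sits at the top level.
    if NVME_LOG_KEY in metrics:
        return metrics[NVME_LOG_KEY]
    # Nested format: {"DEVICE_ID": {"nvme_smart_health_information_log": {...}}}
    for entry in metrics.values():
        if isinstance(entry, dict) and NVME_LOG_KEY in entry:
            return entry[NVME_LOG_KEY]
    return None


# Illustrative payload shaped like the example in the commit message.
nested = json.loads(
    '{"DEVICE_ID": {"nvme_smart_health_information_log":'
    ' {"percentage_used": 12, "media_errors": 0}}}'
)
direct = nested["DEVICE_ID"]
print(extract_nvme_log(nested) == extract_nvme_log(direct))
```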

Testing Commands

Test on compute-storage-01 (osd.0)

# Check smartctl version
ssh compute-storage-01 "smartctl --version"

# Try direct smartctl
ssh compute-storage-01 "sudo smartctl -a /dev/nvme0n1"

# Try nvme-cli
ssh compute-storage-01 "sudo nvme smart-log /dev/nvme0"

# Try from Ceph directly
ceph device query-daemon-health-metrics osd.0 -f json

Test on large1 (osd.10, osd.23)

# Two NVMe devices on this host
ssh large1 "sudo smartctl -a /dev/nvme0n1"
ssh large1 "sudo smartctl -a /dev/nvme1n1"

# Try nvme-cli
ssh large1 "sudo nvme list"
ssh large1 "sudo nvme smart-log /dev/nvme0"
ssh large1 "sudo nvme smart-log /dev/nvme1"

Workaround for Now

The 6 OSDs with failed SMART collection all score 100/100 and rank at the top, so the prioritization itself is working as designed. However, the script needs to distinguish between two cases:

- Truly failed/unreadable drives (hardware problem)
- SMART collection failures (script/permission issue)

If these NVMe drives are actually healthy but we just can't read SMART, they shouldn't all be #1 priority.

Quick Fix: Check if Drive is Actually Accessible

Add a health check before marking SMART as failed:

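# Before returning None, check whether the device node still exists on the host
health_check = run_command(f"test -e {device_path} && echo 'OK'", host=hostname)
# Strip whitespace: command output usually ends with a newline
# (assumes run_command returns a string, or None on failure)
if health_check and health_check.strip() == "OK":
    # Device node exists but SMART failed - might be permissions
    return {"status": "smart_read_failed", "device_accessible": True}
else:
    # Device node is missing, or the host could not be reached
    return {"status": "device_failed", "device_accessible": False}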

This would let us score SMART-read-failures differently from truly-dead drives.
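
One way to act on those two statuses is to order results by severity before the normal scoring runs. The sketch below is only an illustration: the `SEVERITY` values, the `triage_order` helper, and the `osd` key are hypothetical, and entries with any other status sort last on the assumption that their regular health score is applied elsewhere.

```python
# Hypothetical ordering: lower number = look at it sooner.
SEVERITY = {"device_failed": 0, "smart_read_failed": 1}


def triage_order(results):
    """Sort so missing devices come before SMART read failures, then everything else."""
    return sorted(results, key=lambda r: SEVERITY.get(r.get("status"), 2))


# Made-up entries for illustration only.
results = [
    {"osd": "osd.0", "status": "smart_read_failed", "device_accessible": True},
    {"osd": "osd.10", "status": "device_failed", "device_accessible": False},
    {"osd": "osd.22", "status": "ok"},
]
for entry in triage_order(results):
    print(entry["osd"], entry["status"])
```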

Action Items

- Test smartctl version on all nodes
- Test nvme-cli availability
- Verify Ceph daemon health metrics work locally
- Consider adding device accessibility check
- May need to add nvme-cli as fallback method
