analyzeOSDs/Claude.md
Commit 1848b71c2a by Jared Vititoe (2026-01-06 15:05:25 -05:00): Optimize OSD analyzer: prioritize failing drives and improve SMART collection
Major improvements to scoring and data collection:

**Scoring Changes:**
- Failed SMART reads now return 0/100 health (was 50/100)
- Critical health issues get much higher penalties:
  * Reallocated sectors: -50 pts, 5x multiplier (was -20, 2x)
  * Pending sectors: -60 pts, 10x multiplier (was -25, 5x)
  * Uncorrectable sectors: -70 pts, 15x multiplier (was -30, 5x)
  * NVMe media errors: -60 pts, 10x multiplier (was -25, 5x)
- Revised weights: 80% health, 15% capacity, 5% resilience (was 60/30/10)
- Added priority bonuses:
  * Failed SMART + small drive (<5TB): +30 points
  * Failed SMART alone: +20 points
  * Health issues + small drive: +15 points

**Priority Order Now Enforced:**
1. Failed SMART drives (score 90-100)
2. Small drives beginning to fail (70-85)
3. Large failing drives (60-75)
4. Small healthy drives (40-60)

**Enhanced SMART Collection:**
- Added metadata.devices field parsing
- Enhanced dm-device and /dev/mapper/ resolution
- Added ceph-volume lvm list fallback
- Retry logic with 3 command variations per device
- Try with/without sudo, different device flags

**Expected Impact:**
- osd.28 with reallocated sectors jumps from #14 to top 3
- SMART collection failures should drop from 6 to 0-2
- All failing drives rank above healthy drives regardless of size

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Ceph OSD Replacement Analyzer - Project Documentation

Project Overview

Purpose: Intelligent analysis tool for identifying optimal Ceph OSD replacement candidates across an entire cluster by analyzing health metrics, capacity optimization potential, and cluster resilience factors.

Type: Python 3 CLI tool for Ceph storage cluster maintenance

Target Users: Storage administrators, DevOps engineers, and infrastructure teams managing Ceph clusters

Architecture

Core Components

  1. Data Collection Layer (ceph_osd_analyzer.py:34-172)

    • Executes Ceph commands locally and via SSH
    • Retrieves SMART data from all cluster nodes
    • Handles both local ceph device query-daemon-health-metrics and remote smartctl fallback
    • Device path resolution with dm-device mapping support
  2. Analysis Engine (ceph_osd_analyzer.py:173-357)

    • SMART health parsing for HDD and NVMe devices
    • Capacity optimization scoring
    • Cluster resilience impact calculation
    • Multi-factor weighted scoring system
  3. Reporting System (ceph_osd_analyzer.py:361-525)

    • Color-coded console output
    • Top 15 ranked replacement candidates
    • Summary by device class (HDD/NVMe)
    • Per-host analysis breakdown

Key Design Decisions

Remote SMART Data Collection: The script uses SSH to gather SMART data from all cluster nodes, not just the local node. This is critical because OSDs are distributed across multiple physical hosts.

Fallback Strategy: Primary method uses ceph device query-daemon-health-metrics, with automatic fallback to direct smartctl queries via SSH if Ceph's built-in metrics are unavailable.

Device Mapping: Handles complex storage configurations including device-mapper devices, resolving them to physical drives using lsblk and symlink resolution.

Weighted Scoring: 80% health, 15% capacity optimization, 5% resilience (revised from 60/30/10 in the latest commit) - prioritizes failing drives while considering operational efficiency. OSDs whose SMART data cannot be read now receive a 0/100 health score so they surface at the top of the ranking.
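
As a rough sketch, the composite ranking could combine these pieces as follows. The function name, the inversion of the health sub-score, and the "beginning to fail" threshold are assumptions, not the script's actual code:

def composite_score(health, capacity, resilience, smart_failed=False, size_tb=10.0):
    """Hypothetical composite ranking; higher = better replacement candidate.

    health is 0-100 (100 = healthy), so it is inverted here: a failing
    drive should rank high. capacity and resilience are the bonus-style
    sub-scores described under Scoring Algorithm below.
    """
    score = (100 - health) * 0.80 + capacity * 0.15 + resilience * 0.05

    # Priority bonuses from the latest commit: failed SMART reads and
    # small drives are pushed toward the top of the ranking. The health
    # threshold for "beginning to fail" is an assumption.
    if smart_failed and size_tb < 5:
        score += 30
    elif smart_failed:
        score += 20
    elif health < 60 and size_tb < 5:
        score += 15

    return min(score, 100.0)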

Scoring Algorithm

Health Score (80% weight)

HDD Metrics (ceph_osd_analyzer.py:183-236):

  • Reallocated sectors (ID 5): -50 points, 5x count multiplier
  • Spin retry count (ID 10): -15 points
  • Pending sectors (ID 197): -60 points, 10x count multiplier (critical indicator)
  • Uncorrectable sectors (ID 198): -70 points, 15x count multiplier (critical)
  • Temperature (ID 190/194): -10 points if >60°C
  • Age (ID 9): -15 points if >5 years

NVMe Metrics (ceph_osd_analyzer.py:239-267):

  • Available spare: penalized if <50%
  • Percentage used: -30 points if >80%
  • Media errors: -60 points, 10x count multiplier
  • Temperature: -10 points if >70°C
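
As a rough illustration of how these penalties might be applied (the values mirror the HDD list above; the helper name, the input format, and the way the count multiplier combines with the base penalty are assumptions, not the actual parse_smart_health() internals):

# attribute ID -> (base penalty, per-count multiplier); values from the list above
HDD_PENALTIES = {
    5:   (-50, 5),    # reallocated sectors
    10:  (-15, 0),    # spin retry count (flat penalty)
    197: (-60, 10),   # pending sectors (critical)
    198: (-70, 15),   # uncorrectable sectors (critical)
}

def hdd_health_score(attrs, temp_c=0, power_on_years=0.0):
    """attrs maps SMART attribute ID -> raw count (assumed input format)."""
    score = 100.0
    for attr_id, (penalty, mult) in HDD_PENALTIES.items():
        count = attrs.get(attr_id, 0)
        if count > 0:
            score += penalty        # flat base penalty for any occurrence
            score -= mult * count   # count-scaled component (assumed semantics)
    if temp_c > 60:
        score -= 10                 # temperature (ID 190/194)
    if power_on_years > 5:
        score -= 15                 # age (ID 9)
    return max(score, 0.0)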

Capacity Score (15% weight)

(ceph_osd_analyzer.py:271-311)

  • Small drives prioritized: <2TB = +40 points (maximum capacity gain)
  • Medium drives: 2-5TB = +30 points, 5-10TB = +15 points
  • High utilization penalty: >70% = -15 points (migration complexity)
  • Host balance bonus: +15 points if below host average weight
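
A minimal sketch of these capacity heuristics, with hypothetical function and parameter names:

def capacity_score(size_tb, utilization_pct, weight=None, host_avg_weight=None):
    """Illustrative only; thresholds come from the list above."""
    score = 0
    if size_tb < 2:
        score += 40      # small drive: maximum capacity gain on replacement
    elif size_tb < 5:
        score += 30
    elif size_tb < 10:
        score += 15
    if utilization_pct > 70:
        score -= 15      # heavily used OSDs are costlier to migrate
    if weight is not None and host_avg_weight is not None and weight < host_avg_weight:
        score += 15      # below the host's average CRUSH weight
    return max(score, 0)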

Resilience Score (5% weight)

(ceph_osd_analyzer.py:313-357)

  • Host OSD count >20% above the cluster average: +20 points
  • Presence of down OSDs on same host: +15 points (hardware issues)
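
Sketched in the same illustrative style (names are hypothetical):

def resilience_score(host_osd_count, cluster_avg_osds, host_has_down_osds):
    """Illustrative only; thresholds come from the list above."""
    score = 0
    if host_osd_count > cluster_avg_osds * 1.2:
        score += 20      # host carries >20% more OSDs than the cluster average
    if host_has_down_osds:
        score += 15      # down OSDs on the same host suggest hardware trouble
    return score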

Usage Patterns

sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())" --debug --class hdd

Why: Always uses latest version, no local installation, integrates easily into automation.

Command-Line Options

  • --class [hdd|nvme]: Filter by device type
  • --min-size N: Minimum OSD size in TB
  • --debug: Enable verbose debugging output
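
A plausible argparse definition for these options (a sketch; the actual parser in ceph_osd_analyzer.py may differ). Note that --class shadows a Python keyword, so it needs an explicit dest:

import argparse

parser = argparse.ArgumentParser(description="Ceph OSD replacement analyzer")
parser.add_argument("--class", dest="device_class", choices=["hdd", "nvme"],
                    help="Filter by device type")
parser.add_argument("--min-size", type=float, metavar="N",
                    help="Minimum OSD size in TB")
parser.add_argument("--debug", action="store_true",
                    help="Enable verbose debugging output")
args = parser.parse_args()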

Typical Workflow

  1. Run analysis during maintenance window
  2. Identify top 3-5 candidates with scores >70
  3. Review health issues and capacity gains
  4. Plan replacement based on available hardware
  5. Execute OSD out/destroy/replace operations

Dependencies

Required Packages

  • Python 3.6+ (standard library only, no external dependencies)
  • smartmontools package (smartctl binary)
  • SSH access configured between all cluster nodes

Required Permissions

  • Ceph admin keyring access
  • sudo privileges for SMART data retrieval
  • SSH key-based authentication to all OSD hosts

Ceph Commands Used

  • ceph osd tree -f json: Cluster topology
  • ceph osd df -f json: Disk usage statistics
  • ceph osd metadata osd.N -f json: OSD device information
  • ceph device query-daemon-health-metrics osd.N: SMART data
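
For example, the disk usage statistics could be fetched and parsed like this (illustrative only; the "nodes" and "utilization" field names follow Ceph's JSON output but should be verified against your release):

import json
import subprocess

raw = subprocess.check_output(["ceph", "osd", "df", "-f", "json"],
                              universal_newlines=True)
for node in json.loads(raw).get("nodes", []):
    print("osd.{}: {:.1f}% used".format(node["id"], node["utilization"]))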

Output Interpretation

Replacement Score Ranges

  • 70-100 (RED): Critical - immediate replacement recommended
  • 50-69 (YELLOW): High priority - plan replacement soon
  • 30-49: Medium priority - next upgrade cycle
  • 0-29 (GREEN): Low priority - healthy drives

Health Score Ranges

  • 80-100 (GREEN): Excellent condition
  • 60-79 (YELLOW): Monitor for issues
  • 40-59: Fair - multiple concerns
  • 0-39 (RED): Critical - replace urgently
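
The color mapping for replacement scores can be illustrated with standard ANSI escape codes (a sketch using the thresholds above; the script's actual formatting may differ):

RED, YELLOW, GREEN, RESET = "\033[91m", "\033[93m", "\033[92m", "\033[0m"

def colorize_replacement_score(score):
    if score >= 70:
        return RED + "{:.0f}".format(score) + RESET     # critical
    if score >= 50:
        return YELLOW + "{:.0f}".format(score) + RESET  # high priority
    if score >= 30:
        return "{:.0f}".format(score)                   # medium: uncolored in the table above
    return GREEN + "{:.0f}".format(score) + RESET       # low priority / healthy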

Common Issues & Solutions

"No SMART data available"

  • Cause: Missing smartmontools or insufficient permissions
  • Solution: apt install smartmontools and verify sudo access

SSH Timeout Errors

  • Cause: Node unreachable or SSH keys not configured
  • Solution: Verify connectivity with ssh -o ConnectTimeout=5 <host> hostname

Device Path Resolution Failures

  • Cause: Non-standard OSD deployment or encryption
  • Solution: Enable --debug to see device resolution attempts

dm-device Mapping Issues

  • Cause: LVM or LUKS encrypted OSDs
  • Solution: Script automatically resolves via lsblk -no pkname
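
The resolution step might look roughly like this (a sketch with a hypothetical helper name; the SSH options mirror those described under Security Considerations):

import subprocess

def resolve_parent_device(dev, host=None):
    """Resolve a dm-/LVM device to its parent disk via `lsblk -no pkname`."""
    cmd = ["lsblk", "-no", "pkname", dev]
    if host:
        cmd = ["ssh", "-o", "StrictHostKeyChecking=no",
               "-o", "ConnectTimeout=5", host] + cmd
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, universal_newlines=True)
    out = result.stdout.strip()
    # lsblk prints the parent kernel name (e.g. "sda"); fall back to the
    # original device if resolution fails
    return "/dev/" + out.splitlines()[0] if out else dev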

Development Notes

Code Structure

  • Single-file design: Easier to execute remotely via exec()
  • Minimal dependencies: Uses only Python standard library
  • Color-coded output: ANSI escape codes for terminal display
  • Debug mode: Comprehensive logging when --debug enabled

Notable Functions

run_command() (ceph_osd_analyzer.py:34-56): Universal command executor with SSH support and JSON parsing

get_device_path_for_osd() (ceph_osd_analyzer.py:84-122): Complex device resolution logic handling metadata, symlinks, and dm-devices

get_smart_data_remote() (ceph_osd_analyzer.py:124-145): Remote SMART data collection with device type detection

parse_smart_health() (ceph_osd_analyzer.py:173-269): SMART attribute parsing with device-class-specific logic
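
A hedged sketch of what such an executor could look like (the signature and behavior are assumptions based on the description above):

import json
import shlex
import subprocess

def run_command(cmd, host=None, parse_json=False, timeout=30):
    """Run cmd locally, or via SSH when host is given; optionally parse JSON."""
    argv = shlex.split(cmd) if isinstance(cmd, str) else list(cmd)
    if host:
        argv = ["ssh", "-o", "StrictHostKeyChecking=no",
                "-o", "ConnectTimeout=5", host] + argv
    result = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True, timeout=timeout)
    if result.returncode != 0:
        return None
    return json.loads(result.stdout) if parse_json else result.stdout

# e.g.: topology = run_command("ceph osd tree -f json", parse_json=True)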

Future Enhancement Opportunities

  1. Parallel data collection: Use threading for faster cluster-wide analysis (see the sketch after this list)
  2. Historical trending: Track scores over time to predict failures
  3. JSON output mode: For integration with monitoring systems
  4. Cost-benefit analysis: Factor in replacement drive costs
  5. PG rebalance impact: Estimate data movement required
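
For enhancement #1, a minimal sketch using the standard library's concurrent.futures; collect_smart is a hypothetical stand-in for the script's per-OSD collection path:

from concurrent.futures import ThreadPoolExecutor, as_completed

def collect_all(osd_ids, collect_smart, max_workers=8):
    """Collect SMART data for many OSDs in parallel; SSH latency dominates,
    so threads (not processes) are sufficient."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(collect_smart, osd): osd for osd in osd_ids}
        for fut in as_completed(futures):
            osd = futures[fut]
            try:
                results[osd] = fut.result()
            except Exception:
                results[osd] = None  # keep going if one host is unreachable
    return results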

Security Considerations

Permissions Required

  • Root access for smartctl execution
  • SSH access to all OSD hosts
  • Ceph admin keyring (read-only sufficient)

Network Requirements

  • Script assumes SSH connectivity between nodes
  • No outbound internet access required (internal-only tool)
  • Hardcoded internal git server URL: http://10.10.10.63:3000

SSH Configuration

  • Uses -o StrictHostKeyChecking=no for automated execution
  • 5-second connection timeout to handle unreachable nodes
  • Assumes key-based authentication is configured

Internal Git Server: http://10.10.10.63:3000/LotusGuild/analyzeOSDs

Related Projects:

  • hwmonDaemon: Hardware monitoring daemon for continuous health checks
  • Other LotusGuild infrastructure automation tools

Maintenance

Version Control

  • Maintained in internal git repository
  • One-line execution always pulls from main branch
  • No formal versioning; latest commit is production

Testing Checklist

  • Test on cluster with mixed HDD/NVMe OSDs
  • Verify SSH connectivity to all hosts
  • Confirm SMART data retrieval for both device types
  • Validate dm-device resolution on encrypted OSDs
  • Check output formatting with various terminal widths
  • Test --class and --min-size filtering

Performance Characteristics

Execution Time: roughly 5-15 seconds per OSD, dominated by SSH round trips, so total runtime scales with cluster size

Bottlenecks:

  • Serial OSD processing (parallelization would help)
  • SSH round-trip times for SMART data
  • SMART data parsing can be slow for unresponsive drives

Resource Usage: Minimal CPU/memory, I/O bound on SSH operations

Intended Audience: LotusGuild infrastructure team

Support: Submit issues or pull requests to internal git repository