Ceph OSD Replacement Analyzer

Advanced analysis tool for identifying optimal Ceph OSD replacement candidates based on multiple health, capacity, resilience, and performance factors.

Overview

This script provides comprehensive analysis of your Ceph cluster to identify which OSDs should be replaced first. It combines SMART health data, capacity optimization potential, cluster resilience impact, and performance metrics to generate a prioritized replacement list.

Features

Multi-Factor Scoring System

  • Health Analysis (40% weight)

    • SMART attribute parsing (reallocated sectors, pending sectors, uncorrectable errors)
    • Drive wear leveling analysis (SSD/NVMe)
    • Temperature monitoring
    • Drive age calculation
    • Media error detection
  • Capacity Optimization (30% weight)

    • Identifies undersized drives for maximum capacity gains
    • Analyzes host-level capacity balance
    • Considers utilization for migration planning
  • Resilience Improvement (20% weight)

    • Host-level OSD distribution analysis
    • Identifies hosts with above-average OSD counts
    • Detects hosts with hardware issues (down OSDs)
  • Performance Metrics (10% weight)

    • Commit and apply latency analysis
    • PG distribution balance
    • Identifies performance outliers
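
Conceptually, the overall replacement score is a weighted sum of the four factor scores above. Here is a minimal sketch of how the weighting might combine; the names are illustrative rather than the script's actual internals, and each input is a 0-100 "case for replacement" score (so for health this is risk, the inverse of the Health Score reported later):

# Illustrative weights for the four factors (each factor scored 0-100,
# where higher means a stronger case for replacement).
WEIGHTS = {"health": 0.40, "capacity": 0.30, "resilience": 0.20, "performance": 0.10}

def replacement_score(factors):
    """Combine per-factor scores into a single 0-100 replacement score."""
    return sum(WEIGHTS[name] * score for name, score in factors.items())

# Example: poor health, very small drive, crowded host, middling latency.
print(replacement_score({"health": 90, "capacity": 80, "resilience": 60, "performance": 40}))
# -> 76.0, which lands in the "Critical" band described under Output Interpretation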

Comprehensive Data Collection

The script leverages multiple Ceph commands; a minimal collection sketch follows the list:

  • ceph osd tree - Cluster topology and hierarchy
  • ceph osd df - Disk usage and utilization statistics
  • ceph osd metadata - OSD device information
  • ceph osd perf - Performance metrics (latency)
  • ceph pg dump - PG distribution analysis
  • ceph device query-daemon-health-metrics - SMART data retrieval
  • Fallback to smartctl for direct SMART access
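
Each command can be collected as parsed JSON from Python; ceph's --format json flag is standard, though the helper name below is illustrative:

import json
import subprocess

def ceph_json(*args):
    """Run a ceph subcommand and parse its JSON output."""
    return json.loads(subprocess.check_output(["ceph", *args, "--format", "json"]))

osd_tree = ceph_json("osd", "tree")   # topology and hierarchy
osd_df = ceph_json("osd", "df")       # usage and utilization statistics
osd_perf = ceph_json("osd", "perf")   # commit/apply latency per OSD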

Intelligent Analysis

  • Separate analysis for HDD and NVMe device classes
  • Detailed scoring breakdowns with human-readable explanations
  • Color-coded output for quick visual assessment
  • Top 15 candidates ranked by replacement priority
  • Summary statistics by device class

Installation

Quick Execute (No Download)

Run directly from your internal git server:

sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())"

Traditional Installation

# Clone repository
git clone http://10.10.10.63:3000/LotusGuild/analyzeOSDs.git
cd analyzeOSDs

# Make executable
chmod +x ceph_osd_analyzer.py

# Run
sudo ./ceph_osd_analyzer.py

Usage

Basic Analysis

sudo python3 ceph_osd_analyzer.py

Filter by Device Class

Analyze only HDDs:

sudo python3 ceph_osd_analyzer.py --class hdd

Analyze only NVMe drives:

sudo python3 ceph_osd_analyzer.py --class nvme

Set Minimum Size Threshold

Only consider drives above a certain size:

sudo python3 ceph_osd_analyzer.py --min-size 8

Combined Options

sudo python3 ceph_osd_analyzer.py --class hdd --min-size 4

Output Interpretation

Replacement Score

  • 70-100: Critical - Replace immediately (health issues, very small, or severe performance problems)
  • 50-69: High Priority - Replace soon (combination of factors)
  • 30-49: Medium Priority - Consider for next upgrade cycle
  • 0-29: Low Priority - Healthy drives with little optimization potential

Health Score

  • 80-100: Excellent - No significant issues detected
  • 60-79: Good - Minor issues, monitor
  • 40-59: Fair - Multiple issues detected
  • 0-39: Poor - Critical health issues, replace urgently
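
For reference, the replacement-score bands translate to priorities roughly like this (a sketch of the mapping, not the script's exact labels):

def priority(replacement_score):
    """Map a 0-100 replacement score to the bands above."""
    if replacement_score >= 70:
        return "Critical - replace immediately"
    if replacement_score >= 50:
        return "High Priority - replace soon"
    if replacement_score >= 30:
        return "Medium Priority - next upgrade cycle"
    return "Low Priority"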

Example Output

=== TOP REPLACEMENT CANDIDATES ===

#1 - osd.6 (HDD)
  Host: compute-storage-01
  Size: 0.91 TB (weight: 1.00)
  Utilization: 46.3% | PGs: 4
  Replacement Score: 78.5/100
  Health Score: 45.2/100
  Health Issues:
    - Reallocated sectors: 12
    - High temperature: 62°C
    - Drive age: 6.2 years
  Capacity Optimization:
    • Very small drive (1.0TB) - high capacity gain
    • Below host average (8.1TB) - improves balance
  Resilience Impact:
    • Host has 10 hdd OSDs (above average 7.8)

Understanding the Analysis

What Makes a Good Replacement Candidate?

  1. Small Capacity Drives: Replacing a 1TB drive with a 16TB drive yields the maximum capacity improvement (see the arithmetic after this list)
  2. Health Issues: Drives with reallocated sectors, high wear, or errors should be replaced proactively
  3. Host Imbalance: Hosts with many small drives benefit from consolidation
  4. Age: Older drives (5+ years) are approaching end-of-life
  5. Performance: High latency drives drag down cluster performance
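
The arithmetic behind point 1, with illustrative numbers:

# A host with 8 x 1 TB OSDs holds 8 TB raw.
old_host_tb = 8 * 1.0
# Swapping a single 1 TB drive for a 16 TB drive nearly triples that.
new_host_tb = 7 * 1.0 + 16.0
print("%.1f TB -> %.1f TB (%.1fx raw capacity)"
      % (old_host_tb, new_host_tb, new_host_tb / old_host_tb))
# 8.0 TB -> 23.0 TB (2.9x raw capacity)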

Why One-Line Execution?

The one-line execution method allows you to:

  • Run analysis from any cluster node without installation
  • Always use the latest version from git
  • Integrate into monitoring/alerting scripts
  • Perform quick ad-hoc analysis during maintenance

Requirements

  • Python 3.6+
  • Ceph cluster with admin privileges
  • smartctl installed (part of the smartmontools package)
  • Root/sudo access for SMART data retrieval

Architecture Notes

Device Classes

The script automatically handles separate pools for HDDs and NVMe devices. It recognizes that these device classes have different performance characteristics, wear patterns, and SMART attributes.
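
The --class filter can be built on the device_class field that ceph osd tree reports for each OSD; a sketch, with the helper name illustrative:

def filter_by_class(osd_tree, device_class):
    """Keep only OSD nodes of one device class ("hdd", "ssd", "nvme")."""
    return [n for n in osd_tree["nodes"]
            if n["type"] == "osd" and n.get("device_class") == device_class]

hdds = filter_by_class(ceph_json("osd", "tree"), "hdd")  # ceph_json: see the data-collection sketch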

SMART Data Parsing

  • HDDs: Focuses on reallocated sectors, pending sectors, spin retry count, and mechanical health
  • NVMe/SSDs: Emphasizes wear leveling, percentage used, available spare, and media errors
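
A minimal sketch of the smartctl fallback path, reading the JSON that smartctl emits with --json (field names follow smartctl's JSON schema; error handling is elided):

import json
import subprocess

def smart_summary(device):
    """Pull a few replacement-relevant SMART fields for one device."""
    # smartctl exits nonzero for unhealthy drives, so don't use check=True.
    proc = subprocess.run(["smartctl", "--json", "-a", device], stdout=subprocess.PIPE)
    data = json.loads(proc.stdout)
    if "nvme_smart_health_information_log" in data:  # NVMe devices
        log = data["nvme_smart_health_information_log"]
        return {"percentage_used": log.get("percentage_used"),
                "available_spare": log.get("available_spare"),
                "media_errors": log.get("media_errors")}
    table = data.get("ata_smart_attributes", {}).get("table", [])  # ATA/HDD
    attrs = {a["name"]: a["raw"]["value"] for a in table}
    return {"reallocated_sectors": attrs.get("Reallocated_Sector_Ct"),
            "pending_sectors": attrs.get("Current_Pending_Sector"),
            "spin_retries": attrs.get("Spin_Retry_Count")}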

Host-Level Analysis

The script considers cluster resilience by analyzing OSD distribution across hosts. Replacing drives on overloaded hosts can improve failure domain balance.
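
A sketch of that host-balance calculation, using the nodes list from ceph osd tree (field names match ceph's JSON output; the aggregation itself is illustrative):

from collections import defaultdict

def osds_per_host(osd_tree):
    """Count OSDs under each host node of a parsed ceph osd tree."""
    by_id = {n["id"]: n for n in osd_tree["nodes"]}
    counts = defaultdict(int)
    for node in osd_tree["nodes"]:
        if node["type"] == "host":
            counts[node["name"]] = sum(
                1 for child in node.get("children", [])
                if by_id.get(child, {}).get("type") == "osd")
    return counts

counts = osds_per_host(ceph_json("osd", "tree"))  # ceph_json: see the data-collection sketch
average = sum(counts.values()) / max(len(counts), 1)
crowded = {h: n for h, n in counts.items() if n > average}  # replacements favor these hosts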

Troubleshooting

"No SMART data available"

  • Ensure smartmontools is installed: apt install smartmontools or yum install smartmontools
  • Verify root/sudo access
  • Check if the OSD device supports SMART

"Failed to gather cluster data"

  • Verify you're running on a Ceph cluster node
  • Ensure proper Ceph admin permissions
  • Check if ceph command is in PATH

Permission Denied

  • Script requires sudo/root for SMART data access
  • Run with: sudo python3 ceph_osd_analyzer.py

Contributing

This tool is maintained internally by LotusGuild. For improvements or bug reports, submit issues or pull requests to the git repository.

License

Internal use only - LotusGuild infrastructure tools.

Related Tools

  • hwmonDaemon - Hardware monitoring daemon
  • Other LotusGuild infrastructure tools

Changelog

v1.0.0 (Initial Release)

  • Multi-factor scoring system
  • SMART health analysis for HDD and NVMe
  • Capacity optimization analysis
  • Resilience impact calculation
  • Performance metrics integration
  • Color-coded output
  • Device class filtering
  • Minimum size filtering