Advanced analysis tool for identifying optimal Ceph OSD replacement candidates based on multiple health, capacity, resilience, and performance factors.
## Overview
This script provides comprehensive analysis of your Ceph cluster to identify which OSDs should be replaced first. It combines SMART health data, capacity optimization potential, cluster resilience impact, and performance metrics to generate a prioritized replacement list.
### Replacement Score
- **70-100**: Critical - Replace immediately (health issues, very small capacity, or severe performance problems)
- **50-69**: High Priority - Replace soon (combination of factors)
- **30-49**: Medium Priority - Consider for next upgrade cycle
- **0-29**: Low Priority - Healthy drives with less optimization potential
### Health Score
- **80-100**: Excellent - No significant issues detected
- **60-79**: Good - Minor issues, monitor
- **40-59**: Fair - Multiple issues detected
- **0-39**: Poor - Critical health issues, replace urgently
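As a quick reference, the sketch below maps both scores onto the bands listed above. The helper names are illustrative only and are not functions exposed by `ceph_osd_analyzer.py`.

```python
# Illustrative helpers: translate numeric scores into the bands documented above.
# These names are not part of ceph_osd_analyzer.py itself.

def replacement_priority(score: float) -> str:
    """Map a replacement score (0-100) to a priority label."""
    if score >= 70:
        return "Critical - replace immediately"
    if score >= 50:
        return "High Priority - replace soon"
    if score >= 30:
        return "Medium Priority - next upgrade cycle"
    return "Low Priority"

def health_rating(score: float) -> str:
    """Map a health score (0-100) to a rating label."""
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Fair"
    return "Poor"

print(replacement_priority(78.5))  # -> Critical - replace immediately
print(health_rating(45.2))         # -> Fair
```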
### Example Output
```
=== TOP REPLACEMENT CANDIDATES ===
#1 - osd.6 (HDD)
Host: compute-storage-01
Size: 0.91 TB (weight: 1.00)
Utilization: 46.3% | PGs: 4
Replacement Score: 78.5/100
Health Score: 45.2/100
Health Issues:
- Reallocated sectors: 12
- High temperature: 62°C
- Drive age: 6.2 years
Capacity Optimization:
• Very small drive (1.0TB) - high capacity gain
• Below host average (8.1TB) - improves balance
Resilience Impact:
• Host has 10 hdd OSDs (above average 7.8)
```
## Understanding the Analysis
### What Makes a Good Replacement Candidate?
1. **Small Capacity Drives**: Replacing a 1TB drive with a 16TB drive yields maximum capacity improvement
2. **Health Issues**: Drives with reallocated sectors, high wear, or errors should be replaced proactively
3. **Host Imbalance**: Hosts with many small drives benefit from consolidation
4. **Age**: Older drives (5+ years) are approaching end-of-life
5. **Performance**: High latency drives drag down cluster performance
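A rough sketch of how these factors could be weighted into a single replacement score is shown below. The weights, caps, and field names are illustrative assumptions, not the script's actual formula.

```python
# Hedged sketch of combining the factors above into one 0-100 score.
# Weights, caps, and field names are assumptions; the real script may differ.

def replacement_score(osd: dict) -> float:
    score = 0.0

    # 1. Small capacity: more points the smaller the drive (vs. a 16 TB replacement).
    score += max(0.0, (16.0 - osd["size_tb"]) / 16.0) * 30

    # 2. Health issues: invert the 0-100 health score into up to 35 points.
    score += (100 - osd["health_score"]) / 100 * 35

    # 3. Host imbalance: drives below their host's average size improve balance.
    if osd["size_tb"] < osd["host_avg_tb"]:
        score += 10

    # 4. Age: drives past ~5 years earn increasing points, capped at 15.
    score += min(max(osd["age_years"] - 5.0, 0.0) * 5, 15)

    # 5. Performance: high latency adds up to 10 points.
    score += min(osd["latency_ms"] / 10.0, 1.0) * 10

    return min(score, 100.0)

osd_like = {"size_tb": 0.91, "health_score": 45.2, "host_avg_tb": 8.1,
            "age_years": 6.2, "latency_ms": 4.0}
print(round(replacement_score(osd_like), 1))  # numbers loosely modeled on osd.6 above
```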
### Why One-Line Execution?
The one-line execution method allows you to:
- Run the analysis from any cluster node without installing anything
- Always use the latest version from git
- Integrate the analysis into monitoring/alerting scripts
- Perform quick ad-hoc analysis during maintenance
## Requirements
- Python 3.6+
- Ceph cluster with admin privileges
- `smartctl` installed (usually part of the `smartmontools` package)
- Root/sudo access for SMART data retrieval
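A minimal pre-flight check for these requirements might look like the following. This helper is hypothetical and separate from the analyzer itself.

```python
import os
import shutil
import sys

# Hypothetical pre-flight check; ceph_osd_analyzer.py may perform its own checks.
missing = [tool for tool in ("ceph", "smartctl") if shutil.which(tool) is None]
if missing:
    sys.exit("Missing required tools: " + ", ".join(missing))
if os.geteuid() != 0:
    sys.exit("Run as root (or via sudo) so smartctl can read SMART data")
if sys.version_info < (3, 6):
    sys.exit("Python 3.6+ is required")
print("Prerequisites look OK")
```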
## Architecture Notes
### Device Classes
The script automatically handles separate pools for HDDs and NVMe devices. It recognizes that these device classes have different performance characteristics, wear patterns, and SMART attributes.
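A minimal sketch of grouping OSDs by device class from `ceph osd tree -f json` is shown below; the field names reflect recent Ceph releases and should be verified against your cluster's output.

```python
import json
import subprocess
from collections import defaultdict

# Sketch: group OSDs by device class using `ceph osd tree -f json`.
tree = json.loads(subprocess.check_output(["ceph", "osd", "tree", "-f", "json"]))

osds_by_class = defaultdict(list)
for node in tree["nodes"]:
    if node["type"] == "osd":
        osds_by_class[node.get("device_class", "unknown")].append(node["name"])

for dev_class, osds in sorted(osds_by_class.items()):
    print("{}: {} OSDs".format(dev_class, len(osds)))
```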
### SMART Data Parsing
- **HDDs**: Focuses on reallocated sectors, pending sectors, spin retry count, and mechanical health
- **NVMe/SSDs**: Emphasizes wear leveling, percentage used, available spare, and media errors
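A hedged sketch of that HDD-vs-NVMe split is shown below, assuming smartmontools 7.0+ with JSON output (`smartctl -j`); the key names follow smartctl's JSON schema but are worth verifying against your drives' actual output.

```python
import json
import subprocess

def read_smart(device: str) -> dict:
    # smartctl may exit non-zero when it finds problems, so avoid check_output.
    proc = subprocess.run(["smartctl", "-a", "-j", device],
                          stdout=subprocess.PIPE, universal_newlines=True)
    return json.loads(proc.stdout)

def hdd_attributes(smart: dict) -> dict:
    """Pull the mechanical-health attributes emphasised for HDDs."""
    table = smart.get("ata_smart_attributes", {}).get("table", [])
    wanted = {5: "reallocated_sectors", 197: "pending_sectors", 10: "spin_retries"}
    return {wanted[a["id"]]: a["raw"]["value"] for a in table if a["id"] in wanted}

def nvme_attributes(smart: dict) -> dict:
    """Pull the wear-oriented attributes used for NVMe/SSD devices."""
    log = smart.get("nvme_smart_health_information_log", {})
    return {
        "percentage_used": log.get("percentage_used"),
        "available_spare": log.get("available_spare"),
        "media_errors": log.get("media_errors"),
    }
```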
### Host-Level Analysis
The script considers cluster resilience by analyzing OSD distribution across hosts. Replacing drives on overloaded hosts can improve failure domain balance.
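The sketch below illustrates that check by counting OSDs per host from `ceph osd tree -f json` and flagging hosts above the cluster average; it is illustrative rather than the script's exact logic.

```python
import json
import subprocess
from statistics import mean

# Sketch: count OSDs per host and flag hosts above the cluster average.
tree = json.loads(subprocess.check_output(["ceph", "osd", "tree", "-f", "json"]))

hosts = {n["name"]: len(n.get("children", []))
         for n in tree["nodes"] if n["type"] == "host"}
avg = mean(hosts.values())

for host, count in sorted(hosts.items(), key=lambda kv: -kv[1]):
    flag = "  <- above average" if count > avg else ""
    print("{}: {} OSDs (cluster average {:.1f}){}".format(host, count, avg, flag))
```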
## Troubleshooting
### "No SMART data available"
- Ensure `smartmontools` is installed: `apt install smartmontools` or `yum install smartmontools`
- Verify root/sudo access
- Check if the OSD device supports SMART
### "Failed to gather cluster data"
- Verify you're running on a Ceph cluster node
- Ensure proper Ceph admin permissions
- Check if `ceph` command is in PATH
### Permission Denied
- Script requires sudo/root for SMART data access
- Run with: `sudo python3 ceph_osd_analyzer.py`
## Contributing
This tool is maintained internally by LotusGuild. For improvements or bug reports, submit issues or pull requests to the git repository.
## License
Internal use only - LotusGuild infrastructure tools.