Advanced analysis tool for identifying optimal Ceph OSD replacement candidates based on multiple health, capacity, resilience, and performance factors.
## Overview
This script provides comprehensive analysis of your Ceph cluster to identify which OSDs should be replaced first. It combines SMART health data, capacity optimization potential, cluster resilience impact, and performance metrics to generate a prioritized replacement list.
### Replacement Score
- **70-100**: Critical - Replace immediately (health issues, very small capacity, or severe performance problems)
- **50-69**: High Priority - Replace soon (combination of factors)
- **30-49**: Medium Priority - Consider for next upgrade cycle
- **0-29**: Low Priority - Healthy drives with less optimization potential
### Health Score
- **80-100**: Excellent - No significant issues detected
- **60-79**: Good - Minor issues, monitor
- **40-59**: Fair - Multiple issues detected
- **0-39**: Poor - Critical health issues, replace urgently
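As a quick reference, the sketch below maps both scores onto the bands listed above. The helper names are illustrative only and are not functions exposed by `ceph_osd_analyzer.py`.

```python
# Illustrative helpers: translate numeric scores into the bands documented above.
# These names are not part of ceph_osd_analyzer.py itself.

def replacement_priority(score: float) -> str:
    """Map a replacement score (0-100) to a priority label."""
    if score >= 70:
        return "Critical - replace immediately"
    if score >= 50:
        return "High Priority - replace soon"
    if score >= 30:
        return "Medium Priority - next upgrade cycle"
    return "Low Priority"

def health_rating(score: float) -> str:
    """Map a health score (0-100) to a rating label."""
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Fair"
    return "Poor"

print(replacement_priority(78.5))  # -> Critical - replace immediately
print(health_rating(45.2))         # -> Fair
```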
### Example Output
```
=== TOP REPLACEMENT CANDIDATES ===
#1 - osd.6 (HDD)
Host: compute-storage-01
Size: 0.91 TB (weight: 1.00)
Utilization: 46.3% | PGs: 4
Replacement Score: 78.5/100
Health Score: 45.2/100
Health Issues:
- Reallocated sectors: 12
- High temperature: 62°C
- Drive age: 6.2 years
Capacity Optimization:
• Very small drive (1.0TB) - high capacity gain
• Below host average (8.1TB) - improves balance
Resilience Impact:
• Host has 10 hdd OSDs (above average 7.8)
```
## Understanding the Analysis
### What Makes a Good Replacement Candidate?
1. **Small Capacity Drives**: Replacing a 1TB drive with a 16TB drive yields maximum capacity improvement
2. **Health Issues**: Drives with reallocated sectors, high wear, or errors should be replaced proactively
3. **Host Imbalance**: Hosts with many small drives benefit from consolidation
4. **Age**: Older drives (5+ years) are approaching end-of-life
5. **Performance**: High latency drives drag down cluster performance
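A rough sketch of how these factors could be weighted into a single replacement score is shown below. The weights, caps, and field names are illustrative assumptions, not the script's actual formula.

```python
# Hedged sketch of combining the factors above into one 0-100 score.
# Weights, caps, and field names are assumptions; the real script may differ.

def replacement_score(osd: dict) -> float:
    score = 0.0

    # 1. Small capacity: more points the smaller the drive (vs. a 16 TB replacement).
    score += max(0.0, (16.0 - osd["size_tb"]) / 16.0) * 30

    # 2. Health issues: invert the 0-100 health score into up to 35 points.
    score += (100 - osd["health_score"]) / 100 * 35

    # 3. Host imbalance: drives below their host's average size improve balance.
    if osd["size_tb"] < osd["host_avg_tb"]:
        score += 10

    # 4. Age: drives past ~5 years earn increasing points, capped at 15.
    score += min(max(osd["age_years"] - 5.0, 0.0) * 5, 15)

    # 5. Performance: high latency adds up to 10 points.
    score += min(osd["latency_ms"] / 10.0, 1.0) * 10

    return min(score, 100.0)

osd_like = {"size_tb": 0.91, "health_score": 45.2, "host_avg_tb": 8.1,
            "age_years": 6.2, "latency_ms": 4.0}
print(round(replacement_score(osd_like), 1))  # numbers loosely modeled on osd.6 above
```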
### Why One-Line Execution?
The one-line execution method allows you to:
- Run the analysis from any cluster node without installing anything
- Always use the latest version from git
- Integrate the analysis into monitoring/alerting scripts
- Perform quick ad-hoc analysis during maintenance
## Requirements
- Python 3.6+
- Ceph cluster with admin privileges
- `smartctl` installed (usually part of the `smartmontools` package)
- Root/sudo access for SMART data retrieval
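A minimal pre-flight check for these requirements might look like the following. This helper is hypothetical and separate from the analyzer itself.

```python
import os
import shutil
import sys

# Hypothetical pre-flight check; ceph_osd_analyzer.py may perform its own checks.
missing = [tool for tool in ("ceph", "smartctl") if shutil.which(tool) is None]
if missing:
    sys.exit("Missing required tools: " + ", ".join(missing))
if os.geteuid() != 0:
    sys.exit("Run as root (or via sudo) so smartctl can read SMART data")
if sys.version_info < (3, 6):
    sys.exit("Python 3.6+ is required")
print("Prerequisites look OK")
```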
## Architecture Notes
### Device Classes
The script automatically handles separate pools for HDDs and NVMe devices. It recognizes that these device classes have different performance characteristics, wear patterns, and SMART attributes.
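A minimal sketch of grouping OSDs by device class from `ceph osd tree -f json` is shown below; the field names reflect recent Ceph releases and should be verified against your cluster's output.

```python
import json
import subprocess
from collections import defaultdict

# Sketch: group OSDs by device class using `ceph osd tree -f json`.
tree = json.loads(subprocess.check_output(["ceph", "osd", "tree", "-f", "json"]))

osds_by_class = defaultdict(list)
for node in tree["nodes"]:
    if node["type"] == "osd":
        osds_by_class[node.get("device_class", "unknown")].append(node["name"])

for dev_class, osds in sorted(osds_by_class.items()):
    print("{}: {} OSDs".format(dev_class, len(osds)))
```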
### SMART Data Parsing
- **HDDs**: Focuses on reallocated sectors, pending sectors, spin retry count, and mechanical health
- **NVMe/SSDs**: Emphasizes wear leveling, percentage used, available spare, and media errors
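A hedged sketch of that HDD-vs-NVMe split is shown below, assuming smartmontools 7.0+ with JSON output (`smartctl -j`); the key names follow smartctl's JSON schema but are worth verifying against your drives' actual output.

```python
import json
import subprocess

def read_smart(device: str) -> dict:
    # smartctl may exit non-zero when it finds problems, so avoid check_output.
    proc = subprocess.run(["smartctl", "-a", "-j", device],
                          stdout=subprocess.PIPE, universal_newlines=True)
    return json.loads(proc.stdout)

def hdd_attributes(smart: dict) -> dict:
    """Pull the mechanical-health attributes emphasised for HDDs."""
    table = smart.get("ata_smart_attributes", {}).get("table", [])
    wanted = {5: "reallocated_sectors", 197: "pending_sectors", 10: "spin_retries"}
    return {wanted[a["id"]]: a["raw"]["value"] for a in table if a["id"] in wanted}

def nvme_attributes(smart: dict) -> dict:
    """Pull the wear-oriented attributes used for NVMe/SSD devices."""
    log = smart.get("nvme_smart_health_information_log", {})
    return {
        "percentage_used": log.get("percentage_used"),
        "available_spare": log.get("available_spare"),
        "media_errors": log.get("media_errors"),
    }
```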
### Host-Level Analysis
The script considers cluster resilience by analyzing OSD distribution across hosts. Replacing drives on overloaded hosts can improve failure domain balance.
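The sketch below illustrates that check by counting OSDs per host from `ceph osd tree -f json` and flagging hosts above the cluster average; it is illustrative rather than the script's exact logic.

```python
import json
import subprocess
from statistics import mean

# Sketch: count OSDs per host and flag hosts above the cluster average.
tree = json.loads(subprocess.check_output(["ceph", "osd", "tree", "-f", "json"]))

hosts = {n["name"]: len(n.get("children", []))
         for n in tree["nodes"] if n["type"] == "host"}
avg = mean(hosts.values())

for host, count in sorted(hosts.items(), key=lambda kv: -kv[1]):
    flag = "  <- above average" if count > avg else ""
    print("{}: {} OSDs (cluster average {:.1f}){}".format(host, count, avg, flag))
```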
## Troubleshooting
### "No SMART data available"
- Ensure `smartmontools` is installed: `apt install smartmontools` or `yum install smartmontools`
- Verify root/sudo access
- Check if the OSD device supports SMART
### "Failed to gather cluster data"
- Verify you're running on a Ceph cluster node
- Ensure proper Ceph admin permissions
- Check if `ceph` command is in PATH
### Permission Denied
- Script requires sudo/root for SMART data access
- Run with: `sudo python3 ceph_osd_analyzer.py`
## Contributing
This tool is maintained internally by LotusGuild. For improvements or bug reports, submit issues or pull requests to the git repository.
## License
Internal use only - LotusGuild infrastructure tools.