Ceph OSD Replacement Analyzer

Advanced analysis tool for identifying optimal Ceph OSD replacement candidates based on multiple health, capacity, resilience, and performance factors.

Overview

This script provides comprehensive analysis of your Ceph cluster to identify which OSDs should be replaced first. It combines SMART health data, capacity optimization potential, cluster resilience impact, and performance metrics to generate a prioritized replacement list.

Features

Multi-Factor Scoring System

  • Health Analysis (40% weight)

    • SMART attribute parsing (reallocated sectors, pending sectors, uncorrectable errors)
    • Drive wear leveling analysis (SSD/NVMe)
    • Temperature monitoring
    • Drive age calculation
    • Media error detection
  • Capacity Optimization (30% weight)

    • Identifies undersized drives for maximum capacity gains
    • Analyzes host-level capacity balance
    • Considers utilization for migration planning
  • Resilience Improvement (20% weight)

    • Host-level OSD distribution analysis
    • Identifies hosts with above-average OSD counts
    • Detects hosts with hardware issues (down OSDs)
  • Performance Metrics (10% weight)

    • Commit and apply latency analysis
    • PG distribution balance
    • Identifies performance outliers
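
Conceptually, the overall replacement score is a weighted sum of the four factor scores above. Here is a minimal sketch of how the weighting might combine; the names are illustrative rather than the script's actual internals, and each input is a 0-100 "case for replacement" score (so for health this is risk, the inverse of the Health Score reported later):

# Illustrative weights for the four factors (each factor scored 0-100,
# where higher means a stronger case for replacement).
WEIGHTS = {"health": 0.40, "capacity": 0.30, "resilience": 0.20, "performance": 0.10}

def replacement_score(factors):
    """Combine per-factor scores into a single 0-100 replacement score."""
    return sum(WEIGHTS[name] * score for name, score in factors.items())

# Example: poor health, very small drive, crowded host, middling latency.
print(replacement_score({"health": 90, "capacity": 80, "resilience": 60, "performance": 40}))
# -> 76.0, which lands in the "Critical" band described under Output Interpretation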

Comprehensive Data Collection

The script leverages multiple Ceph commands; a minimal collection sketch follows the list:

  • ceph osd tree - Cluster topology and hierarchy
  • ceph osd df - Disk usage and utilization statistics
  • ceph osd metadata - OSD device information
  • ceph osd perf - Performance metrics (latency)
  • ceph pg dump - PG distribution analysis
  • ceph device query-daemon-health-metrics - SMART data retrieval
  • Fallback to smartctl for direct SMART access
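
Each command can be collected as parsed JSON from Python; ceph's --format json flag is standard, though the helper name below is illustrative:

import json
import subprocess

def ceph_json(*args):
    """Run a ceph subcommand and parse its JSON output."""
    return json.loads(subprocess.check_output(["ceph", *args, "--format", "json"]))

osd_tree = ceph_json("osd", "tree")   # topology and hierarchy
osd_df = ceph_json("osd", "df")       # usage and utilization statistics
osd_perf = ceph_json("osd", "perf")   # commit/apply latency per OSD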

Intelligent Analysis

  • Separate analysis for HDD and NVMe device classes
  • Detailed scoring breakdowns with human-readable explanations
  • Color-coded output for quick visual assessment
  • Top 15 candidates ranked by replacement priority
  • Summary statistics by device class

Installation

Quick Execute (No Download)

Run directly from your internal git server:

sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())"

Traditional Installation

# Clone repository
git clone http://10.10.10.63:3000/LotusGuild/analyzeOSDs.git
cd analyzeOSDs

# Make executable
chmod +x ceph_osd_analyzer.py

# Run
sudo ./ceph_osd_analyzer.py

Usage

Basic Analysis

sudo python3 ceph_osd_analyzer.py

Filter by Device Class

Analyze only HDDs:

sudo python3 ceph_osd_analyzer.py --class hdd

Analyze only NVMe drives:

sudo python3 ceph_osd_analyzer.py --class nvme

Set Minimum Size Threshold

Only consider drives above a certain size:

sudo python3 ceph_osd_analyzer.py --min-size 8

Combined Options

sudo python3 ceph_osd_analyzer.py --class hdd --min-size 4

Output Interpretation

Replacement Score

  • 70-100: Critical - Replace immediately (health issues, very small, or severe performance problems)
  • 50-69: High Priority - Replace soon (combination of factors)
  • 30-49: Medium Priority - Consider for next upgrade cycle
  • 0-29: Low Priority - Healthy drives with little optimization potential

Health Score

  • 80-100: Excellent - No significant issues detected
  • 60-79: Good - Minor issues, monitor
  • 40-59: Fair - Multiple issues detected
  • 0-39: Poor - Critical health issues, replace urgently
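
For reference, the replacement-score bands translate to priorities roughly like this (a sketch of the mapping, not the script's exact labels):

def priority(replacement_score):
    """Map a 0-100 replacement score to the bands above."""
    if replacement_score >= 70:
        return "Critical - replace immediately"
    if replacement_score >= 50:
        return "High Priority - replace soon"
    if replacement_score >= 30:
        return "Medium Priority - next upgrade cycle"
    return "Low Priority"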

Example Output

=== TOP REPLACEMENT CANDIDATES ===

#1 - osd.6 (HDD)
  Host: compute-storage-01
  Size: 0.91 TB (weight: 1.00)
  Utilization: 46.3% | PGs: 4
  Replacement Score: 78.5/100
  Health Score: 45.2/100
  Health Issues:
    - Reallocated sectors: 12
    - High temperature: 62°C
    - Drive age: 6.2 years
  Capacity Optimization:
    • Very small drive (1.0TB) - high capacity gain
    • Below host average (8.1TB) - improves balance
  Resilience Impact:
    • Host has 10 hdd OSDs (above average 7.8)

Understanding the Analysis

What Makes a Good Replacement Candidate?

  1. Small Capacity Drives: Replacing a 1TB drive with a 16TB drive yields the maximum capacity improvement (see the arithmetic after this list)
  2. Health Issues: Drives with reallocated sectors, high wear, or errors should be replaced proactively
  3. Host Imbalance: Hosts with many small drives benefit from consolidation
  4. Age: Older drives (5+ years) are approaching end-of-life
  5. Performance: High latency drives drag down cluster performance
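
The arithmetic behind point 1, with illustrative numbers:

# A host with 8 x 1 TB OSDs holds 8 TB raw.
old_host_tb = 8 * 1.0
# Swapping a single 1 TB drive for a 16 TB drive nearly triples that.
new_host_tb = 7 * 1.0 + 16.0
print("%.1f TB -> %.1f TB (%.1fx raw capacity)"
      % (old_host_tb, new_host_tb, new_host_tb / old_host_tb))
# 8.0 TB -> 23.0 TB (2.9x raw capacity)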

Why One-Line Execution?

The one-line execution method allows you to:

  • Run analysis from any cluster node without installation
  • Always use the latest version from git
  • Integrate into monitoring/alerting scripts
  • Perform quick ad-hoc analysis during maintenance

Requirements

  • Python 3.6+
  • Ceph cluster with admin privileges
  • smartctl installed (part of the smartmontools package)
  • Root/sudo access for SMART data retrieval

Architecture Notes

Device Classes

The script automatically handles separate pools for HDDs and NVMe devices. It recognizes that these device classes have different performance characteristics, wear patterns, and SMART attributes.
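
The --class filter can be built on the device_class field that ceph osd tree reports for each OSD; a sketch, with the helper name illustrative:

def filter_by_class(osd_tree, device_class):
    """Keep only OSD nodes of one device class ("hdd", "ssd", "nvme")."""
    return [n for n in osd_tree["nodes"]
            if n["type"] == "osd" and n.get("device_class") == device_class]

hdds = filter_by_class(ceph_json("osd", "tree"), "hdd")  # ceph_json: see the data-collection sketch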

SMART Data Parsing

  • HDDs: Focuses on reallocated sectors, pending sectors, spin retry count, and mechanical health
  • NVMe/SSDs: Emphasizes wear leveling, percentage used, available spare, and media errors
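
A minimal sketch of the smartctl fallback path, reading the JSON that smartctl emits with --json (field names follow smartctl's JSON schema; error handling is elided):

import json
import subprocess

def smart_summary(device):
    """Pull a few replacement-relevant SMART fields for one device."""
    # smartctl exits nonzero for unhealthy drives, so don't use check=True.
    proc = subprocess.run(["smartctl", "--json", "-a", device], stdout=subprocess.PIPE)
    data = json.loads(proc.stdout)
    if "nvme_smart_health_information_log" in data:  # NVMe devices
        log = data["nvme_smart_health_information_log"]
        return {"percentage_used": log.get("percentage_used"),
                "available_spare": log.get("available_spare"),
                "media_errors": log.get("media_errors")}
    table = data.get("ata_smart_attributes", {}).get("table", [])  # ATA/HDD
    attrs = {a["name"]: a["raw"]["value"] for a in table}
    return {"reallocated_sectors": attrs.get("Reallocated_Sector_Ct"),
            "pending_sectors": attrs.get("Current_Pending_Sector"),
            "spin_retries": attrs.get("Spin_Retry_Count")}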

Host-Level Analysis

The script considers cluster resilience by analyzing OSD distribution across hosts. Replacing drives on overloaded hosts can improve failure domain balance.
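
A sketch of that host-balance calculation, using the nodes list from ceph osd tree (field names match ceph's JSON output; the aggregation itself is illustrative):

from collections import defaultdict

def osds_per_host(osd_tree):
    """Count OSDs under each host node of a parsed ceph osd tree."""
    by_id = {n["id"]: n for n in osd_tree["nodes"]}
    counts = defaultdict(int)
    for node in osd_tree["nodes"]:
        if node["type"] == "host":
            counts[node["name"]] = sum(
                1 for child in node.get("children", [])
                if by_id.get(child, {}).get("type") == "osd")
    return counts

counts = osds_per_host(ceph_json("osd", "tree"))  # ceph_json: see the data-collection sketch
average = sum(counts.values()) / max(len(counts), 1)
crowded = {h: n for h, n in counts.items() if n > average}  # replacements favor these hosts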

Troubleshooting

"No SMART data available"

  • Ensure smartmontools is installed: apt install smartmontools or yum install smartmontools
  • Verify root/sudo access
  • Check if the OSD device supports SMART

"Failed to gather cluster data"

  • Verify you're running on a Ceph cluster node
  • Ensure proper Ceph admin permissions
  • Check if ceph command is in PATH

Permission Denied

  • Script requires sudo/root for SMART data access
  • Run with: sudo python3 ceph_osd_analyzer.py

Contributing

This tool is maintained internally by LotusGuild. For improvements or bug reports, submit issues or pull requests to the git repository.

License

Internal use only - LotusGuild infrastructure tools.

Related Tools

  • hwmonDaemon - Hardware monitoring daemon
  • Other LotusGuild infrastructure tools

Changelog

v1.0.0 (Initial Release)

  • Multi-factor scoring system
  • SMART health analysis for HDD and NVMe
  • Capacity optimization analysis
  • Resilience impact calculation
  • Performance metrics integration
  • Color-coded output
  • Device class filtering
  • Minimum size filtering