# Ceph OSD Replacement Analyzer

Advanced analysis tool for identifying optimal Ceph OSD replacement candidates based on multiple health, capacity, resilience, and performance factors.

## Overview

This script provides comprehensive analysis of your Ceph cluster to identify which OSDs should be replaced first. It combines SMART health data, capacity optimization potential, cluster resilience impact, and performance metrics to generate a prioritized replacement list.

## Features

### Multi-Factor Scoring System

- **Health Analysis (40% weight)**
  - SMART attribute parsing (reallocated sectors, pending sectors, uncorrectable errors)
  - Drive wear-leveling analysis (SSD/NVMe)
  - Temperature monitoring
  - Drive age calculation
  - Media error detection
- **Capacity Optimization (30% weight)**
  - Identifies undersized drives for maximum capacity gains
  - Analyzes host-level capacity balance
  - Considers utilization for migration planning
- **Resilience Improvement (20% weight)**
  - Host-level OSD distribution analysis
  - Identifies hosts with above-average OSD counts
  - Detects hosts with hardware issues (down OSDs)
- **Performance Metrics (10% weight)**
  - Commit and apply latency analysis
  - PG distribution balance
  - Identifies performance outliers
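
The four weights combine into a single 0-100 replacement score. A minimal sketch, assuming each factor is first normalized to a 0-100 sub-score where higher means a stronger case for replacement; the function and factor names are illustrative, not the script's actual API:

```python
# Weighted combination of the four factors; only the weights (40/30/20/10)
# come from this README -- the sub-score inputs are hypothetical.
WEIGHTS = {"health": 0.40, "capacity": 0.30, "resilience": 0.20, "performance": 0.10}

def replacement_score(factors):
    """Combine 0-100 per-factor sub-scores into one 0-100 priority score."""
    return round(sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS), 1)

print(replacement_score({"health": 55, "capacity": 90,
                         "resilience": 60, "performance": 20}))  # → 63.0
```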

### Comprehensive Data Collection

The script leverages multiple Ceph commands:

- `ceph osd tree` - Cluster topology and hierarchy
- `ceph osd df` - Disk usage and utilization statistics
- `ceph osd metadata` - OSD device information
- `ceph osd perf` - Performance metrics (latency)
- `ceph pg dump` - PG distribution analysis
- `ceph device query-daemon-health-metrics` - SMART data retrieval
- Fallback to `smartctl` for direct SMART access
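
These commands all support machine-readable output via `-f json`. A minimal sketch of how such collection might look; the `ceph_json` helper is illustrative, not the script's actual code:

```python
# Run a ceph subcommand with JSON output and parse the result.
# Requires running on a cluster node with admin privileges.
import json
import subprocess

def ceph_json(*args):
    out = subprocess.run(["ceph", *args, "-f", "json"], check=True,
                         stdout=subprocess.PIPE, universal_newlines=True).stdout
    return json.loads(out)

# Usage (on a live cluster):
#   tree = ceph_json("osd", "tree")   # topology and hierarchy
#   usage = ceph_json("osd", "df")    # per-OSD utilization
#   perf = ceph_json("osd", "perf")   # commit/apply latency
```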

### Intelligent Analysis

- Separate analysis for HDD and NVMe device classes
- Detailed scoring breakdowns with human-readable explanations
- Color-coded output for quick visual assessment
- Top 15 candidates ranked by replacement priority
- Summary statistics by device class

## Installation

### Quick Execute (No Download)

Run directly from your internal git server:

```bash
sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())"
```

### Traditional Installation

```bash
# Clone repository
git clone http://10.10.10.63:3000/LotusGuild/analyzeOSDs.git
cd analyzeOSDs

# Make executable
chmod +x ceph_osd_analyzer.py

# Run
sudo ./ceph_osd_analyzer.py
```

## Usage

### Basic Analysis

```bash
sudo python3 ceph_osd_analyzer.py
```

### Filter by Device Class

Analyze only HDDs:

```bash
sudo python3 ceph_osd_analyzer.py --class hdd
```

Analyze only NVMe drives:

```bash
sudo python3 ceph_osd_analyzer.py --class nvme
```

### Set Minimum Size Threshold

Only consider drives above a given size (in TB):

```bash
sudo python3 ceph_osd_analyzer.py --min-size 8
```

### Combined Options

```bash
sudo python3 ceph_osd_analyzer.py --class hdd --min-size 4
```

## Output Interpretation

### Replacement Score

- **70-100**: Critical - Replace immediately (health issues, very small capacity, or severe performance problems)
- **50-69**: High Priority - Replace soon (combination of factors)
- **30-49**: Medium Priority - Consider for the next upgrade cycle
- **0-29**: Low Priority - Healthy drives with little optimization potential
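
These score bands can be expressed as a small helper (the function name is illustrative):

```python
# Map a 0-100 replacement score to the priority bands listed above.
def priority(score):
    if score >= 70:
        return "Critical"
    if score >= 50:
        return "High Priority"
    if score >= 30:
        return "Medium Priority"
    return "Low Priority"

print(priority(78.5))  # → Critical
```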

### Health Score

- **80-100**: Excellent - No significant issues detected
- **60-79**: Good - Minor issues, monitor
- **40-59**: Fair - Multiple issues detected
- **0-39**: Poor - Critical health issues, replace urgently

### Example Output

```
=== TOP REPLACEMENT CANDIDATES ===

#1 - osd.6 (HDD)
  Host: compute-storage-01
  Size: 0.91 TB (weight: 1.00)
  Utilization: 46.3% | PGs: 4
  Replacement Score: 78.5/100
  Health Score: 45.2/100
  Health Issues:
    - Reallocated sectors: 12
    - High temperature: 62°C
    - Drive age: 6.2 years
  Capacity Optimization:
    • Very small drive (1.0TB) - high capacity gain
    • Below host average (8.1TB) - improves balance
  Resilience Impact:
    • Host has 10 hdd OSDs (above average 7.8)
```

## Understanding the Analysis

### What Makes a Good Replacement Candidate?

1. **Small Capacity Drives**: Replacing a 1TB drive with a 16TB drive yields the maximum capacity improvement
2. **Health Issues**: Drives with reallocated sectors, high wear, or errors should be replaced proactively
3. **Host Imbalance**: Hosts with many small drives benefit from consolidation
4. **Age**: Older drives (5+ years) are approaching end-of-life
5. **Performance**: High-latency drives drag down cluster performance
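
Point 1 can be made concrete with a quick calculation, using the 16TB replacement size from the example above (the helper name is illustrative):

```python
# Capacity gained by swapping a drive for a 16 TB replacement:
# the smaller the current drive, the bigger the gain.
REPLACEMENT_TB = 16.0

def capacity_gain_tb(current_tb):
    return max(REPLACEMENT_TB - current_tb, 0.0)

print(capacity_gain_tb(1.0))  # → 15.0 (small drive: large gain)
print(capacity_gain_tb(8.0))  # → 8.0  (mid-size drive: modest gain)
```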

### Why One-Line Execution?

The one-line execution method allows you to:

- Run the analysis from any cluster node without installation
- Always use the latest version from git
- Integrate the tool into monitoring/alerting scripts
- Perform quick ad-hoc analysis during maintenance

## Requirements

- Python 3.6+
- Ceph cluster with admin privileges
- `smartctl` installed (usually part of the `smartmontools` package)
- Root/sudo access for SMART data retrieval

## Architecture Notes

### Device Classes

The script automatically handles separate pools for HDDs and NVMe devices. It recognizes that these device classes have different performance characteristics, wear patterns, and SMART attributes.

### SMART Data Parsing

- **HDDs**: Focuses on reallocated sectors, pending sectors, spin retry count, and mechanical health
- **NVMe/SSDs**: Emphasizes wear leveling, percentage used, available spare, and media errors
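
With `smartctl`'s JSON mode (`smartctl -a -j <device>`), these attributes can be extracted along the following lines; the field names follow smartmontools' JSON output format, but treat the exact keys as assumptions for your smartctl version:

```python
# Extract the HDD and NVMe attributes discussed above from parsed
# `smartctl -a -j` output (field names assumed from smartmontools JSON).
def hdd_attrs(smart):
    table = smart.get("ata_smart_attributes", {}).get("table", [])
    raw = {row["id"]: row["raw"]["value"] for row in table}
    # SMART IDs: 5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector
    return {"reallocated": raw.get(5, 0), "pending": raw.get(197, 0)}

def nvme_attrs(smart):
    log = smart.get("nvme_smart_health_information_log", {})
    return {"percent_used": log.get("percentage_used", 0),
            "media_errors": log.get("media_errors", 0)}
```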

### Host-Level Analysis

The script considers cluster resilience by analyzing OSD distribution across hosts. Replacing drives on overloaded hosts can improve failure domain balance.
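
The host-balance check can be sketched as follows (the host names and helper are illustrative, not the script's actual code):

```python
# Flag hosts whose OSD count is above the cluster-wide average,
# mirroring the "above average" resilience check described above.
from collections import Counter

def overloaded_hosts(osd_hosts):
    """osd_hosts: one host-name entry per OSD."""
    counts = Counter(osd_hosts)
    average = len(osd_hosts) / len(counts)
    return sorted(host for host, n in counts.items() if n > average)

print(overloaded_hosts(["node1", "node1", "node1", "node2", "node3"]))  # → ['node1']
```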

## Troubleshooting

### "No SMART data available"

- Ensure `smartmontools` is installed: `apt install smartmontools` or `yum install smartmontools`
- Verify root/sudo access
- Check whether the OSD device supports SMART

### "Failed to gather cluster data"

- Verify you're running on a Ceph cluster node
- Ensure proper Ceph admin permissions
- Check that the `ceph` command is in PATH

### Permission Denied

- The script requires sudo/root for SMART data access
- Run with: `sudo python3 ceph_osd_analyzer.py`

## Contributing

This tool is maintained internally by LotusGuild. For improvements or bug reports, submit issues or pull requests to the git repository.

## License

Internal use only - LotusGuild infrastructure tools.

## Related Tools

- [hwmonDaemon](http://10.10.10.63:3000/LotusGuild/hwmonDaemon) - Hardware monitoring daemon
- Other LotusGuild infrastructure tools

## Changelog

### v1.0.0 (Initial Release)

- Multi-factor scoring system
- SMART health analysis for HDD and NVMe
- Capacity optimization analysis
- Resilience impact calculation
- Performance metrics integration
- Color-coded output
- Device class filtering
- Minimum size filtering