# Ceph OSD Replacement Analyzer - Project Documentation

## Project Overview

**Purpose**: Intelligent analysis tool for identifying optimal Ceph OSD replacement candidates across an entire cluster by analyzing health metrics, capacity optimization potential, and cluster resilience factors.

**Type**: Python 3 CLI tool for Ceph storage cluster maintenance

**Target Users**: Storage administrators, DevOps engineers, and infrastructure teams managing Ceph clusters

## Architecture

### Core Components

1. **Data Collection Layer** ([ceph_osd_analyzer.py:34-172](ceph_osd_analyzer.py#L34-L172))
   - Executes Ceph commands locally and via SSH
   - Retrieves SMART data from all cluster nodes
   - Handles both local `ceph device query-daemon-health-metrics` and remote `smartctl` fallback
   - Device path resolution with dm-device mapping support

2. **Analysis Engine** ([ceph_osd_analyzer.py:173-357](ceph_osd_analyzer.py#L173-L357))
   - SMART health parsing for HDD and NVMe devices
   - Capacity optimization scoring
   - Cluster resilience impact calculation
   - Multi-factor weighted scoring system

3. **Reporting System** ([ceph_osd_analyzer.py:361-525](ceph_osd_analyzer.py#L361-L525))
   - Color-coded console output
   - Top 15 ranked replacement candidates
   - Summary by device class (HDD/NVMe)
   - Per-host analysis breakdown

### Key Design Decisions

**Remote SMART Data Collection**: The script uses SSH to gather SMART data from all cluster nodes, not just the local node. This is critical because OSDs are distributed across multiple physical hosts.

**Fallback Strategy**: The primary method uses `ceph device query-daemon-health-metrics`, with automatic fallback to direct `smartctl` queries via SSH if Ceph's built-in metrics are unavailable.

**Device Mapping**: Handles complex storage configurations, including device-mapper devices, resolving them to physical drives using `lsblk` and symlink resolution.

**Weighted Scoring**: 60% health, 30% capacity optimization, 10% resilience - prioritizes failing drives while considering operational efficiency.
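As an illustration of how the three weighted factors might combine into a single replacement score, here is a minimal sketch. The function name, argument names, and the inversion of the health score are assumptions for clarity; the script's actual implementation may differ.

```python
# Hypothetical sketch of the weighted combination described above.
# Names and the health-score inversion are illustrative, not the
# script's actual API.
HEALTH_WEIGHT = 0.60
CAPACITY_WEIGHT = 0.30
RESILIENCE_WEIGHT = 0.10

def replacement_score(health_score, capacity_score, resilience_score):
    """Combine per-factor scores (each 0-100) into a 0-100 replacement score.

    A low health score indicates a failing drive, so it is inverted:
    the worse the health, the higher the replacement priority.
    """
    health_penalty = 100 - health_score
    return (HEALTH_WEIGHT * health_penalty
            + CAPACITY_WEIGHT * capacity_score
            + RESILIENCE_WEIGHT * resilience_score)
```

Under this sketch, an OSD with health 35 (critical), capacity 70, and resilience 20 would score 0.6*65 + 0.3*70 + 0.1*20 = 62, landing in the high-priority (50-69) band described under Output Interpretation below.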
## Scoring Algorithm

### Health Score (60% weight)

**HDD Metrics** ([ceph_osd_analyzer.py:183-236](ceph_osd_analyzer.py#L183-L236); see the sketch after the Typical Workflow below):
- Reallocated sectors (ID 5): -20 points for any presence
- Spin retry count (ID 10): -15 points
- Pending sectors (ID 197): -25 points (critical indicator)
- Uncorrectable sectors (ID 198): -30 points (critical)
- Temperature (ID 190/194): -10 points if >60°C
- Age (ID 9): -15 points if >5 years

**NVMe Metrics** ([ceph_osd_analyzer.py:239-267](ceph_osd_analyzer.py#L239-L267)):
- Available spare: penalized if <50%
- Percentage used: -30 points if >80%
- Media errors: -25 points for any errors
- Temperature: -10 points if >70°C

### Capacity Score (30% weight) ([ceph_osd_analyzer.py:271-311](ceph_osd_analyzer.py#L271-L311))

- **Small drives prioritized**: <2TB = +40 points (maximum capacity gain)
- **Medium drives**: 2-5TB = +30 points, 5-10TB = +15 points
- **High utilization penalty**: >70% = -15 points (migration complexity)
- **Host balance bonus**: +15 points if below host average weight

### Resilience Score (10% weight) ([ceph_osd_analyzer.py:313-357](ceph_osd_analyzer.py#L313-L357))

- Hosts with an OSD count more than 20% above the cluster average: +20 points
- Down OSDs present on the same host: +15 points (possible wider hardware issues)

## Usage Patterns

### One-Line Execution (Recommended)

```bash
sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())" --debug --class hdd
```

**Why**: Always runs the latest version, requires no local installation, and integrates easily into automation.

### Command-Line Options

- `--class [hdd|nvme]`: Filter by device type
- `--min-size N`: Minimum OSD size in TB
- `--debug`: Enable verbose debugging output

### Typical Workflow

1. Run the analysis during a maintenance window
2. Identify the top 3-5 candidates with scores >70
3. Review health issues and capacity gains
4. Plan the replacement based on available hardware
5. Execute OSD out/destroy/replace operations
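As a concrete illustration of the HDD deductions listed under Health Score above, here is a minimal sketch. It assumes SMART raw values keyed by attribute ID and treats attribute 9 as power-on hours; the actual `parse_smart_health()` implementation may organize this differently.

```python
# Illustrative only: the HDD penalty table described above, assuming
# raw SMART values keyed by attribute ID. parse_smart_health() in the
# script may structure this differently.
HDD_PENALTIES = {
    5:   lambda raw: 20 if raw > 0 else 0,             # reallocated sectors
    10:  lambda raw: 15 if raw > 0 else 0,             # spin retry count
    197: lambda raw: 25 if raw > 0 else 0,             # pending sectors (critical)
    198: lambda raw: 30 if raw > 0 else 0,             # uncorrectable sectors (critical)
    194: lambda raw: 10 if raw > 60 else 0,            # temperature in °C (also ID 190)
    9:   lambda raw: 15 if raw > 5 * 365 * 24 else 0,  # power-on hours beyond ~5 years
}

def hdd_health_score(smart_attrs):
    """Start at 100 and subtract a penalty for each concerning attribute."""
    score = 100
    for attr_id, raw_value in smart_attrs.items():
        penalty_fn = HDD_PENALTIES.get(attr_id)
        if penalty_fn:
            score -= penalty_fn(raw_value)
    return max(score, 0)
```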
## Dependencies

### Required Packages

- Python 3.6+ (standard library only, no external dependencies)
- `smartmontools` package (`smartctl` binary)
- SSH access configured between all cluster nodes

### Required Permissions

- Ceph admin keyring access
- `sudo` privileges for SMART data retrieval
- SSH key-based authentication to all OSD hosts

### Ceph Commands Used

- `ceph osd tree -f json`: Cluster topology
- `ceph osd df -f json`: Disk usage statistics
- `ceph osd metadata osd.N -f json`: OSD device information
- `ceph device query-daemon-health-metrics osd.N`: SMART data

## Output Interpretation

### Replacement Score Ranges

- **70-100** (RED): Critical - immediate replacement recommended
- **50-69** (YELLOW): High priority - plan replacement soon
- **30-49**: Medium priority - next upgrade cycle
- **0-29** (GREEN): Low priority - healthy drives

### Health Score Ranges

- **80-100** (GREEN): Excellent condition
- **60-79** (YELLOW): Monitor for issues
- **40-59**: Fair - multiple concerns
- **0-39** (RED): Critical - replace urgently

## Common Issues & Solutions

### "No SMART data available"

- **Cause**: Missing `smartmontools` or insufficient permissions
- **Solution**: `apt install smartmontools` and verify sudo access

### SSH Timeout Errors

- **Cause**: Node unreachable or SSH keys not configured
- **Solution**: Verify connectivity with `ssh -o ConnectTimeout=5 hostname`

### Device Path Resolution Failures

- **Cause**: Non-standard OSD deployment or encryption
- **Solution**: Enable `--debug` to see device resolution attempts

### dm-device Mapping Issues

- **Cause**: LVM or LUKS-encrypted OSDs
- **Solution**: The script automatically resolves these via `lsblk -no pkname`

## Development Notes

### Code Structure

- **Single-file design**: Easier to execute remotely via `exec()`
- **Minimal dependencies**: Uses only the Python standard library
- **Color-coded output**: ANSI escape codes for terminal display
- **Debug mode**: Comprehensive logging when `--debug` is enabled

### Notable Functions

**`run_command()`** ([ceph_osd_analyzer.py:34-56](ceph_osd_analyzer.py#L34-L56)): Universal command executor with SSH support and JSON parsing

**`get_device_path_for_osd()`** ([ceph_osd_analyzer.py:84-122](ceph_osd_analyzer.py#L84-L122)): Complex device resolution logic handling metadata, symlinks, and dm-devices

**`get_smart_data_remote()`** ([ceph_osd_analyzer.py:124-145](ceph_osd_analyzer.py#L124-L145)): Remote SMART data collection with device type detection

**`parse_smart_health()`** ([ceph_osd_analyzer.py:173-269](ceph_osd_analyzer.py#L173-L269)): SMART attribute parsing with device-class-specific logic

### Future Enhancement Opportunities

1. **Parallel data collection**: Use threading for faster cluster-wide analysis (sketched below)
2. **Historical trending**: Track scores over time to predict failures
3. **JSON output mode**: For integration with monitoring systems
4. **Cost-benefit analysis**: Factor in replacement drive costs
5. **PG rebalance impact**: Estimate data movement required
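A possible shape for enhancement 1 above is to fan collection out across a thread pool, since the work is SSH/I/O bound. This sketch assumes each OSD entry carries its host, device path, and id, and that the existing `get_smart_data_remote()` can be called as `get_smart_data_remote(host, device)`; the real function's signature may differ.

```python
# Sketch of parallel SMART collection (future enhancement 1).
# get_smart_data_remote(host, device) is assumed for illustration;
# the real function's signature may differ.
from concurrent.futures import ThreadPoolExecutor

def collect_all_smart_data(osds, max_workers=8):
    """Fan out SSH-bound SMART collection across worker threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(get_smart_data_remote, osd["host"], osd["device"]): osd["id"]
            for osd in osds
        }
        for future, osd_id in futures.items():
            try:
                results[osd_id] = future.result(timeout=60)
            except Exception:
                results[osd_id] = None  # unreachable host, SSH timeout, etc.
    return results
```

Threads (rather than processes) fit here because each worker spends nearly all of its time waiting on SSH round trips, which is the bottleneck noted under Performance Characteristics below.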
## Security Considerations

### Permissions Required

- Root access for `smartctl` execution
- SSH access to all OSD hosts
- Ceph admin keyring (read-only access is sufficient)

### Network Requirements

- The script assumes SSH connectivity between nodes
- No outbound internet access required (internal-only tool)
- Hardcoded internal git server URL: `http://10.10.10.63:3000`

### SSH Configuration

- Uses `-o StrictHostKeyChecking=no` for automated execution
- 5-second connection timeout to handle unreachable nodes
- Assumes key-based authentication is configured

A sketch of this invocation pattern appears at the end of this document.

## Related Infrastructure

**Internal Git Server**: `http://10.10.10.63:3000/LotusGuild/analyzeOSDs`

**Related Projects**:
- hwmonDaemon: Hardware monitoring daemon for continuous health checks
- Other LotusGuild infrastructure automation tools

## Maintenance

### Version Control

- Maintained in the internal git repository
- One-line execution always pulls from the `main` branch
- No formal versioning; the latest commit on `main` is production

### Testing Checklist

- [ ] Test on a cluster with mixed HDD/NVMe OSDs
- [ ] Verify SSH connectivity to all hosts
- [ ] Confirm SMART data retrieval for both device types
- [ ] Validate dm-device resolution on encrypted OSDs
- [ ] Check output formatting with various terminal widths
- [ ] Test `--class` and `--min-size` filtering

## Performance Characteristics

**Execution Time**: ~5-15 seconds per OSD, depending on cluster size and SSH latency

**Bottlenecks**:
- Serial OSD processing (parallelization would help)
- SSH round-trip times for SMART data
- SMART data parsing can be slow for unresponsive drives

**Resource Usage**: Minimal CPU/memory; I/O bound on SSH operations

**Intended Audience**: LotusGuild infrastructure team

**Support**: Submit issues or pull requests to the internal git repository
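For reference, a minimal sketch of the SSH invocation pattern described under SSH Configuration above. `ssh_run()`, the example hostname, and the `smartctl` flags shown in the usage comment are illustrative; the script's actual `run_command()` wrapper may differ.

```python
# Illustrative only: issuing a remote command with the SSH options
# described under "SSH Configuration". The script's run_command()
# may differ in structure and error handling.
import subprocess

SSH_OPTS = ["-o", "StrictHostKeyChecking=no", "-o", "ConnectTimeout=5"]

def ssh_run(host, remote_cmd):
    """Run a command on a remote OSD host; return stdout or None on failure."""
    result = subprocess.run(
        ["ssh"] + SSH_OPTS + [host, remote_cmd],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True, timeout=30,
    )
    return result.stdout if result.returncode == 0 else None

# Hypothetical usage: pull JSON SMART output from a remote host.
# The hostname and device path are examples only.
# output = ssh_run("osd-host-01", "sudo smartctl -a -j /dev/sda")
```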