Ceph OSD Replacement Analyzer - Project Documentation
Project Overview
Purpose: Intelligent analysis tool for identifying optimal Ceph OSD replacement candidates across an entire cluster by analyzing health metrics, capacity optimization potential, and cluster resilience factors.
Type: Python 3 CLI tool for Ceph storage cluster maintenance
Target Users: Storage administrators, DevOps engineers, and infrastructure teams managing Ceph clusters
Architecture
Core Components
- Data Collection Layer (ceph_osd_analyzer.py:34-172)
  - Executes Ceph commands locally and via SSH
  - Retrieves SMART data from all cluster nodes
  - Handles both local `ceph device query-daemon-health-metrics` and remote `smartctl` fallback
  - Device path resolution with dm-device mapping support
- Analysis Engine (ceph_osd_analyzer.py:173-357)
  - SMART health parsing for HDD and NVMe devices
  - Capacity optimization scoring
  - Cluster resilience impact calculation
  - Multi-factor weighted scoring system
- Reporting System (ceph_osd_analyzer.py:361-525)
  - Color-coded console output
  - Top 15 ranked replacement candidates
  - Summary by device class (HDD/NVMe)
  - Per-host analysis breakdown
Key Design Decisions
Remote SMART Data Collection: The script uses SSH to gather SMART data from all cluster nodes, not just the local node. This is critical because OSDs are distributed across multiple physical hosts.
Fallback Strategy: Primary method uses ceph device query-daemon-health-metrics, with automatic fallback to direct smartctl queries via SSH if Ceph's built-in metrics are unavailable.
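A minimal sketch of that fallback flow is below; the function name and parameters are illustrative rather than the script's actual API, and `smartctl`'s JSON mode (`-j`, smartmontools 7+) is assumed for brevity:

```python
import json
import subprocess

def get_smart_metrics(osd_id, osd_host, dev_path):
    """Illustrative fallback: Ceph's built-in health metrics first, then smartctl over SSH."""
    # Primary: the daemon's cached health metrics (output is assumed to be JSON).
    try:
        out = subprocess.run(
            ["ceph", "device", "query-daemon-health-metrics", f"osd.{osd_id}"],
            capture_output=True, text=True, timeout=30, check=True,
        )
        data = json.loads(out.stdout)
        if data:
            return data
    except (subprocess.SubprocessError, ValueError):
        pass  # fall through to the SSH path

    # Fallback: query smartctl directly on the OSD's host.
    ssh_cmd = [
        "ssh", "-o", "StrictHostKeyChecking=no", "-o", "ConnectTimeout=5",
        osd_host, f"sudo smartctl -a -j {dev_path}",
    ]
    out = subprocess.run(ssh_cmd, capture_output=True, text=True, timeout=60)
    return json.loads(out.stdout) if out.returncode == 0 else None
```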
Device Mapping: Handles complex storage configurations including device-mapper devices, resolving them to physical drives using lsblk and symlink resolution.
Weighted Scoring: 60% health, 30% capacity optimization, 10% resilience - prioritizes failing drives while considering operational efficiency.
Scoring Algorithm
Health Score (60% weight)
HDD Metrics (ceph_osd_analyzer.py:183-236):
- Reallocated sectors (ID 5): -20 points for any presence
- Spin retry count (ID 10): -15 points
- Pending sectors (ID 197): -25 points (critical indicator)
- Uncorrectable sectors (ID 198): -30 points (critical)
- Temperature (ID 190/194): -10 points if >60°C
- Age (ID 9): -15 points if >5 years
NVMe Metrics (ceph_osd_analyzer.py:239-267):
- Available spare: penalized if <50%
- Percentage used: -30 points if >80%
- Media errors: -25 points for any errors
- Temperature: -10 points if >70°C
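A hedged sketch of how these penalties could be applied; the attribute handling and the low-spare penalty value are illustrative (the document does not state an exact figure for the latter), and the real logic lives in parse_smart_health():

```python
def score_hdd_health(attrs):
    """Start at 100 and subtract the documented penalties.
    `attrs` is assumed to map SMART attribute ID -> raw value."""
    score = 100
    if attrs.get(5, 0) > 0:        # Reallocated sectors
        score -= 20
    if attrs.get(10, 0) > 0:       # Spin retry count
        score -= 15
    if attrs.get(197, 0) > 0:      # Pending sectors (critical)
        score -= 25
    if attrs.get(198, 0) > 0:      # Uncorrectable sectors (critical)
        score -= 30
    if attrs.get(194, attrs.get(190, 0)) > 60:   # Temperature in °C
        score -= 10
    if attrs.get(9, 0) / (24 * 365) > 5:         # Power-on hours -> years
        score -= 15
    return max(score, 0)

def score_nvme_health(log):
    """`log` is assumed to resemble smartctl's nvme_smart_health_information_log."""
    score = 100
    if log.get("available_spare", 100) < 50:
        score -= 20                # placeholder penalty; exact value not documented above
    if log.get("percentage_used", 0) > 80:
        score -= 30
    if log.get("media_errors", 0) > 0:
        score -= 25
    if log.get("temperature", 0) > 70:
        score -= 10
    return max(score, 0)
```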
Capacity Score (30% weight)
(ceph_osd_analyzer.py:271-311)
- Small drives prioritized: <2TB = +40 points (maximum capacity gain)
- Medium drives: 2-5TB = +30 points, 5-10TB = +15 points
- High utilization penalty: >70% = -15 points (migration complexity)
- Host balance bonus: +15 points if below host average weight
Resilience Score (10% weight)
(ceph_osd_analyzer.py:313-357)
- Hosts with >20% above average OSD count: +20 points
- Presence of down OSDs on same host: +15 points (hardware issues)
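One plausible way to combine the three components into the final replacement score, assuming each is on a roughly 0-100 scale and that the health contribution is inverted so failing drives rank highest (the script's exact formula may differ):

```python
def replacement_score(health, capacity, resilience):
    """Weighted replacement score: 60% (inverted) health, 30% capacity, 10% resilience."""
    return (100 - health) * 0.60 + capacity * 0.30 + resilience * 0.10

# Example: health 45 (pending sectors), capacity 55 (small drive on an
# over-weighted host), resilience 35 (crowded host with a down OSD):
# (100 - 45) * 0.6 + 55 * 0.3 + 35 * 0.1 = 33 + 16.5 + 3.5 = 53 -> high priority
```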
Usage Patterns
One-Line Execution (Recommended)
sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())" --debug --class hdd
Why: Always uses latest version, no local installation, integrates easily into automation.
Command-Line Options
- `--class [hdd|nvme]`: Filter by device type
- `--min-size N`: Minimum OSD size in TB
- `--debug`: Enable verbose debugging output
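A minimal argparse sketch that matches these flags (defaults and help text are assumptions, not copied from the script):

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Identify optimal Ceph OSD replacement candidates")
    # `--class` is a Python keyword, so the parsed value is stored as `device_class`.
    parser.add_argument("--class", dest="device_class", choices=["hdd", "nvme"],
                        help="Filter by device type")
    parser.add_argument("--min-size", type=float, metavar="N",
                        help="Minimum OSD size in TB")
    parser.add_argument("--debug", action="store_true",
                        help="Enable verbose debugging output")
    return parser.parse_args()
```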
Typical Workflow
- Run analysis during maintenance window
- Identify top 3-5 candidates with scores >70
- Review health issues and capacity gains
- Plan replacement based on available hardware
- Execute OSD out/destroy/replace operations
Dependencies
Required Packages
- Python 3.6+ (standard library only, no external dependencies)
- `smartmontools` package (`smartctl` binary)
- SSH access configured between all cluster nodes
Required Permissions
- Ceph admin keyring access
- `sudo` privileges for SMART data retrieval
- SSH key-based authentication to all OSD hosts
Ceph Commands Used
- `ceph osd tree -f json`: Cluster topology
- `ceph osd df -f json`: Disk usage statistics
- `ceph osd metadata osd.N -f json`: OSD device information
- `ceph device query-daemon-health-metrics osd.N`: SMART data
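For illustration, the topology and usage commands can be run and parsed as below; the field names follow the usual `ceph ... -f json` layout but may vary between Ceph releases:

```python
import json
import subprocess

def ceph_json(*args):
    """Run a ceph subcommand with JSON output and return the parsed result."""
    out = subprocess.run(["ceph", *args, "-f", "json"],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

osd_tree = ceph_json("osd", "tree")
osd_ids = [n["id"] for n in osd_tree["nodes"] if n["type"] == "osd"]

osd_df = ceph_json("osd", "df")
utilization = {n["name"]: n["utilization"] for n in osd_df["nodes"]}
```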
Output Interpretation
Replacement Score Ranges
- 70-100 (RED): Critical - immediate replacement recommended
- 50-69 (YELLOW): High priority - plan replacement soon
- 30-49: Medium priority - next upgrade cycle
- 0-29 (GREEN): Low priority - healthy drives
Health Score Ranges
- 80-100 (GREEN): Excellent condition
- 60-79 (YELLOW): Monitor for issues
- 40-59: Fair - multiple concerns
- 0-39 (RED): Critical - replace urgently
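These bands map directly onto a small classification helper (the thresholds come from the tables above; the ANSI color constants and function name are illustrative):

```python
RED, YELLOW, GREEN, RESET = "\033[91m", "\033[93m", "\033[92m", "\033[0m"

def classify_replacement_score(score):
    """Translate a replacement score into the color-coded priority band."""
    if score >= 70:
        return f"{RED}Critical - immediate replacement recommended{RESET}"
    if score >= 50:
        return f"{YELLOW}High priority - plan replacement soon{RESET}"
    if score >= 30:
        return "Medium priority - next upgrade cycle"
    return f"{GREEN}Low priority - healthy{RESET}"
```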
Common Issues & Solutions
"No SMART data available"
- Cause: Missing `smartmontools` or insufficient permissions
- Solution: `apt install smartmontools` and verify sudo access
SSH Timeout Errors
- Cause: Node unreachable or SSH keys not configured
- Solution: Verify connectivity with `ssh -o ConnectTimeout=5 <host> hostname`
Device Path Resolution Failures
- Cause: Non-standard OSD deployment or encryption
- Solution: Enable `--debug` to see device resolution attempts
dm-device Mapping Issues
- Cause: LVM or LUKS encrypted OSDs
- Solution: Script automatically resolves via `lsblk -no pkname`
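A sketch of that resolution step for the local host; the helper name is illustrative, and the script's get_device_path_for_osd() does more (metadata parsing, symlinks, remote execution):

```python
import os
import subprocess

def resolve_physical_device(dev_path):
    """Map a /dev/mapper or dm-* path back to its parent block device."""
    real = os.path.realpath(dev_path)                     # e.g. /dev/mapper/x -> /dev/dm-3
    out = subprocess.run(["lsblk", "-no", "pkname", real],
                         capture_output=True, text=True)
    lines = out.stdout.strip().splitlines()
    parent = lines[0] if lines else ""
    return f"/dev/{parent}" if parent else dev_path       # fall back to the original path
```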
Development Notes
Code Structure
- Single-file design: Easier to execute remotely via `exec()`
- Minimal dependencies: Uses only the Python standard library
- Color-coded output: ANSI escape codes for terminal display
- Debug mode: Comprehensive logging when `--debug` is enabled
Notable Functions
run_command() (ceph_osd_analyzer.py:34-56): Universal command executor with SSH support and JSON parsing
get_device_path_for_osd() (ceph_osd_analyzer.py:84-122): Complex device resolution logic handling metadata, symlinks, and dm-devices
get_smart_data_remote() (ceph_osd_analyzer.py:124-145): Remote SMART data collection with device type detection
parse_smart_health() (ceph_osd_analyzer.py:173-269): SMART attribute parsing with device-class-specific logic
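Tying these together, the device-class dispatch described for parse_smart_health() might look roughly like the sketch below, which assumes smartctl-style JSON input and reuses the illustrative scoring helpers from the Scoring Algorithm section; the real function also reports a fuller list of issues:

```python
def parse_smart_health(smart, device_class):
    """Route SMART data to HDD or NVMe scoring. Returns (health_score, issues)."""
    issues = []
    if device_class == "nvme":
        log = smart.get("nvme_smart_health_information_log", {})
        score = score_nvme_health(log)
        if log.get("media_errors", 0) > 0:
            issues.append(f"{log['media_errors']} media errors")
    else:
        # smartctl -j lists ATA attributes under ata_smart_attributes.table
        attrs = {a["id"]: a["raw"]["value"]
                 for a in smart.get("ata_smart_attributes", {}).get("table", [])}
        score = score_hdd_health(attrs)
        if attrs.get(197, 0) > 0:
            issues.append(f"{attrs[197]} pending sectors")
    return score, issues
```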
Future Enhancement Opportunities
- Parallel data collection: Use threading for faster cluster-wide analysis (see the sketch after this list)
- Historical trending: Track scores over time to predict failures
- JSON output mode: For integration with monitoring systems
- Cost-benefit analysis: Factor in replacement drive costs
- PG rebalance impact: Estimate data movement required
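For the parallel-collection idea above, a concurrent.futures sketch; `collect_osd_data()` is a hypothetical stand-in for the script's per-OSD collection step:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def collect_all(osd_ids, collect_osd_data, max_workers=8):
    """Gather per-OSD data concurrently; SSH round-trips dominate, so threads help."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(collect_osd_data, osd_id): osd_id for osd_id in osd_ids}
        for fut in as_completed(futures):
            osd_id = futures[fut]
            try:
                results[osd_id] = fut.result()
            except Exception:
                results[osd_id] = None  # one unreachable host should not abort the run
    return results
```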
Security Considerations
Permissions Required
- Root access for `smartctl` execution
- SSH access to all OSD hosts
- Ceph admin keyring (read-only sufficient)
Network Requirements
- Script assumes SSH connectivity between nodes
- No outbound internet access required (internal-only tool)
- Hardcoded internal git server URL: http://10.10.10.63:3000
SSH Configuration
- Uses `-o StrictHostKeyChecking=no` for automated execution
- 5-second connection timeout to handle unreachable nodes
- Assumes key-based authentication is configured
Related Infrastructure
Internal Git Server: http://10.10.10.63:3000/LotusGuild/analyzeOSDs
Related Projects:
- hwmonDaemon: Hardware monitoring daemon for continuous health checks
- Other LotusGuild infrastructure automation tools
Maintenance
Version Control
- Maintained in internal git repository
- One-line execution always pulls from the `main` branch
- No formal versioning; the latest commit is production
Testing Checklist
- Test on cluster with mixed HDD/NVMe OSDs
- Verify SSH connectivity to all hosts
- Confirm SMART data retrieval for both device types
- Validate dm-device resolution on encrypted OSDs
- Check output formatting with various terminal widths
- Test `--class` and `--min-size` filtering
Performance Characteristics
Execution Time: ~5-15 seconds per OSD depending on cluster size and SSH latency
Bottlenecks:
- Serial OSD processing (parallelization would help)
- SSH round-trip times for SMART data
- SMART queries can be slow for unresponsive or failing drives
Resource Usage: Minimal CPU/memory, I/O bound on SSH operations
Intended Audience: LotusGuild infrastructure team
Support: Submit issues or pull requests to internal git repository