# Ceph OSD Replacement Analyzer - Project Documentation

## Project Overview

**Purpose**: Intelligent analysis tool for identifying optimal Ceph OSD replacement candidates across an entire cluster by analyzing health metrics, capacity optimization potential, and cluster resilience factors.

**Type**: Python 3 CLI tool for Ceph storage cluster maintenance

**Target Users**: Storage administrators, DevOps engineers, and infrastructure teams managing Ceph clusters

## Architecture

### Core Components

1. **Data Collection Layer** ([ceph_osd_analyzer.py:34-172](ceph_osd_analyzer.py#L34-L172))
   - Executes Ceph commands locally and via SSH
   - Retrieves SMART data from all cluster nodes
   - Handles both local `ceph device query-daemon-health-metrics` and remote `smartctl` fallback
   - Device path resolution with dm-device mapping support

2. **Analysis Engine** ([ceph_osd_analyzer.py:173-357](ceph_osd_analyzer.py#L173-L357))
   - SMART health parsing for HDD and NVMe devices
   - Capacity optimization scoring
   - Cluster resilience impact calculation
   - Multi-factor weighted scoring system

3. **Reporting System** ([ceph_osd_analyzer.py:361-525](ceph_osd_analyzer.py#L361-L525))
   - Color-coded console output
   - Top 15 ranked replacement candidates
   - Summary by device class (HDD/NVMe)
   - Per-host analysis breakdown

### Key Design Decisions

**Remote SMART Data Collection**: The script uses SSH to gather SMART data from all cluster nodes, not just the local node. This is critical because OSDs are distributed across multiple physical hosts.

**Fallback Strategy**: Primary method uses `ceph device query-daemon-health-metrics`, with automatic fallback to direct `smartctl` queries via SSH if Ceph's built-in metrics are unavailable.
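
The fallback works roughly as follows; this is a minimal sketch, not the script's actual code (the function name, arguments, and exact `smartctl` invocation are illustrative):

```python
import json
import subprocess

def get_smart_data(osd_id, host, dev_path):
    """Illustrative fallback: prefer Ceph's built-in metrics, else smartctl over SSH."""
    # 1) Ask Ceph for the daemon's health metrics (includes SMART data when available).
    try:
        out = subprocess.run(
            ["ceph", "device", "query-daemon-health-metrics", f"osd.{osd_id}"],
            capture_output=True, text=True, timeout=30)
        if out.returncode == 0 and out.stdout.strip():
            return json.loads(out.stdout)
    except (subprocess.TimeoutExpired, json.JSONDecodeError):
        pass

    # 2) Fall back to smartctl (JSON output) on the OSD's host via SSH.
    ssh_cmd = ["ssh", "-o", "StrictHostKeyChecking=no", "-o", "ConnectTimeout=5",
               host, f"sudo smartctl -a -j {dev_path}"]
    try:
        out = subprocess.run(ssh_cmd, capture_output=True, text=True, timeout=30)
        return json.loads(out.stdout) if out.stdout.strip() else None
    except (subprocess.TimeoutExpired, json.JSONDecodeError):
        return None
```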

**Device Mapping**: Handles complex storage configurations including device-mapper devices, resolving them to physical drives using `lsblk` and symlink resolution.
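
A sketch of how a dm/LVM path can be mapped back to a physical disk with `lsblk -no pkname` (the helper name and SSH handling are illustrative, not the script's actual implementation):

```python
import os
import subprocess

def resolve_physical_device(dev_path, host=None):
    """Illustrative: map a /dev/mapper or /dev/dm-* path back to its parent disk."""
    # Resolve symlinks such as /dev/mapper/ceph--... -> /dev/dm-3 (local case only).
    if host is None:
        dev_path = os.path.realpath(dev_path)

    # lsblk's pkname column reports the parent kernel device of a mapped node.
    cmd = ["lsblk", "-no", "pkname", dev_path]
    if host is not None:
        cmd = ["ssh", "-o", "ConnectTimeout=5", host, " ".join(cmd)]
    out = subprocess.run(cmd, capture_output=True, text=True)
    lines = out.stdout.strip().splitlines()
    # An LVM volume can span several disks; the first parent is used here.
    return f"/dev/{lines[0].strip()}" if lines else dev_path
```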

**Weighted Scoring**: 60% health, 30% capacity optimization, 10% resilience. This weighting prioritizes failing drives while still accounting for operational efficiency.
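
Concretely, the final replacement score is a weighted sum of the three sub-scores. A minimal sketch, assuming each component is already normalized to a 0-100 scale where higher means "stronger reason to replace":

```python
def replacement_score(health_penalty, capacity_gain, resilience_gain):
    """Weighted sum of the three components (each assumed on a 0-100 scale)."""
    return 0.60 * health_penalty + 0.30 * capacity_gain + 0.10 * resilience_gain

# Example: a drive with severe health issues but little capacity upside
# still ranks high because health dominates the weighting.
print(replacement_score(health_penalty=90, capacity_gain=20, resilience_gain=10))  # 61.0
```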

## Scoring Algorithm

### Health Score (60% weight)

**HDD Metrics** ([ceph_osd_analyzer.py:183-236](ceph_osd_analyzer.py#L183-L236)):

- Reallocated sectors (ID 5): -20 points for any presence
- Spin retry count (ID 10): -15 points
- Pending sectors (ID 197): -25 points (critical indicator)
- Uncorrectable sectors (ID 198): -30 points (critical)
- Temperature (ID 190/194): -10 points if >60°C
- Age (ID 9): -15 points if >5 years
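
A sketch of how these penalties can be applied to `smartctl` JSON output; the attribute-table access assumes smartmontools' `-j` format, and the function name is illustrative:

```python
def hdd_health_score(smart):
    """Sketch: apply the documented HDD penalties to smartctl JSON output."""
    # Index the ATA attribute table by attribute ID for easy lookup.
    attrs = {a["id"]: a["raw"]["value"]
             for a in smart.get("ata_smart_attributes", {}).get("table", [])}
    score = 100
    if attrs.get(5, 0) > 0:        # Reallocated_Sector_Ct
        score -= 20
    if attrs.get(10, 0) > 0:       # Spin_Retry_Count
        score -= 15
    if attrs.get(197, 0) > 0:      # Current_Pending_Sector
        score -= 25
    if attrs.get(198, 0) > 0:      # Offline_Uncorrectable
        score -= 30
    if attrs.get(194, attrs.get(190, 0)) > 60:   # Temperature_Celsius
        score -= 10
    if attrs.get(9, 0) > 5 * 365 * 24:           # Power_On_Hours beyond ~5 years
        score -= 15
    return max(score, 0)
```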

**NVMe Metrics** ([ceph_osd_analyzer.py:239-267](ceph_osd_analyzer.py#L239-L267)):

- Available spare: penalized if <50%
- Percentage used: -30 points if >80%
- Media errors: -25 points for any errors
- Temperature: -10 points if >70°C
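
The NVMe path reads `nvme_smart_health_information_log` from smartctl's JSON output instead of the ATA attribute table; another sketch (the -20 point spare penalty is a placeholder, since the exact value is not documented above):

```python
def nvme_health_score(smart):
    """Sketch: NVMe penalties from smartctl's nvme_smart_health_information_log."""
    log = smart.get("nvme_smart_health_information_log", {})
    score = 100
    if log.get("available_spare", 100) < 50:
        score -= 20                    # penalty value is illustrative
    if log.get("percentage_used", 0) > 80:
        score -= 30
    if log.get("media_errors", 0) > 0:
        score -= 25
    if log.get("temperature", 0) > 70:
        score -= 10
    return max(score, 0)
```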

### Capacity Score (30% weight)

([ceph_osd_analyzer.py:271-311](ceph_osd_analyzer.py#L271-L311))

- **Small drives prioritized**: <2TB = +40 points (maximum capacity gain)
- **Medium drives**: 2-5TB = +30 points, 5-10TB = +15 points
- **High utilization penalty**: >70% = -15 points (migration complexity)
- **Host balance bonus**: +15 points if below host average weight
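
A sketch of the capacity component under the same assumptions (names are illustrative):

```python
def capacity_score(size_tb, utilization_pct, below_host_avg_weight):
    """Sketch: reward replacing small and under-weighted drives, penalize full ones."""
    score = 0
    if size_tb < 2:
        score += 40          # biggest capacity gain from swapping the smallest drives
    elif size_tb < 5:
        score += 30
    elif size_tb < 10:
        score += 15
    if utilization_pct > 70:
        score -= 15          # more data to migrate before the swap
    if below_host_avg_weight:
        score += 15
    return score
```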

### Resilience Score (10% weight)

([ceph_osd_analyzer.py:313-357](ceph_osd_analyzer.py#L313-L357))

- Hosts with >20% above average OSD count: +20 points
- Presence of down OSDs on same host: +15 points (hardware issues)
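
And a corresponding sketch for the resilience component (inputs are assumed to be precomputed per host):

```python
def resilience_score(host_osd_count, cluster_avg_osd_count, host_has_down_osds):
    """Sketch: prefer pulling OSDs from crowded hosts and hosts showing hardware trouble."""
    score = 0
    if host_osd_count > 1.2 * cluster_avg_osd_count:
        score += 20
    if host_has_down_osds:
        score += 15
    return score
```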

## Usage Patterns

### One-Line Execution (Recommended)

```bash
sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())" --debug --class hdd
```

**Why**: Always runs the latest version, requires no local installation, and integrates easily into automation.

### Command-Line Options

- `--class [hdd|nvme]`: Filter by device type
- `--min-size N`: Minimum OSD size in TB
- `--debug`: Enable verbose debugging output
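
A minimal `argparse` setup matching these flags could look like this (defaults are assumptions, not taken from the script):

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Rank Ceph OSDs by replacement priority")
    parser.add_argument("--class", dest="device_class", choices=["hdd", "nvme"],
                        help="Filter by device type")
    parser.add_argument("--min-size", type=float, default=0.0, metavar="N",
                        help="Minimum OSD size in TB")
    parser.add_argument("--debug", action="store_true",
                        help="Enable verbose debugging output")
    return parser.parse_args()
```

Note that `class` is a reserved word in Python, so the `--class` option needs an explicit `dest`.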

### Typical Workflow

1. Run analysis during maintenance window
2. Identify top 3-5 candidates with scores >70
3. Review health issues and capacity gains
4. Plan replacement based on available hardware
5. Execute OSD out/destroy/replace operations

## Dependencies

### Required Packages

- Python 3.6+ (standard library only, no external dependencies)
- `smartmontools` package (`smartctl` binary)
- SSH access configured between all cluster nodes

### Required Permissions

- Ceph admin keyring access
- `sudo` privileges for SMART data retrieval
- SSH key-based authentication to all OSD hosts

### Ceph Commands Used

- `ceph osd tree -f json`: Cluster topology
- `ceph osd df -f json`: Disk usage statistics
- `ceph osd metadata osd.N -f json`: OSD device information
- `ceph device query-daemon-health-metrics osd.N`: SMART data
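
The script funnels these calls through a single executor (`run_command()`, described under Notable Functions below); a simplified sketch of that pattern, not the actual implementation:

```python
import json
import subprocess

def run_command(cmd, host=None, parse_json=True, timeout=30):
    """Run a command locally or via SSH and optionally decode its JSON output (sketch)."""
    if host:
        # Remote execution mirrors the local call through SSH.
        cmd = ["ssh", "-o", "StrictHostKeyChecking=no", "-o", "ConnectTimeout=5",
               host, " ".join(cmd)]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        return None
    return json.loads(result.stdout) if parse_json else result.stdout

# Example: fetch the cluster topology as a Python dict.
osd_tree = run_command(["ceph", "osd", "tree", "-f", "json"])
```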

## Output Interpretation

### Replacement Score Ranges

- **70-100** (RED): Critical - immediate replacement recommended
- **50-69** (YELLOW): High priority - plan replacement soon
- **30-49**: Medium priority - next upgrade cycle
- **0-29** (GREEN): Low priority - healthy drives

### Health Score Ranges

- **80-100** (GREEN): Excellent condition
- **60-79** (YELLOW): Monitor for issues
- **40-59**: Fair - multiple concerns
- **0-39** (RED): Critical - replace urgently
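
These thresholds also drive the color coding in the console report; a sketch of that mapping using standard ANSI escape codes (leaving the middle band uncolored is an assumption):

```python
RED, YELLOW, GREEN, RESET = "\033[91m", "\033[93m", "\033[92m", "\033[0m"

def colorize_replacement_score(score):
    """Wrap a replacement score in the ANSI color matching the ranges above."""
    if score >= 70:
        color = RED
    elif score >= 50:
        color = YELLOW
    elif score >= 30:
        color = ""       # medium priority prints uncolored
    else:
        color = GREEN
    return f"{color}{score:.1f}{RESET}"
```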

## Common Issues & Solutions

### "No SMART data available"

- **Cause**: Missing `smartmontools` or insufficient permissions
- **Solution**: `apt install smartmontools` and verify sudo access

### SSH Timeout Errors

- **Cause**: Node unreachable or SSH keys not configured
- **Solution**: Verify connectivity with `ssh -o ConnectTimeout=5 <host> hostname`

### Device Path Resolution Failures

- **Cause**: Non-standard OSD deployment or encryption
- **Solution**: Enable `--debug` to see device resolution attempts

### dm-device Mapping Issues

- **Cause**: LVM or LUKS encrypted OSDs
- **Solution**: Script automatically resolves via `lsblk -no pkname`

## Development Notes

### Code Structure

- **Single file design**: Easier to execute remotely via `exec()`
- **Minimal dependencies**: Uses only Python standard library
- **Color-coded output**: ANSI escape codes for terminal display
- **Debug mode**: Comprehensive logging when `--debug` enabled

### Notable Functions

**`run_command()`** ([ceph_osd_analyzer.py:34-56](ceph_osd_analyzer.py#L34-L56)): Universal command executor with SSH support and JSON parsing

**`get_device_path_for_osd()`** ([ceph_osd_analyzer.py:84-122](ceph_osd_analyzer.py#L84-L122)): Complex device resolution logic handling metadata, symlinks, and dm-devices

**`get_smart_data_remote()`** ([ceph_osd_analyzer.py:124-145](ceph_osd_analyzer.py#L124-L145)): Remote SMART data collection with device type detection

**`parse_smart_health()`** ([ceph_osd_analyzer.py:173-269](ceph_osd_analyzer.py#L173-L269)): SMART attribute parsing with device-class-specific logic

### Future Enhancement Opportunities

1. **Parallel data collection**: Use threading for faster cluster-wide analysis (see the sketch after this list)
2. **Historical trending**: Track scores over time to predict failures
3. **JSON output mode**: For integration with monitoring systems
4. **Cost-benefit analysis**: Factor in replacement drive costs
5. **PG rebalance impact**: Estimate data movement required
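
For item 1, a sketch of what parallel collection could look like with the standard library's `ThreadPoolExecutor` (`get_smart_data` refers to the collection sketch earlier in this document; the tuple layout is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor

def collect_all_smart(osds, max_workers=8):
    """Sketch of enhancement #1: gather SMART data for many OSDs concurrently.

    `osds` is assumed to be a list of (osd_id, host, dev_path) tuples and
    `get_smart_data` the fallback collector sketched earlier.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(get_smart_data, o, h, d): o for o, h, d in osds}
        for future, osd_id in futures.items():
            results[osd_id] = future.result()  # blocks until that OSD finishes
    return results
```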

## Security Considerations

### Permissions Required

- Root access for `smartctl` execution
- SSH access to all OSD hosts
- Ceph admin keyring (read-only sufficient)

### Network Requirements

- Script assumes SSH connectivity between nodes
- No outbound internet access required (internal-only tool)
- Hardcoded internal git server URL: `http://10.10.10.63:3000`

### SSH Configuration

- Uses `-o StrictHostKeyChecking=no` for automated execution
- 5-second connection timeout to handle unreachable nodes
- Assumes key-based authentication is configured

## Related Infrastructure

**Internal Git Server**: http://10.10.10.63:3000/LotusGuild/analyzeOSDs

**Related Projects**:

- hwmonDaemon: Hardware monitoring daemon for continuous health checks
- Other LotusGuild infrastructure automation tools

## Maintenance

### Version Control

- Maintained in internal git repository
- One-line execution always pulls from `main` branch
- No formal versioning; latest commit is production

### Testing Checklist

- [ ] Test on cluster with mixed HDD/NVMe OSDs
- [ ] Verify SSH connectivity to all hosts
- [ ] Confirm SMART data retrieval for both device types
- [ ] Validate dm-device resolution on encrypted OSDs
- [ ] Check output formatting with various terminal widths
- [ ] Test `--class` and `--min-size` filtering

## Performance Characteristics

**Execution Time**: ~5-15 seconds per OSD, depending on cluster size and SSH latency

**Bottlenecks**:

- Serial OSD processing (parallelization would help)
- SSH round-trip times for SMART data
- SMART data parsing can be slow for unresponsive drives

**Resource Usage**: Minimal CPU/memory; I/O bound on SSH operations

**Intended Audience**: LotusGuild infrastructure team

**Support**: Submit issues or pull requests to the internal git repository