Compare commits
8 Commits
89037ed93f...main
| Author | SHA1 | Date |
|---|---|---|
| | 2ffcb79f19 | |
| | 1b92552339 | |
| | 03374fa784 | |
| | 3d498a4092 | |
| | 35a16a1793 | |
| | 1848b71c2a | |
| | 3b15377821 | |
| | c315fa3efc | |
230
Claude.md
Normal file
@@ -0,0 +1,230 @@
# Ceph OSD Replacement Analyzer - Project Documentation

## Project Overview

**Purpose**: Intelligent analysis tool for identifying optimal Ceph OSD replacement candidates across an entire cluster by analyzing health metrics, capacity optimization potential, and cluster resilience factors.

**Type**: Python 3 CLI tool for Ceph storage cluster maintenance

**Target Users**: Storage administrators, DevOps engineers, and infrastructure teams managing Ceph clusters

## Architecture

### Core Components

1. **Data Collection Layer** ([ceph_osd_analyzer.py:34-172](ceph_osd_analyzer.py#L34-L172))
   - Executes Ceph commands locally and via SSH
   - Retrieves SMART data from all cluster nodes
   - Handles both local `ceph device query-daemon-health-metrics` and remote `smartctl` fallback
   - Device path resolution with dm-device mapping support

2. **Analysis Engine** ([ceph_osd_analyzer.py:173-357](ceph_osd_analyzer.py#L173-L357))
   - SMART health parsing for HDD and NVMe devices
   - Capacity optimization scoring
   - Cluster resilience impact calculation
   - Multi-factor weighted scoring system

3. **Reporting System** ([ceph_osd_analyzer.py:361-525](ceph_osd_analyzer.py#L361-L525))
   - Color-coded console output
   - Top 15 ranked replacement candidates
   - Summary by device class (HDD/NVMe)
   - Per-host analysis breakdown

### Key Design Decisions

**Remote SMART Data Collection**: The script uses SSH to gather SMART data from all cluster nodes, not just the local node. This is critical because OSDs are distributed across multiple physical hosts.

**Fallback Strategy**: Primary method uses `ceph device query-daemon-health-metrics`, with automatic fallback to direct `smartctl` queries via SSH if Ceph's built-in metrics are unavailable.

**Device Mapping**: Handles complex storage configurations including device-mapper devices, resolving them to physical drives using `lsblk` and symlink resolution.

**Weighted Scoring**: 60% health, 30% capacity optimization, 10% resilience - prioritizes failing drives while considering operational efficiency.

## Scoring Algorithm

### Health Score (60% weight)

**HDD Metrics** ([ceph_osd_analyzer.py:183-236](ceph_osd_analyzer.py#L183-L236)):
- Reallocated sectors (ID 5): -20 points for any presence
- Spin retry count (ID 10): -15 points
- Pending sectors (ID 197): -25 points (critical indicator)
- Uncorrectable sectors (ID 198): -30 points (critical)
- Temperature (ID 190/194): -10 points if >60°C
- Age (ID 9): -15 points if >5 years

**NVMe Metrics** ([ceph_osd_analyzer.py:239-267](ceph_osd_analyzer.py#L239-L267)):
- Available spare: penalized if <50%
- Percentage used: -30 points if >80%
- Media errors: -25 points for any errors
- Temperature: -10 points if >70°C

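The HDD deductions listed above can be sketched as a small function. This is an illustration only, not the script's implementation: it treats every listed value as a flat deduction (the script may scale penalties with the raw count), and the function name `hdd_health_score` and the input shape (a dict of raw values keyed by SMART attribute ID) are assumptions made for the example.

```python
def hdd_health_score(raw):
    """Score an HDD from raw SMART values keyed by attribute ID (sketch)."""
    score = 100
    issues = []

    # (attribute ID, label, flat deduction) as listed in the HDD metrics above
    sector_checks = [
        (5, "Reallocated sectors", 20),
        (10, "Spin retry count", 15),
        (197, "Pending sectors", 25),
        (198, "Uncorrectable sectors", 30),
    ]
    for attr_id, label, penalty in sector_checks:
        count = raw.get(attr_id, 0)
        if count > 0:
            score -= penalty
            issues.append(f"{label}: {count}")

    # Temperature may be reported under ID 194 or ID 190
    temp_c = raw.get(194, raw.get(190, 0))
    if temp_c > 60:
        score -= 10
        issues.append(f"High temperature: {temp_c}C")

    # ID 9 is power-on hours; five years is 43,800 hours
    hours = raw.get(9, 0)
    if hours > 5 * 365 * 24:
        score -= 15
        issues.append(f"Drive age: {hours / (365 * 24):.1f} years")

    # Clamp so that stacked deductions never go below zero
    return max(0, score), issues


score, issues = hdd_health_score({5: 3, 197: 1, 194: 41, 9: 20000})
print(score, issues)
```

With the sample input, only the reallocated and pending sector deductions apply, so the drive scores 100 - 20 - 25 = 55.
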
### Capacity Score (30% weight)

([ceph_osd_analyzer.py:271-311](ceph_osd_analyzer.py#L271-L311))

- **Small drives prioritized**: <2TB = +40 points (maximum capacity gain)
- **Medium drives**: 2-5TB = +30 points, 5-10TB = +15 points
- **High utilization penalty**: >70% = -15 points (migration complexity)
- **Host balance bonus**: +15 points if below host average weight

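A minimal sketch of the capacity tiers above. The function name `capacity_score` and its parameters are hypothetical, and the handling of exact boundaries (for example, whether a 5 TB drive counts as 2-5TB or 5-10TB) is an assumption, since the list does not say.

```python
def capacity_score(size_tb, utilization_pct, below_host_average):
    """Capacity-optimization points for one OSD (sketch of the tiers above)."""
    score = 0

    # Smaller drives gain the most capacity when replaced
    if size_tb < 2:
        score += 40
    elif size_tb < 5:
        score += 30
    elif size_tb < 10:
        score += 15

    # Heavily used drives are harder to migrate
    if utilization_pct > 70:
        score -= 15

    # Replacing an undersized OSD helps balance the host
    if below_host_average:
        score += 15

    return score


print(capacity_score(1.8, 50, True))   # small, lightly used, below host average
print(capacity_score(4.0, 80, False))  # medium, heavily used
```
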
### Resilience Score (10% weight)

([ceph_osd_analyzer.py:313-357](ceph_osd_analyzer.py#L313-L357))

- Hosts with >20% above average OSD count: +20 points
- Presence of down OSDs on same host: +15 points (hardware issues)

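To show how the three components combine under the 60/30/10 weighting described in this document, here is a hedged sketch. The function names and parameters are assumptions for illustration; the formula inverts the health score so that an unhealthy drive contributes more to the replacement score.

```python
def resilience_score(host_osd_count, avg_osds_per_host, down_osds_on_host):
    """Resilience points for the host an OSD lives on (sketch)."""
    score = 0
    # Host carries more than 20% above the cluster's average OSD count
    if host_osd_count > avg_osds_per_host * 1.2:
        score += 20
    # Down OSDs on the same host hint at shared hardware problems
    if down_osds_on_host > 0:
        score += 15
    return score


def replacement_score(health, capacity, resilience):
    """Weighted total: 60% inverted health, 30% capacity, 10% resilience."""
    return (100 - health) * 0.60 + capacity * 0.30 + resilience * 0.10


# Worked example: health 40, capacity 55, resilience 20
# (100 - 40) * 0.60 + 55 * 0.30 + 20 * 0.10 = 36 + 16.5 + 2 = 54.5
print(replacement_score(40, 55, resilience_score(14, 10, 0)))
```
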
## Usage Patterns

### One-Line Execution (Recommended)

```bash
sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())" --debug --class hdd
```

**Why**: Always uses latest version, no local installation, integrates easily into automation.

### Command-Line Options

- `--class [hdd|nvme]`: Filter by device type
- `--min-size N`: Minimum OSD size in TB
- `--debug`: Enable verbose debugging output

### Typical Workflow

1. Run analysis during maintenance window
2. Identify top 3-5 candidates with scores >70
3. Review health issues and capacity gains
4. Plan replacement based on available hardware
5. Execute OSD out/destroy/replace operations

## Dependencies

### Required Packages
- Python 3.6+ (standard library only, no external dependencies)
- `smartmontools` package (`smartctl` binary)
- SSH access configured between all cluster nodes

### Required Permissions
- Ceph admin keyring access
- `sudo` privileges for SMART data retrieval
- SSH key-based authentication to all OSD hosts

### Ceph Commands Used
- `ceph osd tree -f json`: Cluster topology
- `ceph osd df -f json`: Disk usage statistics
- `ceph osd metadata osd.N -f json`: OSD device information
- `ceph device query-daemon-health-metrics osd.N`: SMART data

## Output Interpretation

### Replacement Score Ranges
- **70-100** (RED): Critical - immediate replacement recommended
- **50-69** (YELLOW): High priority - plan replacement soon
- **30-49**: Medium priority - next upgrade cycle
- **0-29** (GREEN): Low priority - healthy drives

### Health Score Ranges
- **80-100** (GREEN): Excellent condition
- **60-79** (YELLOW): Monitor for issues
- **40-59**: Fair - multiple concerns
- **0-39** (RED): Critical - replace urgently

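The replacement score bands above map naturally onto a lookup like the following sketch. The function name is hypothetical, and treating each band's lower bound as an inclusive threshold (so that a fractional score such as 69.5 falls into the 50-69 band) is an assumption.

```python
def replacement_priority(score):
    """Map a replacement score to (priority label, display color or None)."""
    if score >= 70:
        return "Critical", "RED"
    if score >= 50:
        return "High", "YELLOW"
    if score >= 30:
        return "Medium", None  # this band has no color listed
    return "Low", "GREEN"


for sample in (85, 54.5, 30, 12):
    print(sample, replacement_priority(sample))
```
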
## Common Issues & Solutions

### "No SMART data available"
- **Cause**: Missing `smartmontools` or insufficient permissions
- **Solution**: `apt install smartmontools` and verify sudo access

### SSH Timeout Errors
- **Cause**: Node unreachable or SSH keys not configured
- **Solution**: Verify connectivity with `ssh -o ConnectTimeout=5 <host> hostname`

### Device Path Resolution Failures
- **Cause**: Non-standard OSD deployment or encryption
- **Solution**: Enable `--debug` to see device resolution attempts

### dm-device Mapping Issues
- **Cause**: LVM or LUKS encrypted OSDs
- **Solution**: Script automatically resolves via `lsblk -no pkname`

## Development Notes

### Code Structure
- **Single file design**: Easier to execute remotely via `exec()`
- **Minimal dependencies**: Uses only Python standard library
- **Color-coded output**: ANSI escape codes for terminal display
- **Debug mode**: Comprehensive logging when `--debug` enabled

### Notable Functions

**`run_command()`** ([ceph_osd_analyzer.py:34-56](ceph_osd_analyzer.py#L34-L56)): Universal command executor with SSH support and JSON parsing

**`get_device_path_for_osd()`** ([ceph_osd_analyzer.py:84-122](ceph_osd_analyzer.py#L84-L122)): Complex device resolution logic handling metadata, symlinks, and dm-devices

**`get_smart_data_remote()`** ([ceph_osd_analyzer.py:124-145](ceph_osd_analyzer.py#L124-L145)): Remote SMART data collection with device type detection

**`parse_smart_health()`** ([ceph_osd_analyzer.py:173-269](ceph_osd_analyzer.py#L173-L269)): SMART attribute parsing with device-class-specific logic

### Future Enhancement Opportunities

1. **Parallel data collection**: Use threading for faster cluster-wide analysis
2. **Historical trending**: Track scores over time to predict failures
3. **JSON output mode**: For integration with monitoring systems
4. **Cost-benefit analysis**: Factor in replacement drive costs
5. **PG rebalance impact**: Estimate data movement required

## Security Considerations

### Permissions Required
- Root access for `smartctl` execution
- SSH access to all OSD hosts
- Ceph admin keyring (read-only sufficient)

### Network Requirements
- Script assumes SSH connectivity between nodes
- No outbound internet access required (internal-only tool)
- Hardcoded internal git server URL: `http://10.10.10.63:3000`

### SSH Configuration
- Uses `-o StrictHostKeyChecking=no` for automated execution
- 5-second connection timeout to handle unreachable nodes
- Assumes key-based authentication is configured

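The SSH settings above could be assembled as in this sketch. It is not the script's `run_command()`: the helper names, the 30-second overall timeout, and the host name in the usage line are assumptions made for the example. The `subprocess` keywords used here are the ones available on Python 3.6.

```python
import subprocess


def build_ssh_command(host, remote_command, connect_timeout=5):
    """Return the argv list for running one command on a remote host."""
    return [
        "ssh",
        "-o", "StrictHostKeyChecking=no",             # no interactive host-key prompt
        "-o", f"ConnectTimeout={connect_timeout}",    # give up quickly on dead nodes
        host,
        remote_command,
    ]


def run_remote(host, remote_command):
    """Run a command over SSH; return stripped stdout, or None on failure."""
    try:
        proc = subprocess.run(
            build_ssh_command(host, remote_command),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=30,  # overall limit, an assumption for this sketch
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    return proc.stdout.strip() if proc.returncode == 0 else None


# "ceph-node1" is a placeholder host name
print(build_ssh_command("ceph-node1", "hostname"))
```

Passing the command as an argument list rather than a single shell string avoids local shell quoting issues; the remote side still interprets `remote_command` with its own shell.
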
## Related Infrastructure

**Internal Git Server**: `http://10.10.10.63:3000/LotusGuild/analyzeOSDs`

**Related Projects**:
- hwmonDaemon: Hardware monitoring daemon for continuous health checks
- Other LotusGuild infrastructure automation tools

## Maintenance

### Version Control
- Maintained in internal git repository
- One-line execution always pulls from `main` branch
- No formal versioning; latest commit is production

### Testing Checklist
- [ ] Test on cluster with mixed HDD/NVMe OSDs
- [ ] Verify SSH connectivity to all hosts
- [ ] Confirm SMART data retrieval for both device types
- [ ] Validate dm-device resolution on encrypted OSDs
- [ ] Check output formatting with various terminal widths
- [ ] Test `--class` and `--min-size` filtering

## Performance Characteristics

**Execution Time**: ~5-15 seconds per OSD depending on cluster size and SSH latency

**Bottlenecks**:
- Serial OSD processing (parallelization would help)
- SSH round-trip times for SMART data
- SMART data parsing can be slow for unresponsive drives

**Resource Usage**: Minimal CPU/memory, I/O bound on SSH operations

**Intended Audience**: LotusGuild infrastructure team

**Support**: Submit issues or pull requests to internal git repository

10
README.md
@@ -60,8 +60,16 @@ Run directly from your internal git server:
 ```bash
 sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())"
 ```
+Run directly from internal git server with debug enabled:
+```bash
+sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())" --debug
+```
+Most common execution
+```bash
+sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())" --debug --class hdd
+```
 
-### Traditional Installation
+### Traditional Installation (not recommended)
 
 ```bash
 # Clone repository
@@ -93,12 +93,27 @@ def get_device_path_for_osd(osd_id, hostname):
                 print(f"{Colors.GREEN}DEBUG: Found physical device from metadata: {device}{Colors.END}")
             return device
 
+        # Also try devices field which sometimes has the info
+        devices = metadata.get('devices')
+        if devices:
+            # devices might be comma-separated
+            first_dev = devices.split(',')[0].strip()
+            if first_dev and not first_dev.startswith('dm-'):
+                device = f"/dev/{first_dev}" if not first_dev.startswith('/dev/') else first_dev
+                if DEBUG:
+                    print(f"{Colors.GREEN}DEBUG: Found device from metadata.devices: {device}{Colors.END}")
+                return device
+
     # Fallback: follow the symlink
     result = run_command(f"readlink -f /var/lib/ceph/osd/ceph-{osd_id}/block", host=hostname)
     if result and result.startswith('/dev/'):
         # Check if it is a dm device, try to find underlying
-        if '/dev/dm-' in result:
+        if '/dev/dm-' in result or '/dev/mapper/' in result:
+            # Try multiple methods to resolve dm device
             base = run_command(f"lsblk -no pkname {result}", host=hostname)
+            if not base:
+                # Alternative: use ls -l on /dev/mapper
+                base = run_command(f"ls -l {result} | awk '{{print $NF}}' | xargs basename", host=hostname)
             if base:
                 device = f"/dev/{base.strip()}"
                 if DEBUG:
@@ -109,12 +124,20 @@ def get_device_path_for_osd(osd_id, hostname):
             print(f"{Colors.GREEN}DEBUG: Using device symlink {result}{Colors.END}")
         return result
 
-    # Last fallback: lsblk from block path
-    result = run_command(f"lsblk -no pkname /var/lib/ceph/osd/ceph-{osd_id}/block", host=hostname)
+    # Try alternative: lsblk with PKNAME (parent kernel name)
+    result = run_command(f"lsblk -no pkname /var/lib/ceph/osd/ceph-{osd_id}/block 2>/dev/null", host=hostname)
     if result:
         device = f"/dev/{result.strip()}"
         if DEBUG:
-            print(f"{Colors.GREEN}DEBUG: Found device from lsblk: {device}{Colors.END}")
+            print(f"{Colors.GREEN}DEBUG: Found device from lsblk pkname: {device}{Colors.END}")
+        return device
+
+    # Last resort: try to get from ceph-volume lvm list
+    result = run_command(f"ceph-volume lvm list | grep -A 20 'osd id.*{osd_id}' | grep 'devices' | awk '{{print $2}}'", host=hostname)
+    if result:
+        device = result.strip()
+        if DEBUG:
+            print(f"{Colors.GREEN}DEBUG: Found device from ceph-volume: {device}{Colors.END}")
         return device
 
     if DEBUG:
@@ -122,25 +145,61 @@ def get_device_path_for_osd(osd_id, hostname):
     return None
 
 def get_smart_data_remote(device_path, hostname):
-    """Get SMART data from a remote host with proper device type detection."""
+    """Get SMART data from a remote host with multiple fallback methods"""
     if not device_path:
         return None
 
-    # Strip partition suffix
-    base_device = re.sub(r'p?\d+$', '', device_path)
+    # Determine device type
+    tran = run_command(f"lsblk -no tran {device_path} 2>/dev/null", host=hostname)
+    tran = tran.strip() if tran else ""
 
-    # Detect type: NVMe or SATA
-    if 'nvme' in base_device:
-        dev_type = 'nvme'
+    # Try different command variations based on device type
+    commands_to_try = []
+
+    if tran == "nvme" or "nvme" in device_path:
+        commands_to_try = [
+            f"sudo smartctl -a -j {device_path} -d nvme",
+            f"smartctl -a -j {device_path} -d nvme", # Try without sudo
+            f"sudo smartctl -a -j {device_path}",
+        ]
+    elif tran == "usb":
+        # USB-connected drives need special device type flags
+        commands_to_try = [
+            f"sudo smartctl -a -j {device_path} -d sat", # SAT (SCSI-ATA Translation)
+            f"sudo smartctl -a -j {device_path} -d usbjmicron", # JMicron USB bridge
+            f"sudo smartctl -a -j {device_path} -d usbcypress", # Cypress USB bridge
+            f"sudo smartctl -a -j {device_path} -d usb", # Generic USB
+            f"sudo smartctl -a -j {device_path} -d scsi", # SCSI passthrough
+            f"sudo smartctl -a -j {device_path}", # Auto-detect
+        ]
+    elif tran == "sata":
+        commands_to_try = [
+            f"sudo smartctl -a -j {device_path}",
+            f"smartctl -a -j {device_path}",
+            f"sudo smartctl -a -j {device_path} -d ata",
+        ]
     else:
-        dev_type = 'sat' # sata/ata, compatible with SSD/HDD
+        # Unknown or no transport, try generic approaches including USB
+        commands_to_try = [
+            f"sudo smartctl -a -j {device_path}",
+            f"smartctl -a -j {device_path}",
+            f"sudo smartctl -a -j {device_path} -d sat", # Try USB/SAT
+            f"sudo smartctl -a -j {device_path} -d auto",
+        ]
 
-    cmd = f"sudo smartctl -a -j -d {dev_type} {base_device} 2>/dev/null"
-    result = run_command(cmd, host=hostname, parse_json=True)
-    if DEBUG and result is None:
-        print(f"{Colors.YELLOW}DEBUG: SMART data failed for {base_device} on {hostname}{Colors.END}")
-    return result
+    # Try each command until one succeeds
+    for cmd in commands_to_try:
+        result = run_command(f"{cmd} 2>/dev/null", host=hostname, parse_json=True)
+        if result and ('ata_smart_attributes' in result or 'nvme_smart_health_information_log' in result):
+            if DEBUG:
+                print(f"{Colors.GREEN}DEBUG: SMART success with: {cmd}{Colors.END}")
+            return result
 
+    if DEBUG:
+        print(f"{Colors.RED}DEBUG: All SMART methods failed for {device_path} on {hostname}{Colors.END}")
+        print(f"{Colors.YELLOW}DEBUG: Transport type detected: {tran if tran else 'unknown'}{Colors.END}")
+
+    return None
 
 def get_device_health(osd_id, hostname):
     """Get device SMART health metrics from the appropriate host"""
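The "try each command until one succeeds" pattern in this hunk can be exercised without SSH or real drives by injecting the runner. The sketch below is a simplified stand-in, not the script's code: `first_valid_smart` and the `runner` callable are hypothetical names, and the runner is assumed to return parsed JSON as a dict or `None`.

```python
SMART_KEYS = ("ata_smart_attributes", "nvme_smart_health_information_log")


def first_valid_smart(commands, runner):
    """Return (command, result) for the first command yielding SMART data.

    A result counts as valid only if it is a dict containing one of the
    SMART payload keys, mirroring the check in the hunk above.
    """
    for cmd in commands:
        result = runner(cmd)
        if isinstance(result, dict) and any(key in result for key in SMART_KEYS):
            return cmd, result
    return None, None


# Fake runner: the first command fails, the second returns JSON without
# SMART attributes, the third returns a usable payload.
canned = {
    "cmd-a": None,
    "cmd-b": {"smartctl": {"exit_status": 2}},
    "cmd-c": {"ata_smart_attributes": {"table": []}},
}
print(first_valid_smart(["cmd-a", "cmd-b", "cmd-c"], canned.get))
```

Checking for the payload keys rather than just a non-empty result matters because `smartctl -j` can emit valid JSON even when it fails to read the device.
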
@@ -150,9 +209,20 @@ def get_device_health(osd_id, hostname):
     # First try ceph's built-in health metrics
     data = run_command(f"ceph device query-daemon-health-metrics osd.{osd_id} -f json 2>/dev/null", parse_json=True)
 
-    if data and ('ata_smart_attributes' in data or 'nvme_smart_health_information_log' in data):
-        if DEBUG:
-            print(f"{Colors.GREEN}DEBUG: Got SMART data from ceph device query{Colors.END}")
-        return data
+    if data:
+        # Ceph returns data nested under device ID, extract it
+        if isinstance(data, dict) and len(data) > 0:
+            # Get the first (and usually only) device entry
+            device_data = next(iter(data.values())) if data else None
+            if device_data and ('ata_smart_attributes' in device_data or 'nvme_smart_health_information_log' in device_data):
+                if DEBUG:
+                    print(f"{Colors.GREEN}DEBUG: Got SMART data from ceph device query (nested format){Colors.END}")
+                return device_data
+
+        # Also check if data is already in the right format (backward compatibility)
+        if 'ata_smart_attributes' in data or 'nvme_smart_health_information_log' in data:
+            if DEBUG:
+                print(f"{Colors.GREEN}DEBUG: Got SMART data from ceph device query (direct format){Colors.END}")
+            return data
 
     # If that fails, get device path and query via SSH
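The nested-versus-direct handling in this hunk can be illustrated in isolation. The sketch below is a variant rather than a copy: unlike the hunk, it scans every nested entry instead of only the first and skips entries that are not dicts. The function name and the device ID in the example are hypothetical.

```python
SMART_KEYS = ("ata_smart_attributes", "nvme_smart_health_information_log")


def has_smart_payload(obj):
    """True if obj is a dict carrying one of the SMART payload keys."""
    return isinstance(obj, dict) and any(key in obj for key in SMART_KEYS)


def extract_smart_payload(data):
    """Return the SMART dict from nested or direct data, else None."""
    if not isinstance(data, dict) or not data:
        return None
    # Nested format: {"<device id>": {...smart data...}}
    for entry in data.values():
        if has_smart_payload(entry):
            return entry
    # Direct format: the SMART keys sit at the top level
    if has_smart_payload(data):
        return data
    return None


nested = {"VENDOR_MODEL_SERIAL": {"ata_smart_attributes": {"table": []}}}
direct = {"nvme_smart_health_information_log": {"media_errors": 0}}
print(extract_smart_payload(nested))
print(extract_smart_payload(direct))
```
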
@@ -175,7 +245,8 @@ def parse_smart_health(smart_data):
     metrics = {}
 
     if not smart_data:
-        return 50.0, ["No SMART data available"], metrics
+        # CRITICAL: Failed SMART reads are a red flag - could indicate drive issues
+        return 0.0, ["CRITICAL: No SMART data available - drive may be failing"], metrics
 
     # Check for HDD SMART data
     if 'ata_smart_attributes' in smart_data:
@@ -187,33 +258,39 @@ def parse_smart_health(smart_data):
             value = attr.get('value', 0)
             raw_value = attr.get('raw', {}).get('value', 0)
 
-            # Reallocated Sectors (5)
+            # Reallocated Sectors (5) - CRITICAL indicator of imminent failure
             if attr_id == 5:
                 metrics['reallocated_sectors'] = raw_value
                 if raw_value > 0:
-                    score -= min(20, raw_value * 2)
-                    issues.append(f"Reallocated sectors: {raw_value}")
+                    # ANY reallocated sectors is a severe problem
+                    if raw_value >= 10:
+                        score -= 95 # Drive is failing, near-zero health
+                    elif raw_value >= 5:
+                        score -= 85 # Critical failure imminent
+                    else:
+                        score -= 70 # Even 1-4 sectors is very serious
+                    issues.append(f"CRITICAL: Reallocated sectors: {raw_value} - DRIVE FAILING")
 
-            # Spin Retry Count (10)
+            # Spin Retry Count (10) - CRITICAL
             elif attr_id == 10:
                 metrics['spin_retry'] = raw_value
                 if raw_value > 0:
-                    score -= min(15, raw_value * 3)
-                    issues.append(f"Spin retry count: {raw_value}")
+                    score -= min(40, raw_value * 10)
+                    issues.append(f"CRITICAL: Spin retry count: {raw_value}")
 
-            # Pending Sectors (197)
+            # Pending Sectors (197) - CRITICAL
             elif attr_id == 197:
                 metrics['pending_sectors'] = raw_value
                 if raw_value > 0:
-                    score -= min(25, raw_value * 5)
-                    issues.append(f"Pending sectors: {raw_value}")
+                    score -= min(60, raw_value * 10)
+                    issues.append(f"CRITICAL: Pending sectors: {raw_value}")
 
-            # Uncorrectable Sectors (198)
+            # Uncorrectable Sectors (198) - CRITICAL
             elif attr_id == 198:
                 metrics['uncorrectable_sectors'] = raw_value
                 if raw_value > 0:
-                    score -= min(30, raw_value * 5)
-                    issues.append(f"Uncorrectable sectors: {raw_value}")
+                    score -= min(70, raw_value * 15)
+                    issues.append(f"CRITICAL: Uncorrectable sectors: {raw_value}")
 
             # Temperature (190, 194)
             elif attr_id in [190, 194]:
@@ -250,11 +327,11 @@ def parse_smart_health(smart_data):
             score -= min(30, (pct_used - 80) * 1.5)
             issues.append(f"High wear: {pct_used}%")
 
-        # Media errors
+        # Media errors - CRITICAL for NVMe
         media_errors = nvme_health.get('media_errors', 0)
         if media_errors > 0:
-            score -= min(25, media_errors * 5)
-            issues.append(f"Media errors: {media_errors}")
+            score -= min(60, media_errors * 10)
+            issues.append(f"CRITICAL: Media errors: {media_errors}")
 
         # Temperature
         temp = nvme_health.get('temperature', 0)
@@ -429,13 +506,36 @@ def analyze_cluster():
             node, host_name, host_osds_map, osd_tree
         )
 
-        # Calculate total score (weighted: 60% health, 30% capacity, 10% resilience)
-        total_score = (
-            (100 - health_score) * 0.60 + # Health is most important
-            capacity_score * 0.30 + # Capacity optimization
-            resilience_score * 0.10 # Cluster resilience
+        # Calculate total score with revised weights
+        # Priority: Failed drives > Small failing drives > Small drives > Any failing
+        has_health_issues = len(health_issues) > 0
+        has_critical_issues = any('CRITICAL:' in issue and ('Reallocated' in issue or 'Uncorrectable' in issue or 'Pending' in issue)
+                                  for issue in health_issues)
+        is_small = osd_df_data.get('crush_weight', 0) < 5
+
+        # Base scoring: 80% health, 15% capacity, 5% resilience
+        base_score = (
+            (100 - health_score) * 0.80 + # Health is critical
+            capacity_score * 0.15 + # Capacity matters for small drives
+            resilience_score * 0.05 # Cluster resilience (minor)
         )
+
+        # Apply multipliers for priority combinations
+        if health_score == 0: # Failed SMART reads
+            if is_small:
+                base_score += 30 # Failed SMART + small = top priority
+            else:
+                base_score += 20 # Failed SMART alone is still critical
+        elif has_critical_issues: # Reallocated/pending/uncorrectable sectors
+            if is_small:
+                base_score += 25 # Critical issues + small drive
+            else:
+                base_score += 20 # Critical issues alone
+        elif has_health_issues and is_small:
+            base_score += 15 # Small + beginning to fail
+
+        total_score = min(100, base_score) # Cap at 100
 
         candidates.append({
             'osd_id': osd_id,
             'osd_name': osd_name,
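A worked example may help when reviewing the revised weighting in this hunk. The sketch below restates the logic as a standalone function so it can be run with sample values; the function name and parameter list are assumptions, and the priority adjustments are additive bonuses even though the hunk's comment calls them multipliers.

```python
def revised_total_score(health_score, capacity_score, resilience_score,
                        health_issues, crush_weight):
    """Standalone restatement of the revised scoring in the hunk above."""
    sector_words = ("Reallocated", "Uncorrectable", "Pending")
    has_health_issues = len(health_issues) > 0
    has_critical_issues = any(
        "CRITICAL:" in issue and any(word in issue for word in sector_words)
        for issue in health_issues
    )
    is_small = crush_weight < 5

    # Base scoring: 80% inverted health, 15% capacity, 5% resilience
    base = ((100 - health_score) * 0.80
            + capacity_score * 0.15
            + resilience_score * 0.05)

    # Additive bonuses for priority combinations
    if health_score == 0:
        base += 30 if is_small else 20
    elif has_critical_issues:
        base += 25 if is_small else 20
    elif has_health_issues and is_small:
        base += 15

    return min(100, base)  # cap at 100


# Health 30 (for example, 2 reallocated sectors: 100 - 70), capacity 30,
# resilience 0, crush weight 3.6: 70 * 0.80 + 30 * 0.15 + 0 + 25 = 85.5
print(revised_total_score(
    30, 30, 0,
    ["CRITICAL: Reallocated sectors: 2 - DRIVE FAILING"],
    3.6,
))
```
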