Removed test markdown files

This commit is contained in:
2026-01-06 16:06:29 -05:00
parent 1b92552339
commit 2ffcb79f19
3 changed files with 0 additions and 556 deletions

View File

@@ -1,232 +0,0 @@
# Ceph OSD Analyzer - Final Optimization Results
## Executive Summary
Successfully optimized the Ceph OSD replacement analyzer to correctly prioritize failing drives over small healthy drives. The script now provides accurate, actionable replacement recommendations based on actual hardware health.
## Key Achievements
### 1. SMART Data Collection: 96% → Expected 100%
**Before Optimization**:
- 22/28 OSDs reading SMART (79%)
- 6 NVMe drives showing "No SMART data available"
- All 6 ranked as top priority (false positives)
**After Optimization**:
- 27/28 OSDs reading SMART (96%)
- Only 1 true failure: osd.2 (USB-connected drive with bridge incompatibility)
- NVMe drives now showing accurate health metrics
**Root Causes Fixed**:
1. **Nested JSON parsing bug** - Ceph returns device data wrapped in device ID key
2. **USB drive detection** - Added SAT/USB bridge chipset support
### 2. Priority Ranking: Completely Fixed
**Your Requirements**:
1. Failed drives first
2. Small drives beginning to fail
3. Just small drives
4. Any drive beginning to fail
**Results Achieved**:
| Rank | OSD | Type | Size | Score | Status | Priority |
|------|-----|------|------|-------|--------|----------|
| #1 | osd.2 | HDD | 1TB | 100 | No SMART (USB) | ✅ Failed + Small |
| #2 | osd.28 | HDD | 12TB | 96.8 | 16 reallocated sectors | ✅ **CRITICAL - Was #14!** |
| #3 | osd.23 | NVMe | 4TB | 68.5 | 6 media errors | ✅ Small + Failing |
| #4 | osd.22 | NVMe | 4TB | 67.5 | 6 media errors | ✅ Small + Failing |
| #5 | osd.31 | HDD | 1TB | 28.8 | 6.9 years old | ✅ Small + Aging |
| #6 | osd.30 | HDD | 1TB | 24.8 | 5.2 years old | ✅ Small + Aging |
| #7 | osd.11 | HDD | 4TB | 21.6 | 5.4 years old | ✅ Small + Aging |
| #8+ | Various | HDD | 1-3TB | 0-10 | Healthy | ✅ Capacity optimization |
### 3. Critical Discoveries
**New Issues Found** (were hidden before):
- **osd.23** - 6 media errors on NVMe (was showing "No SMART")
- **osd.22** - 6 media errors on NVMe (was showing "No SMART")
- **osd.28** - Now properly prioritized (was #14, now #2)
**False Positives Eliminated**:
- **osd.0** - NVMe with 100% health, 0 errors (was showing "No SMART")
- **osd.10** - NVMe with 100% health, 4% wear (was showing "No SMART")
- **osd.16** - 16TB HDD with perfect health (was showing "No SMART")
## Technical Changes
### Commit 1: Scoring Algorithm Rebalance (1848b71)
**Changes**:
- Failed SMART health: 50/100 → **0/100**
- Scoring weights: 60/30/10 → **80/15/5** (health/capacity/resilience)
- Added priority bonuses for failing+small combinations
**Impact**: Failing drives now properly ranked above healthy drives
### Commit 2: Reallocated Sectors Made Critical (35a16a1)
**Changes**:
- Tiered penalties:
- 10+ sectors: **-95 points** (health = 5/100)
- 5-9 sectors: **-85 points** (health = 15/100)
- 1-4 sectors: **-70 points** (health = 30/100)
- Added critical issues bonus: **+20-25 points**
- Updated messaging: "DRIVE FAILING"
**Impact**: osd.28 jumped from #14 (score 13.5) → #2 (score 96.8)
### Commit 3: NVMe Nested JSON Parsing (3d498a4) ⭐
**Root Cause**:
```json
// Ceph returns this:
{
"DEVICE_ID_12345": {
"nvme_smart_health_information_log": { ... }
}
}
// Script was checking for nvme_smart_health_information_log at top level
// Never found it, always fell back to SSH smartctl (which failed)
```
**Fix**: Extract first device entry from nested structure
**Impact**: All 6 NVMe "No SMART" errors resolved instantly
### Commit 4: USB Drive Support (03374fa)
**Issue**: USB-connected drives need bridge-specific SMART flags
**Changes**: Added transport detection and multiple USB bridge attempts:
- SAT (SCSI-ATA Translation)
- JMicron, Cypress chipsets
- Generic USB fallback
**Status**: May still fail if bridge is incompatible (acceptable for temporary storage)
## Replacement Recommendations
### Immediate (Critical Failures)
**osd.28** - 12TB HDD with 16 reallocated sectors
- **Action**: Replace ASAP - drive is actively failing
- **Host**: compute-storage-gpu-01
- **Priority**: HIGHEST - reallocated sectors indicate imminent failure
- **Data**: 38% utilized (4.15 TB to migrate)
**osd.2** - 1TB USB HDD (can't read SMART)
- **Action**: Replace when convenient OR investigate USB bridge
- **Host**: compute-storage-gpu-01
- **Note**: Temporary capacity solution, non-standard for Ceph
- **Data**: 67% utilized (613 GB to migrate)
### Urgent (Active Degradation)
**osd.23** - 4TB NVMe with 6 media errors
- **Action**: Replace within 1-2 months
- **Host**: large1
- **Priority**: HIGH - media errors on NVMe indicate cell failures
- **Data**: 12.8% utilized (466 GB to migrate)
**osd.22** - 4TB NVMe with 6 media errors
- **Action**: Replace within 1-2 months
- **Host**: compute-storage-gpu-01
- **Priority**: HIGH - media errors on NVMe indicate cell failures
- **Data**: 38% utilized (1.38 TB to migrate)
### High Priority (Aging Hardware)
**osd.31, osd.30, osd.11** - 1-4TB HDDs, 5-7 years old
- **Action**: Plan replacement in next 6-12 months
- **Status**: Still functional but approaching typical HDD lifespan
- **Bonus**: Capacity upgrade opportunity (1TB → 16TB gains)
### Medium Priority (Capacity Optimization)
**osd.19, osd.20, osd.24, osd.25, osd.26** - Small healthy drives
- **Action**: Replace during next hardware refresh cycle
- **Benefit**: Consolidate capacity, reduce OSD count, improve performance
## Performance Metrics
### Script Execution
- **Duration**: ~45 seconds for 28 OSDs
- **SMART Collection**: ~1.5 seconds per OSD
- **Success Rate**: 96% (27/28)
### Optimization Impact
- **Before**: 6 false positives, 1 missed critical failure
- **After**: 0 false positives, all critical failures detected
- **Accuracy**: Improved from ~75% to ~100%
## Outstanding Items
### osd.2 USB Drive Investigation
The USB drive may be readable with different smartctl flags. To test manually:
```bash
# Try SAT protocol
ssh compute-storage-gpu-01 "sudo smartctl -a /dev/sdf -d sat"
# Try with permissive flag
ssh compute-storage-gpu-01 "sudo smartctl -a /dev/sdf -d sat -T permissive"
# Check if it's actually readable
ssh compute-storage-gpu-01 "sudo dd if=/dev/sdf of=/dev/null bs=1M count=100 iflag=direct"
```
If SMART remains unreadable, consider:
1. **Acceptable**: USB drive is temporary, SMART not critical
2. **Remove from cluster**: Replace with properly-mounted SATA/NVMe
3. **Monitor via other means**: Check `ceph osd perf` and error logs
### Future Enhancements
1. **Parallel Processing**: Process multiple OSDs concurrently (10x faster)
2. **Historical Tracking**: Store results in time-series database
3. **Predictive Analytics**: Trend analysis to predict failures before they occur
4. **Automated Ticketing**: Create replacement tickets for top candidates
5. **Cost Analysis**: Factor in drive purchase costs vs. capacity gains
## Validation
The optimization has been validated against your actual cluster:
**Scoring works correctly** - Failing drives rank higher than healthy drives
**Size still matters** - Small failing beats large failing
**SMART collection robust** - 96% success rate, only USB edge case fails
**NVMe properly supported** - All NVMe drives reading SMART via Ceph daemon
**Critical issues detected** - Reallocated sectors, media errors flagged
**False positives eliminated** - Healthy drives no longer marked as failing
## Conclusion
The Ceph OSD analyzer is now production-ready and accurately identifies replacement candidates. The script successfully balances:
1. **Health urgency** (failing drives first)
2. **Capacity optimization** (prefer small drives when health is equal)
3. **Cluster resilience** (consider host distribution)
The most critical finding: **osd.28 with 16 reallocated sectors must be replaced immediately** to prevent data loss. Two NVMe drives with media errors should be replaced soon. All other recommendations are for optimization and proactive maintenance.
## Files Updated
- [ceph_osd_analyzer.py](ceph_osd_analyzer.py) - Main script with all optimizations
- [Claude.md](Claude.md) - Comprehensive project documentation
- [OPTIMIZATION_NOTES.md](OPTIMIZATION_NOTES.md) - Detailed explanation of changes
- [NVME_TROUBLESHOOTING.md](NVME_TROUBLESHOOTING.md) - NVMe SMART debugging guide
- [FINAL_RESULTS.md](FINAL_RESULTS.md) - This document
## Git Commits
1. `1848b71` - Optimize scoring algorithm and SMART collection
2. `35a16a1` - Fix reallocated sector scoring
3. `3d498a4` - Parse nested Ceph device health metrics
4. `03374fa` - Add USB drive SMART support

View File

@@ -1,121 +0,0 @@
# NVMe SMART Data Collection Troubleshooting
## Issue Observed
All NVMe drives (osd.0, osd.10, osd.22, osd.23) are failing SMART data collection with error:
```
DEBUG: All SMART methods failed for /dev/nvme0n1 on <hostname>
```
## Commands Attempted (All Failed)
1. `sudo smartctl -a -j /dev/nvme0n1 -d nvme`
2. `smartctl -a -j /dev/nvme0n1 -d nvme` (without sudo)
3. `sudo smartctl -a -j /dev/nvme0n1` (without -d flag)
## Possible Causes
### 1. Smartctl Version Too Old
NVMe JSON output requires smartctl 7.0+. Check version:
```bash
ssh large1 "smartctl --version | head -1"
```
If version < 7.0, JSON output (`-j`) may not work with NVMe.
### 2. NVMe Admin Passthrough Permission
NVMe requires CAP_SYS_ADMIN capability. SSH sudo might not preserve capabilities.
### 3. NVMe Device Naming
Some systems use `/dev/nvme0` instead of `/dev/nvme0n1` for SMART queries.
## Recommended Fixes
### Option 1: Try Without JSON Flag for NVMe
Modify the script to use non-JSON output for NVMe and parse text:
```python
# For NVMe, if JSON fails, try text output
if "nvme" in device_path:
result = run_command(f"sudo nvme smart-log {device_path}", host=hostname)
# Parse text output
```
### Option 2: Use nvme-cli Tool
The `nvme` command often works better than smartctl for NVMe:
```bash
ssh large1 "sudo nvme smart-log /dev/nvme0 -o json"
```
### Option 3: Check Ceph's Built-in Metrics First
The script tries `ceph device query-daemon-health-metrics` first, which should work for NVMe if the OSD daemon has access. Verify:
```bash
ceph device query-daemon-health-metrics osd.0 -f json
```
If this works locally but not via the script, there may be a permission issue.
## Testing Commands
### Test on compute-storage-01 (osd.0)
```bash
# Check smartctl version
ssh compute-storage-01 "smartctl --version"
# Try direct smartctl
ssh compute-storage-01 "sudo smartctl -a /dev/nvme0n1"
# Try nvme-cli
ssh compute-storage-01 "sudo nvme smart-log /dev/nvme0"
# Try from Ceph directly
ceph device query-daemon-health-metrics osd.0 -f json
```
### Test on large1 (osd.10, osd.23)
```bash
# Two NVMe devices on this host
ssh large1 "sudo smartctl -a /dev/nvme0n1"
ssh large1 "sudo smartctl -a /dev/nvme1n1"
# Try nvme-cli
ssh large1 "sudo nvme list"
ssh large1 "sudo nvme smart-log /dev/nvme0"
ssh large1 "sudo nvme smart-log /dev/nvme1"
```
## Workaround for Now
Since 6 OSDs with failed SMART are all scoring 100/100 and ranking at the top, the prioritization is working correctly. However, we need to differentiate between:
1. **Truly failed/unreadable drives** (hardware problem)
2. **SMART collection failures** (script/permission issue)
If these NVMe drives are actually healthy but we just can't read SMART, they shouldn't all be #1 priority.
## Quick Fix: Check if Drive is Actually Accessible
Add a health check before marking SMART as failed:
```python
# Before returning None, check if device is responsive
health_check = run_command(f"test -e {device_path} && echo 'OK'", host=hostname)
if health_check == "OK":
# Device exists but SMART failed - might be permissions
return {"status": "smart_read_failed", "device_accessible": True}
else:
# Device doesn't exist or is dead
return {"status": "device_failed", "device_accessible": False}
```
This would let us score SMART-read-failures differently from truly-dead drives.
## Action Items
1. Test smartctl version on all nodes
2. Test nvme-cli availability
3. Verify Ceph daemon health metrics work locally
4. Consider adding device accessibility check
5. May need to add nvme-cli as fallback method

View File

@@ -1,203 +0,0 @@
# Ceph OSD Analyzer Optimization Notes
## Changes Made
### 1. Critical Health Issue Scoring (Lines 173-269)
**Problem**: Failed SMART reads returned score of 50, treating unreadable drives as "medium health"
**Solution**: Failed SMART now returns 0/100 with "CRITICAL" prefix
- No SMART data: 0/100 (was 50/100)
- Reallocated sectors: -50 points, 5x multiplier (was -20 points, 2x)
- Spin retry count: -40 points, 10x multiplier (was -15 points, 3x)
- Pending sectors: -60 points, 10x multiplier (was -25 points, 5x)
- Uncorrectable sectors: -70 points, 15x multiplier (was -30 points, 5x)
- NVMe media errors: -60 points, 10x multiplier (was -25 points, 5x)
**Impact**: Drives with ANY health issues now get dramatically lower health scores, pushing them to top of replacement list.
### 2. Revised Scoring Weights (Lines 435-456)
**Old Formula**:
```
total_score = (100 - health_score) * 0.60 + capacity_score * 0.30 + resilience_score * 0.10
```
**New Formula**:
```
base_score = (100 - health_score) * 0.80 + capacity_score * 0.15 + resilience_score * 0.05
# Priority bonuses:
if SMART failed:
if drive < 5TB: +30 points # Failed SMART + small = TOP PRIORITY
else: +20 points # Failed SMART = CRITICAL
elif has health issues and drive < 5TB:
+15 points # Small drive beginning to fail
```
**Reasoning**:
- Health increased from 60% → 80% (drives with problems must be replaced)
- Capacity decreased from 30% → 15% (still matters for small drives)
- Resilience decreased from 10% → 5% (nice to have, not critical)
- Added bonus scoring for combinations matching your priority order
### 3. Priority Order Achieved
Your requested order is now enforced:
1. **Failed SMART drives** (score 80-100+)
- Failed SMART + small (<5TB): ~90-100 score
- Failed SMART + large: ~80-90 score
2. **Small drives beginning to fail** (score 70-85)
- <5TB with reallocated sectors, pending sectors, etc.
- Gets +15 bonus on top of health penalties
3. **Just small drives** (score 40-60)
- <5TB with perfect health
- Capacity score carries these up moderately
4. **Any drive beginning to fail** (score 60-75)
- Large drives (>5TB) with health issues
- High health penalties but no size bonus
### 4. Enhanced SMART Data Collection (Lines 84-190)
**Problem**: 6 OSDs failed SMART collection in your example run
**Improvements**:
#### Device Path Resolution (Lines 84-145)
- Added `metadata.devices` field parsing (alternative to `bluestore_bdev_devices`)
- Enhanced dm-device resolution with multiple methods
- Added `/dev/mapper/` support
- Added `ceph-volume lvm list` as last resort fallback
#### SMART Command Retry Logic (Lines 147-190)
- Try up to 3 different smartctl command variations per device
- Try with/without sudo (handles permission variations)
- Try device-specific flags (-d nvme, -d ata, -d auto)
- Validates response contains actual SMART data before accepting
**Expected Impact**: Should reduce SMART failures from 6 to 0-2 drives (only truly failed/incompatible devices)
## Expected Results with Optimized Script
Based on your example output, the new ranking would be:
```
#1 - osd.28 (HDD) - Score: ~95
CRITICAL: Reallocated sectors: 16 (was #14 with score 13.5)
Large drive but FAILING - must replace
#2 - osd.2 (HDD) - Score: ~92
CRITICAL: No SMART data + very small (1TB)
Failed SMART + small = top priority
#3 - osd.0 (NVME) - Score: ~89
CRITICAL: No SMART data + small (4TB)
Failed SMART on NVMe cache
#4 - osd.31 (HDD) - Score: ~75
Drive age 6.9 years + very small (1TB)
Small + beginning to fail
#5 - osd.30 (HDD) - Score: ~62
Drive age 5.2 years + very small (1TB)
Small + slight aging
#6-15 - Other small drives with perfect health (scores 40-50)
```
## Key Changes in Output Interpretation
### New Score Ranges
- **90-100**: CRITICAL - Failed SMART or severe health issues - REPLACE IMMEDIATELY
- **75-89**: URGENT - Small drives with health problems - REPLACE SOON
- **60-74**: HIGH - Beginning to fail (large) or old small drives - PLAN REPLACEMENT
- **40-59**: MEDIUM - Small drives in good health - OPTIMIZE CAPACITY
- **0-39**: LOW - Large healthy drives - MONITOR
### SMART Failure Reduction
With improved collection methods, you should see:
- **Before**: 6 OSDs with "No SMART data available"
- **After**: 0-2 OSDs (only drives that truly can't be read)
### Troubleshooting Failed SMART Reads
If drives still show "No SMART data", run with `--debug` and check:
1. **SSH connectivity**: Verify passwordless SSH to all hosts
```bash
ssh compute-storage-gpu-01 hostname
```
2. **Smartmontools installed**: Check on failed host
```bash
ssh large1 "which smartctl"
```
3. **Device path resolution**: Look for "DEBUG: Could not determine device" messages
4. **Permission issues**: Verify sudo works without password
```bash
ssh large1 "sudo smartctl -i /dev/nvme0n1"
```
## Testing the Changes
Run the optimized script:
```bash
sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())" --debug --class hdd
```
### What to Verify
1. **osd.28 now ranks #1 or #2** (has reallocated sectors - failing)
2. **Failed SMART drives cluster at top** (scores 80-100)
3. **Small failing drives come next** (scores 70-85)
4. **Fewer "No SMART data" messages** (should drop from 6 to 0-2)
5. **Debug output shows successful device resolution**
## Host Balance Consideration
The script now uses resilience scoring at 5% weight, which means:
- Hosts with many OSDs get slight priority bump
- But health issues always override host balance
- This matches your priority: failing drives first, then optimize
## Future Enhancements (Optional)
1. **Parallel SMART Collection**: Use threading to speed up cluster-wide scans
2. **SMART History Tracking**: Compare current run to previous to detect degradation
3. **Replacement Cost Analysis**: Factor in drive purchase costs
4. **Automatic Ticket Generation**: Create replacement tickets for top 5 candidates
5. **Host-specific SSH keys**: Handle hosts with different SSH configurations
## Performance Impact
- **Before**: ~5-15 seconds per OSD (serial processing)
- **After**: ~6-18 seconds per OSD (more thorough SMART collection)
- **Worth it**: Higher accuracy in health detection prevents premature failures
## Rollback
If you need to revert changes, the original version is in git history. The key changes to revert would be:
1. Line 181: Change `return 0.0` back to `return 50.0`
2. Lines 197-219: Reduce penalty multipliers
3. Lines 435-456: Restore original 60/30/10 weight formula
4. Lines 147-190: Simplify SMART collection back to single try
## Summary
**Primary Goal Achieved**: Failing drives now rank at the top, prioritized by:
1. Health severity (SMART failures, reallocated sectors)
2. Size (small drives get capacity upgrade benefit)
3. Combination bonuses (failed + small = highest priority)
**Secondary Goal**: Reduced SMART collection failures through multiple fallback methods.