Ceph OSD Replacement Analyzer - Project Documentation
Project Overview
Purpose: Intelligent analysis tool for identifying optimal Ceph OSD replacement candidates across an entire cluster by analyzing health metrics, capacity optimization potential, and cluster resilience factors.
Type: Python 3 CLI tool for Ceph storage cluster maintenance
Target Users: Storage administrators, DevOps engineers, and infrastructure teams managing Ceph clusters
Architecture
Core Components
- Data Collection Layer (ceph_osd_analyzer.py:34-172)
  - Executes Ceph commands locally and via SSH
  - Retrieves SMART data from all cluster nodes
  - Handles both local `ceph device query-daemon-health-metrics` and remote `smartctl` fallback
  - Device path resolution with dm-device mapping support
- Analysis Engine (ceph_osd_analyzer.py:173-357)
  - SMART health parsing for HDD and NVMe devices
  - Capacity optimization scoring
  - Cluster resilience impact calculation
  - Multi-factor weighted scoring system
- Reporting System (ceph_osd_analyzer.py:361-525)
  - Color-coded console output
  - Top 15 ranked replacement candidates
  - Summary by device class (HDD/NVMe)
  - Per-host analysis breakdown
Key Design Decisions
Remote SMART Data Collection: The script uses SSH to gather SMART data from all cluster nodes, not just the local node. This is critical because OSDs are distributed across multiple physical hosts.
Fallback Strategy: Primary method uses ceph device query-daemon-health-metrics, with automatic fallback to direct smartctl queries via SSH if Ceph's built-in metrics are unavailable.
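A minimal sketch of that fallback flow is below; the function name and parameters are illustrative rather than the script's actual API, and `smartctl`'s JSON mode (`-j`, smartmontools 7+) is assumed for brevity:

```python
import json
import subprocess

def get_smart_metrics(osd_id, osd_host, dev_path):
    """Illustrative fallback: Ceph's built-in health metrics first, then smartctl over SSH."""
    # Primary: the daemon's cached health metrics (output is assumed to be JSON).
    try:
        out = subprocess.run(
            ["ceph", "device", "query-daemon-health-metrics", f"osd.{osd_id}"],
            capture_output=True, text=True, timeout=30, check=True,
        )
        data = json.loads(out.stdout)
        if data:
            return data
    except (subprocess.SubprocessError, ValueError):
        pass  # fall through to the SSH path

    # Fallback: query smartctl directly on the OSD's host.
    ssh_cmd = [
        "ssh", "-o", "StrictHostKeyChecking=no", "-o", "ConnectTimeout=5",
        osd_host, f"sudo smartctl -a -j {dev_path}",
    ]
    out = subprocess.run(ssh_cmd, capture_output=True, text=True, timeout=60)
    return json.loads(out.stdout) if out.returncode == 0 else None
```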
Device Mapping: Handles complex storage configurations including device-mapper devices, resolving them to physical drives using lsblk and symlink resolution.
Weighted Scoring: 60% health, 30% capacity optimization, 10% resilience - prioritizes failing drives while considering operational efficiency.
Scoring Algorithm
Health Score (60% weight)
HDD Metrics (ceph_osd_analyzer.py:183-236):
- Reallocated sectors (ID 5): -20 points for any presence
- Spin retry count (ID 10): -15 points
- Pending sectors (ID 197): -25 points (critical indicator)
- Uncorrectable sectors (ID 198): -30 points (critical)
- Temperature (ID 190/194): -10 points if >60°C
- Age (ID 9): -15 points if >5 years
NVMe Metrics (ceph_osd_analyzer.py:239-267):
- Available spare: penalized if <50%
- Percentage used: -30 points if >80%
- Media errors: -25 points for any errors
- Temperature: -10 points if >70°C
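A hedged sketch of how these penalties could be applied; the attribute handling and the low-spare penalty value are illustrative (the document does not state an exact figure for the latter), and the real logic lives in parse_smart_health():

```python
def score_hdd_health(attrs):
    """Start at 100 and subtract the documented penalties.
    `attrs` is assumed to map SMART attribute ID -> raw value."""
    score = 100
    if attrs.get(5, 0) > 0:        # Reallocated sectors
        score -= 20
    if attrs.get(10, 0) > 0:       # Spin retry count
        score -= 15
    if attrs.get(197, 0) > 0:      # Pending sectors (critical)
        score -= 25
    if attrs.get(198, 0) > 0:      # Uncorrectable sectors (critical)
        score -= 30
    if attrs.get(194, attrs.get(190, 0)) > 60:   # Temperature in °C
        score -= 10
    if attrs.get(9, 0) / (24 * 365) > 5:         # Power-on hours -> years
        score -= 15
    return max(score, 0)

def score_nvme_health(log):
    """`log` is assumed to resemble smartctl's nvme_smart_health_information_log."""
    score = 100
    if log.get("available_spare", 100) < 50:
        score -= 20                # placeholder penalty; exact value not documented above
    if log.get("percentage_used", 0) > 80:
        score -= 30
    if log.get("media_errors", 0) > 0:
        score -= 25
    if log.get("temperature", 0) > 70:
        score -= 10
    return max(score, 0)
```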
Capacity Score (30% weight)
(ceph_osd_analyzer.py:271-311)
- Small drives prioritized: <2TB = +40 points (maximum capacity gain)
- Medium drives: 2-5TB = +30 points, 5-10TB = +15 points
- High utilization penalty: >70% = -15 points (migration complexity)
- Host balance bonus: +15 points if below host average weight
Resilience Score (10% weight)
(ceph_osd_analyzer.py:313-357)
- Hosts with >20% above average OSD count: +20 points
- Presence of down OSDs on same host: +15 points (hardware issues)
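One plausible way to combine the three components into the final replacement score, assuming each is on a roughly 0-100 scale and that the health contribution is inverted so failing drives rank highest (the script's exact formula may differ):

```python
def replacement_score(health, capacity, resilience):
    """Weighted replacement score: 60% (inverted) health, 30% capacity, 10% resilience."""
    return (100 - health) * 0.60 + capacity * 0.30 + resilience * 0.10

# Example: health 45 (pending sectors), capacity 55 (small drive on an
# over-weighted host), resilience 35 (crowded host with a down OSD):
# (100 - 45) * 0.6 + 55 * 0.3 + 35 * 0.1 = 33 + 16.5 + 3.5 = 53 -> high priority
```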
Usage Patterns
One-Line Execution (Recommended)
sudo python3 -c "import urllib.request; exec(urllib.request.urlopen('http://10.10.10.63:3000/LotusGuild/analyzeOSDs/raw/branch/main/ceph_osd_analyzer.py').read().decode())" --debug --class hdd
Why: Always uses latest version, no local installation, integrates easily into automation.
Command-Line Options
- `--class [hdd|nvme]`: Filter by device type
- `--min-size N`: Minimum OSD size in TB
- `--debug`: Enable verbose debugging output
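A minimal argparse sketch that matches these flags (defaults and help text are assumptions, not copied from the script):

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Identify optimal Ceph OSD replacement candidates")
    # `--class` is a Python keyword, so the parsed value is stored as `device_class`.
    parser.add_argument("--class", dest="device_class", choices=["hdd", "nvme"],
                        help="Filter by device type")
    parser.add_argument("--min-size", type=float, metavar="N",
                        help="Minimum OSD size in TB")
    parser.add_argument("--debug", action="store_true",
                        help="Enable verbose debugging output")
    return parser.parse_args()
```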
Typical Workflow
- Run analysis during maintenance window
- Identify top 3-5 candidates with scores >70
- Review health issues and capacity gains
- Plan replacement based on available hardware
- Execute OSD out/destroy/replace operations
Dependencies
Required Packages
- Python 3.6+ (standard library only, no external dependencies)
- `smartmontools` package (`smartctl` binary)
- SSH access configured between all cluster nodes
Required Permissions
- Ceph admin keyring access
- `sudo` privileges for SMART data retrieval
- SSH key-based authentication to all OSD hosts
Ceph Commands Used
- `ceph osd tree -f json`: Cluster topology
- `ceph osd df -f json`: Disk usage statistics
- `ceph osd metadata osd.N -f json`: OSD device information
- `ceph device query-daemon-health-metrics osd.N`: SMART data
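For illustration, the topology and usage commands can be run and parsed as below; the field names follow the usual `ceph ... -f json` layout but may vary between Ceph releases:

```python
import json
import subprocess

def ceph_json(*args):
    """Run a ceph subcommand with JSON output and return the parsed result."""
    out = subprocess.run(["ceph", *args, "-f", "json"],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

osd_tree = ceph_json("osd", "tree")
osd_ids = [n["id"] for n in osd_tree["nodes"] if n["type"] == "osd"]

osd_df = ceph_json("osd", "df")
utilization = {n["name"]: n["utilization"] for n in osd_df["nodes"]}
```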
Output Interpretation
Replacement Score Ranges
- 70-100 (RED): Critical - immediate replacement recommended
- 50-69 (YELLOW): High priority - plan replacement soon
- 30-49: Medium priority - next upgrade cycle
- 0-29 (GREEN): Low priority - healthy drives
Health Score Ranges
- 80-100 (GREEN): Excellent condition
- 60-79 (YELLOW): Monitor for issues
- 40-59: Fair - multiple concerns
- 0-39 (RED): Critical - replace urgently
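These bands map directly onto a small classification helper (the thresholds come from the tables above; the ANSI color constants and function name are illustrative):

```python
RED, YELLOW, GREEN, RESET = "\033[91m", "\033[93m", "\033[92m", "\033[0m"

def classify_replacement_score(score):
    """Translate a replacement score into the color-coded priority band."""
    if score >= 70:
        return f"{RED}Critical - immediate replacement recommended{RESET}"
    if score >= 50:
        return f"{YELLOW}High priority - plan replacement soon{RESET}"
    if score >= 30:
        return "Medium priority - next upgrade cycle"
    return f"{GREEN}Low priority - healthy{RESET}"
```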
Common Issues & Solutions
"No SMART data available"
- Cause: Missing `smartmontools` or insufficient permissions
- Solution: `apt install smartmontools` and verify sudo access
SSH Timeout Errors
- Cause: Node unreachable or SSH keys not configured
- Solution: Verify connectivity with `ssh -o ConnectTimeout=5 <host> hostname`
Device Path Resolution Failures
- Cause: Non-standard OSD deployment or encryption
- Solution: Enable `--debug` to see device resolution attempts
dm-device Mapping Issues
- Cause: LVM or LUKS encrypted OSDs
- Solution: Script automatically resolves via `lsblk -no pkname`
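A sketch of that resolution step for the local host; the helper name is illustrative, and the script's get_device_path_for_osd() does more (metadata parsing, symlinks, remote execution):

```python
import os
import subprocess

def resolve_physical_device(dev_path):
    """Map a /dev/mapper or dm-* path back to its parent block device."""
    real = os.path.realpath(dev_path)                     # e.g. /dev/mapper/x -> /dev/dm-3
    out = subprocess.run(["lsblk", "-no", "pkname", real],
                         capture_output=True, text=True)
    lines = out.stdout.strip().splitlines()
    parent = lines[0] if lines else ""
    return f"/dev/{parent}" if parent else dev_path       # fall back to the original path
```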
Development Notes
Code Structure
- Single-file design: Easier to execute remotely via `exec()`
- Minimal dependencies: Uses only the Python standard library
- Color-coded output: ANSI escape codes for terminal display
- Debug mode: Comprehensive logging when `--debug` is enabled
Notable Functions
run_command() (ceph_osd_analyzer.py:34-56): Universal command executor with SSH support and JSON parsing
get_device_path_for_osd() (ceph_osd_analyzer.py:84-122): Complex device resolution logic handling metadata, symlinks, and dm-devices
get_smart_data_remote() (ceph_osd_analyzer.py:124-145): Remote SMART data collection with device type detection
parse_smart_health() (ceph_osd_analyzer.py:173-269): SMART attribute parsing with device-class-specific logic
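Tying these together, the device-class dispatch described for parse_smart_health() might look roughly like the sketch below, which assumes smartctl-style JSON input and reuses the illustrative scoring helpers from the Scoring Algorithm section; the real function also reports a fuller list of issues:

```python
def parse_smart_health(smart, device_class):
    """Route SMART data to HDD or NVMe scoring. Returns (health_score, issues)."""
    issues = []
    if device_class == "nvme":
        log = smart.get("nvme_smart_health_information_log", {})
        score = score_nvme_health(log)
        if log.get("media_errors", 0) > 0:
            issues.append(f"{log['media_errors']} media errors")
    else:
        # smartctl -j lists ATA attributes under ata_smart_attributes.table
        attrs = {a["id"]: a["raw"]["value"]
                 for a in smart.get("ata_smart_attributes", {}).get("table", [])}
        score = score_hdd_health(attrs)
        if attrs.get(197, 0) > 0:
            issues.append(f"{attrs[197]} pending sectors")
    return score, issues
```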
Future Enhancement Opportunities
- Parallel data collection: Use threading for faster cluster-wide analysis (see the sketch after this list)
- Historical trending: Track scores over time to predict failures
- JSON output mode: For integration with monitoring systems
- Cost-benefit analysis: Factor in replacement drive costs
- PG rebalance impact: Estimate data movement required
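For the parallel-collection idea above, a concurrent.futures sketch; `collect_osd_data()` is a hypothetical stand-in for the script's per-OSD collection step:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def collect_all(osd_ids, collect_osd_data, max_workers=8):
    """Gather per-OSD data concurrently; SSH round-trips dominate, so threads help."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(collect_osd_data, osd_id): osd_id for osd_id in osd_ids}
        for fut in as_completed(futures):
            osd_id = futures[fut]
            try:
                results[osd_id] = fut.result()
            except Exception:
                results[osd_id] = None  # one unreachable host should not abort the run
    return results
```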
Security Considerations
Permissions Required
- Root access for `smartctl` execution
- SSH access to all OSD hosts
- Ceph admin keyring (read-only sufficient)
Network Requirements
- Script assumes SSH connectivity between nodes
- No outbound internet access required (internal-only tool)
- Hardcoded internal git server URL: http://10.10.10.63:3000
SSH Configuration
- Uses `-o StrictHostKeyChecking=no` for automated execution
- 5-second connection timeout to handle unreachable nodes
- Assumes key-based authentication is configured
Related Infrastructure
Internal Git Server: http://10.10.10.63:3000/LotusGuild/analyzeOSDs
Related Projects:
- hwmonDaemon: Hardware monitoring daemon for continuous health checks
- Other LotusGuild infrastructure automation tools
Maintenance
Version Control
- Maintained in internal git repository
- One-line execution always pulls from the `main` branch
- No formal versioning; the latest commit is production
Testing Checklist
- Test on cluster with mixed HDD/NVMe OSDs
- Verify SSH connectivity to all hosts
- Confirm SMART data retrieval for both device types
- Validate dm-device resolution on encrypted OSDs
- Check output formatting with various terminal widths
- Test `--class` and `--min-size` filtering
Performance Characteristics
Execution Time: ~5-15 seconds per OSD depending on cluster size and SSH latency
Bottlenecks:
- Serial OSD processing (parallelization would help)
- SSH round-trip times for SMART data
- SMART queries can be slow for unresponsive or failing drives
Resource Usage: Minimal CPU/memory, I/O bound on SSH operations
Intended Audience: LotusGuild infrastructure team
Support: Submit issues or pull requests to internal git repository