Comprehensive documentation update and AI development notes
Updated README.md: - Added feature list with emojis for visual clarity - Documented all output columns with descriptions - Added Ceph integration details - Included troubleshooting for common issues - Updated example output with current format - Added status indicators (✅ ⚠️) for server mapping status Created CLAUDE.md: - Documented AI-assisted development process - Chronicled evolution from basic script to comprehensive tool - Detailed technical challenges and solutions - Listed all phases of development - Provided metrics and future enhancement ideas - Lessons learned for future AI collaboration This documents the complete journey from broken PCI paths to a production-ready storage infrastructure management tool. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
209
CLAUDE.md
Normal file
209
CLAUDE.md
Normal file
@@ -0,0 +1,209 @@
|
||||
# AI-Assisted Development Notes
|
||||
|
||||
This document chronicles the development of Drive Atlas with assistance from Claude (Anthropic's AI assistant).
|
||||
|
||||
## Project Overview
|
||||
|
||||
Drive Atlas started as a simple bash script with hardcoded drive mappings and evolved into a comprehensive storage infrastructure management tool through iterative development and user feedback.
|
||||
|
||||
## Development Session
|
||||
|
||||
**Date:** January 6, 2026
|
||||
**AI Model:** Claude Sonnet 4.5
|
||||
**Developer:** LotusGuild
|
||||
**Session Duration:** ~2 hours
|
||||
|
||||
## Initial State
|
||||
|
||||
The project began with:
|
||||
- Basic ASCII art layouts for different server chassis
|
||||
- Hardcoded drive mappings for "medium2" server
|
||||
- Simple SMART data display
|
||||
- Broken PCI path mappings (referenced non-existent hardware)
|
||||
- Windows line endings causing script execution failures
|
||||
|
||||
## Evolution Through Collaboration
|
||||
|
||||
### Phase 1: Architecture Refactoring
|
||||
**Problem:** Chassis layouts were tied to hostnames, making it hard to reuse templates.
|
||||
|
||||
**Solution:**
|
||||
- Separated chassis types from server hostnames
|
||||
- Created reusable layout generator functions
|
||||
- Introduced `CHASSIS_TYPES` and `SERVER_MAPPINGS` arrays
|
||||
- Renamed "medium2" → "compute-storage-01" for clarity
|
||||
|
||||
### Phase 2: Hardware Discovery
|
||||
**Problem:** Script referenced PCI controller `0c:00.0` which didn't exist.
|
||||
|
||||
**Approach:**
|
||||
1. Created diagnostic script to probe actual hardware
|
||||
2. Discovered real configuration:
|
||||
- LSI SAS3008 HBA at `01:00.0` (bays 5-10)
|
||||
- AMD SATA controller at `0d:00.0` (bays 1-4)
|
||||
- NVMe at `0e:00.0` (M.2 slot)
|
||||
3. User provided physical bay labels and visible serial numbers
|
||||
4. Iteratively refined PCI PHY to bay mappings
|
||||
|
||||
**Key Insight:** User confirmed bay 1 contained the SSD boot drive, which helped establish the correct mapping starting point.
|
||||
|
||||
### Phase 3: Physical Verification
|
||||
**Problem:** Needed to verify drive-to-bay mappings without powering down production server.
|
||||
|
||||
**Solution:**
|
||||
1. Added serial number display to script output
|
||||
2. User physically inspected visible serial numbers on drive bays
|
||||
3. Cross-referenced SMART serials with visible labels
|
||||
4. Corrected HBA PHY mappings:
|
||||
- Bay 5: phy6 (not phy2)
|
||||
- Bay 6: phy7 (not phy3)
|
||||
- Bay 7: phy5 (not phy4)
|
||||
- Bay 8: phy2 (not phy5)
|
||||
- Bay 9: phy4 (not phy6)
|
||||
- Bay 10: phy3 (not phy7)
|
||||
|
||||
### Phase 4: User Experience Improvements
|
||||
|
||||
**ASCII Art Rendering:**
|
||||
- Initial version had variable-width boxes that broke alignment
|
||||
- Fixed by using consistent 10-character wide bay boxes
|
||||
- Multiple iterations to perfect right border alignment
|
||||
|
||||
**Drive Table Enhancements:**
|
||||
- Original: Alphabetical by device name
|
||||
- Improved: Sorted by physical bay position (1-10)
|
||||
- Added BAY column to show physical location
|
||||
- Wider columns to prevent text wrapping
|
||||
|
||||
### Phase 5: Ceph Integration
|
||||
**User Request:** "Can we show ceph in/up out/down status in the table?"
|
||||
|
||||
**Implementation:**
|
||||
1. Added CEPH OSD column using `ceph-volume lvm list`
|
||||
2. Added STATUS column parsing `ceph osd tree`
|
||||
3. Initial bug: Parsed wrong columns (5 & 6 instead of correct ones)
|
||||
4. Fixed by understanding `ceph osd tree` format:
|
||||
- Column 5: STATUS (up/down)
|
||||
- Column 6: REWEIGHT (1.0 = in, 0 = out)
|
||||
|
||||
**User Request:** "Show which is the boot drive somehow?"
|
||||
|
||||
**Solution:**
|
||||
- Added USAGE column
|
||||
- Checks mount points
|
||||
- Shows "BOOT" for root filesystem
|
||||
- Shows mount point for other mounts
|
||||
- Shows "-" for Ceph OSDs (using LVM)
|
||||
|
||||
## Technical Challenges Solved
|
||||
|
||||
### 1. Line Ending Issues
|
||||
- **Problem:** `diagnose-drives.sh` had CRLF endings → script failures
|
||||
- **Solution:** `sed -i 's/\r$//'` to convert to LF
|
||||
|
||||
### 2. PCI Path Pattern Matching
|
||||
- **Problem:** Bash regex escaping for grep patterns
|
||||
- **Solution:** `grep -E "^\s*${osd_num}\s+"` for reliable matching
|
||||
|
||||
### 3. Floating Point Comparison in Bash
|
||||
- **Problem:** Bash doesn't natively support decimal comparisons
|
||||
- **Solution:** Used `bc -l` with error handling: `$(echo "$reweight > 0" | bc -l 2>/dev/null || echo 0)`
|
||||
|
||||
### 4. Associative Array Sorting
|
||||
- **Problem:** Bash associative arrays don't maintain insertion order
|
||||
- **Solution:** Extract keys, filter numeric ones, pipe to `sort -n`
|
||||
|
||||
## Key Learning Moments
|
||||
|
||||
1. **Hardware Reality vs. Assumptions:** The original script assumed controller addresses that didn't exist. Always probe actual hardware.
|
||||
|
||||
2. **Physical Verification is Essential:** Serial numbers visible on drive trays were crucial for verifying correct mappings.
|
||||
|
||||
3. **Iterative Refinement:** The script went through 15+ commits, each improving a specific aspect based on user testing and feedback.
|
||||
|
||||
4. **User-Driven Feature Evolution:** Features like Ceph integration and boot drive detection emerged organically from user needs.
|
||||
|
||||
## Commits Timeline
|
||||
|
||||
1. Initial refactoring and architecture improvements
|
||||
2. Fixed PCI path mappings based on discovered hardware
|
||||
3. Added serial numbers for physical verification
|
||||
4. Fixed ASCII art rendering issues
|
||||
5. Corrected bay mappings based on user verification
|
||||
6. Added bay-sorted output
|
||||
7. Implemented Ceph OSD tracking
|
||||
8. Added Ceph up/in status
|
||||
9. Added boot drive detection
|
||||
10. Fixed Ceph status parsing
|
||||
11. Documentation updates
|
||||
|
||||
## Collaborative Techniques Used
|
||||
|
||||
### Information Gathering
|
||||
- Asked clarifying questions about hardware configuration
|
||||
- Requested diagnostic command output
|
||||
- Had user physically verify drive locations
|
||||
|
||||
### Iterative Development
|
||||
- Made small, testable changes
|
||||
- User tested after each significant change
|
||||
- Incorporated feedback immediately
|
||||
|
||||
### Problem-Solving Approach
|
||||
1. Understand current state
|
||||
2. Identify specific issues
|
||||
3. Propose solution
|
||||
4. Implement incrementally
|
||||
5. Test and verify
|
||||
6. Refine based on feedback
|
||||
|
||||
## Metrics
|
||||
|
||||
- **Lines of Code:** ~330 (main script)
|
||||
- **Supported Chassis Types:** 4 (10-bay, large1, micro, spare)
|
||||
- **Mapped Servers:** 1 fully (compute-storage-01), 3 pending
|
||||
- **Features Added:** 10+
|
||||
- **Bugs Fixed:** 6 major, multiple minor
|
||||
- **Documentation:** Comprehensive README + this file
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
Potential improvements identified during development:
|
||||
|
||||
1. **Auto-detection:** Attempt to auto-map bays by testing with `hdparm` LED control
|
||||
2. **Color Output:** Use terminal colors for health status (green/red)
|
||||
3. **Historical Tracking:** Log temperature trends over time
|
||||
4. **Alert Integration:** Notify when drive health deteriorates
|
||||
5. **Web Interface:** Display chassis map in a web dashboard
|
||||
6. **Multi-server View:** Show all servers in one consolidated view
|
||||
|
||||
## Lessons for Future AI-Assisted Development
|
||||
|
||||
### What Worked Well
|
||||
- Breaking complex problems into small, testable pieces
|
||||
- Using diagnostic scripts to understand actual vs. assumed state
|
||||
- Physical verification before trusting software output
|
||||
- Comprehensive documentation alongside code
|
||||
- Git commits with detailed messages for traceability
|
||||
|
||||
### What Could Be Improved
|
||||
- Earlier physical verification would have saved iteration
|
||||
- More upfront hardware documentation would help
|
||||
- Automated testing for bay mappings (if possible)
|
||||
|
||||
## Conclusion
|
||||
|
||||
This project demonstrates effective human-AI collaboration where:
|
||||
- The AI provided technical implementation and problem-solving
|
||||
- The human provided domain knowledge, testing, and verification
|
||||
- Iterative feedback loops led to a polished, production-ready tool
|
||||
|
||||
The result is a robust infrastructure management tool that provides instant visibility into complex storage configurations across multiple servers.
|
||||
|
||||
---
|
||||
|
||||
**Development Credits:**
|
||||
- **Human Developer:** LotusGuild
|
||||
- **AI Assistant:** Claude Sonnet 4.5 (Anthropic)
|
||||
- **Development Date:** January 6, 2026
|
||||
- **Project:** Drive Atlas v1.0
|
||||
139
README.md
139
README.md
@@ -4,12 +4,15 @@ A powerful server drive mapping tool that generates visual ASCII representations
|
||||
|
||||
## Features
|
||||
|
||||
- Visual ASCII art maps showing physical drive bay layouts
|
||||
- Persistent drive identification using PCI paths (not device letters)
|
||||
- SMART health status and temperature monitoring
|
||||
- Support for SATA, NVMe, and USB drives
|
||||
- Detailed drive information including model, size, and health status
|
||||
- Per-server configuration for accurate physical-to-logical mapping
|
||||
- 🗺️ **Visual ASCII art maps** showing physical drive bay layouts
|
||||
- 🔗 **Persistent drive identification** using PCI paths (not device letters)
|
||||
- 🌡️ **SMART health monitoring** with temperature and status
|
||||
- 💾 **Multi-drive support** for SATA, NVMe, SAS, and USB drives
|
||||
- 🏷️ **Serial number tracking** for physical verification
|
||||
- 📊 **Bay-sorted output** matching physical layout
|
||||
- 🔵 **Ceph integration** showing OSD IDs and up/in status
|
||||
- 🥾 **Boot drive detection** identifying system drives
|
||||
- 🖥️ **Per-server configuration** for accurate physical-to-logical mapping
|
||||
|
||||
## Quick Start
|
||||
|
||||
@@ -30,6 +33,7 @@ bash <(wget -qO- http://10.10.10.63:3000/LotusGuild/driveAtlas/raw/branch/main/d
|
||||
- `smartctl` (from smartmontools package)
|
||||
- `lsblk` and `lspci` (typically pre-installed)
|
||||
- Optional: `nvme-cli` for NVMe drives
|
||||
- Optional: `ceph-volume` and `ceph` for Ceph OSD tracking
|
||||
|
||||
## Server Configurations
|
||||
|
||||
@@ -50,23 +54,47 @@ bash <(wget -qO- http://10.10.10.63:3000/LotusGuild/driveAtlas/raw/branch/main/d
|
||||
- 01:00.0 - LSI SAS3008 HBA (bays 5-10 via 2x mini-SAS HD)
|
||||
- 0d:00.0 - AMD SATA controller (bays 1-4)
|
||||
- 0e:00.0 - M.2 NVMe slot
|
||||
- **Status:** Fully mapped
|
||||
- **Status:** ✅ Fully mapped and verified
|
||||
|
||||
#### storage-01
|
||||
- **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap)
|
||||
- **Motherboard:** Different from compute-storage-01
|
||||
- **Controllers:** Motherboard SATA only (no HBA currently)
|
||||
- **Status:** Requires PCI path mapping
|
||||
- **Status:** ⚠️ Requires PCI path mapping
|
||||
|
||||
#### large1
|
||||
- **Chassis:** Unique 3x5 grid (15 bays total)
|
||||
- **Note:** 1/1 configuration, will not be replicated
|
||||
- **Status:** Requires PCI path mapping
|
||||
- **Status:** ⚠️ Requires PCI path mapping
|
||||
|
||||
#### compute-storage-gpu-01
|
||||
- **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap)
|
||||
- **Motherboard:** Same as compute-storage-01
|
||||
- **Status:** Requires PCI path mapping
|
||||
- **Status:** ⚠️ Requires PCI path mapping
|
||||
|
||||
## Output Example
|
||||
|
||||
```
|
||||
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
|
||||
│ compute-storage-01 - 10-Bay Hot-swap Chassis │
|
||||
│ │
|
||||
│ M.2 NVMe: nvme0n1 │
|
||||
│ │
|
||||
│ Front Hot-swap Bays: │
|
||||
│ │
|
||||
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
|
||||
│ │1 :sdh │ │2 :sdg │ │3 :sdi │ │4 :sdj │ │5 :sde │ │6 :sdf │ │7 :sdd │ │8 :sda │ │9 :sdc │ │10:sdb │ │
|
||||
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
|
||||
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
|
||||
|
||||
=== Drive Details with SMART Status (by Bay Position) ===
|
||||
BAY DEVICE SIZE TYPE TEMP HEALTH MODEL SERIAL CEPH OSD STATUS USAGE
|
||||
----------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
1 /dev/sdh 223.6G SSD 27°C ✓ Crucial_CT240M500SSD1 14130C0E06DD - - /boot/efi
|
||||
2 /dev/sdg 1.8T HDD 26°C ✓ ST2000DM001-1ER164 Z4ZC4B6R osd.25 up/in -
|
||||
3 /dev/sdi 12.7T HDD 29°C ✓ OOS14000G 000DXND6 osd.9 up/in -
|
||||
...
|
||||
```
|
||||
|
||||
## How It Works
|
||||
|
||||
@@ -76,13 +104,14 @@ Drive Atlas uses `/dev/disk/by-path/` to create persistent mappings between phys
|
||||
|
||||
**Example PCI path:**
|
||||
```
|
||||
pci-0000:0c:00.0-ata-1 → /dev/sda
|
||||
pci-0000:01:00.0-sas-phy6-lun-0 → /dev/sde → Bay 5
|
||||
```
|
||||
|
||||
This tells us:
|
||||
- `0000:0c:00.0` - PCI bus address of the storage controller
|
||||
- `ata-1` - Port 1 on that controller
|
||||
- Maps to physical bay 3 on compute-storage-01
|
||||
- `0000:01:00.0` - PCI bus address of the LSI SAS3008 HBA
|
||||
- `sas-phy6` - SAS PHY 6 on that controller
|
||||
- `lun-0` - Logical Unit Number
|
||||
- Maps to physical bay 5 on compute-storage-01
|
||||
|
||||
### Configuration
|
||||
|
||||
@@ -91,9 +120,10 @@ Server mappings are defined in the `SERVER_MAPPINGS` associative array in [drive
|
||||
```bash
|
||||
declare -A SERVER_MAPPINGS=(
|
||||
["compute-storage-01"]="
|
||||
pci-0000:0c:00.0-ata-1 3
|
||||
pci-0000:0c:00.0-ata-2 4
|
||||
pci-0000:0d:00.0-nvme-1 m2-1
|
||||
pci-0000:0d:00.0-ata-2 1
|
||||
pci-0000:0d:00.0-ata-1 2
|
||||
pci-0000:01:00.0-sas-phy6-lun-0 5
|
||||
pci-0000:0e:00.0-nvme-1 m2-1
|
||||
"
|
||||
)
|
||||
```
|
||||
@@ -115,10 +145,11 @@ This will show all available PCI paths and their associated drives.
|
||||
For each populated drive bay:
|
||||
|
||||
1. Note the physical bay number (labeled on chassis)
|
||||
2. Identify a unique characteristic (size, model, or serial number)
|
||||
3. Match it to the PCI path from the diagnostic output
|
||||
2. Run the main script to see serial numbers
|
||||
3. Match visible serial numbers on drives to the output
|
||||
4. Map PCI paths to bay numbers
|
||||
|
||||
**Pro tip:** If uncertain, remove one drive at a time and re-run the diagnostic to see which PCI path disappears.
|
||||
**Pro tip:** The script shows serial numbers - compare them to visible labels on drive trays to verify physical locations.
|
||||
|
||||
### Step 3: Create Mapping
|
||||
|
||||
@@ -152,30 +183,21 @@ Use debug mode to see the mappings:
|
||||
DEBUG=1 bash driveAtlas.sh
|
||||
```
|
||||
|
||||
## Output Example
|
||||
## Output Columns Explained
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────────────┐
|
||||
│ compute-storage-01 │
|
||||
│ 10-Bay Hot-swap Chassis │
|
||||
│ │
|
||||
│ M.2 NVMe Slot │
|
||||
│ ┌──────────┐ │
|
||||
│ │ nvme0n1 │ │
|
||||
│ └──────────┘ │
|
||||
│ │
|
||||
│ Front Hot-swap Bays │
|
||||
│ ┌──────────┐┌──────────┐┌──────────┐┌──────────┐... │
|
||||
│ │1: EMPTY ││2: EMPTY ││3: sda ││4: sdb │... │
|
||||
│ └──────────┘└──────────┘└──────────┘└──────────┘... │
|
||||
└──────────────────────────────────────────────────────────────┘
|
||||
|
||||
=== Drive Details with SMART Status ===
|
||||
DEVICE SIZE TYPE TEMP HEALTH MODEL
|
||||
--------------------------------------------------------------------------------
|
||||
/dev/sda 2TB HDD 35°C ✓ WD20EFRX-68EUZN0
|
||||
/dev/nvme0n1 1TB SSD 42°C ✓ Samsung 980 PRO
|
||||
```
|
||||
| Column | Description |
|
||||
|--------|-------------|
|
||||
| **BAY** | Physical bay number (1-10, m2-1, etc.) |
|
||||
| **DEVICE** | Linux device name (/dev/sdX, /dev/nvmeXnY) |
|
||||
| **SIZE** | Drive capacity |
|
||||
| **TYPE** | SSD or HDD (detected via SMART) |
|
||||
| **TEMP** | Current temperature from SMART |
|
||||
| **HEALTH** | SMART health status (✓ = passed, ✗ = failed) |
|
||||
| **MODEL** | Drive model number |
|
||||
| **SERIAL** | Drive serial number (for physical verification) |
|
||||
| **CEPH OSD** | Ceph OSD ID if drive hosts an OSD |
|
||||
| **STATUS** | Ceph OSD status (up/in, down/out, etc.) |
|
||||
| **USAGE** | Mount point or "BOOT" for system drive |
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
@@ -190,7 +212,7 @@ DEVICE SIZE TYPE TEMP HEALTH MODEL
|
||||
- Even identical motherboards can have different PCI addressing
|
||||
- BIOS settings can affect PCI enumeration
|
||||
- HBA installation in different PCIe slots changes addresses
|
||||
- Cable routing to different SATA ports changes the ata-N number
|
||||
- Cable routing to different SATA ports changes the ata-N or phy-N number
|
||||
|
||||
### SMART data not showing
|
||||
|
||||
@@ -199,19 +221,32 @@ DEVICE SIZE TYPE TEMP HEALTH MODEL
|
||||
- USB-connected drives may not support SMART
|
||||
- Run `sudo smartctl -i /dev/sdX` manually to check
|
||||
|
||||
### Ceph OSD status shows "unknown/out"
|
||||
|
||||
- Ensure `ceph` and `ceph-volume` commands are available
|
||||
- Check if the Ceph cluster is healthy: `ceph -s`
|
||||
- Verify OSD is actually up: `ceph osd tree`
|
||||
|
||||
### Serial numbers don't match visible labels
|
||||
|
||||
- Some manufacturers use different serials for SMART vs. physical labels
|
||||
- Cross-reference by drive model and size
|
||||
- Use the removal method: power down, remove drive, check which bay becomes EMPTY
|
||||
|
||||
## Files
|
||||
|
||||
- [driveAtlas.sh](driveAtlas.sh) - Main script
|
||||
- [diagnose-drives.sh](diagnose-drives.sh) - PCI path diagnostic tool
|
||||
- [README.md](README.md) - This file
|
||||
- [todo.txt](todo.txt) - Development notes
|
||||
- [CLAUDE.md](CLAUDE.md) - AI-assisted development notes
|
||||
- [todo.txt](todo.txt) - Development notes and task tracking
|
||||
|
||||
## Contributing
|
||||
|
||||
When adding support for a new server:
|
||||
|
||||
1. Run `diagnose-drives.sh` and save output
|
||||
2. Physically label or identify drives
|
||||
2. Physically label or identify drives by serial number
|
||||
3. Create mapping in `SERVER_MAPPINGS`
|
||||
4. Test thoroughly
|
||||
5. Document any unique hardware configurations
|
||||
@@ -231,11 +266,15 @@ PCI paths are deterministic and based on physical hardware topology.
|
||||
|
||||
### Bay Numbering Conventions
|
||||
|
||||
- **10-bay chassis:** Bays numbered 1-10 (left to right, top to bottom)
|
||||
- **10-bay chassis:** Bays numbered 1-10 (left to right, typically)
|
||||
- **M.2 slots:** Labeled as `m2-1`, `m2-2`, etc.
|
||||
- **USB drives:** Labeled as `usb1`, `usb2`, etc.
|
||||
- **Large1:** Grid numbering 1-9 (3x3 displayed, additional bays documented in mapping)
|
||||
- **Large1:** Grid numbering 1-15 (documented in mapping)
|
||||
|
||||
## License
|
||||
### Ceph Integration
|
||||
|
||||
Internal tool for LotusGuild infrastructure.
|
||||
The script automatically detects Ceph OSDs using:
|
||||
1. `ceph-volume lvm list` to map devices to OSD IDs
|
||||
2. `ceph osd tree` to get up/down and in/out status
|
||||
|
||||
Status format: `up/in` means OSD is running and participating in the cluster.
|
||||
Reference in New Issue
Block a user