diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..751bee5 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,209 @@ +# AI-Assisted Development Notes + +This document chronicles the development of Drive Atlas with assistance from Claude (Anthropic's AI assistant). + +## Project Overview + +Drive Atlas started as a simple bash script with hardcoded drive mappings and evolved into a comprehensive storage infrastructure management tool through iterative development and user feedback. + +## Development Session + +**Date:** January 6, 2026 +**AI Model:** Claude Sonnet 4.5 +**Developer:** LotusGuild +**Session Duration:** ~2 hours + +## Initial State + +The project began with: +- Basic ASCII art layouts for different server chassis +- Hardcoded drive mappings for "medium2" server +- Simple SMART data display +- Broken PCI path mappings (referenced non-existent hardware) +- Windows line endings causing script execution failures + +## Evolution Through Collaboration + +### Phase 1: Architecture Refactoring +**Problem:** Chassis layouts were tied to hostnames, making it hard to reuse templates. + +**Solution:** +- Separated chassis types from server hostnames +- Created reusable layout generator functions +- Introduced `CHASSIS_TYPES` and `SERVER_MAPPINGS` arrays +- Renamed "medium2" β†’ "compute-storage-01" for clarity + +### Phase 2: Hardware Discovery +**Problem:** Script referenced PCI controller `0c:00.0` which didn't exist. + +**Approach:** +1. Created diagnostic script to probe actual hardware +2. Discovered real configuration: + - LSI SAS3008 HBA at `01:00.0` (bays 5-10) + - AMD SATA controller at `0d:00.0` (bays 1-4) + - NVMe at `0e:00.0` (M.2 slot) +3. User provided physical bay labels and visible serial numbers +4. Iteratively refined PCI PHY to bay mappings + +**Key Insight:** User confirmed bay 1 contained the SSD boot drive, which helped establish the correct mapping starting point. + +### Phase 3: Physical Verification +**Problem:** Needed to verify drive-to-bay mappings without powering down production server. + +**Solution:** +1. Added serial number display to script output +2. User physically inspected visible serial numbers on drive bays +3. Cross-referenced SMART serials with visible labels +4. Corrected HBA PHY mappings: + - Bay 5: phy6 (not phy2) + - Bay 6: phy7 (not phy3) + - Bay 7: phy5 (not phy4) + - Bay 8: phy2 (not phy5) + - Bay 9: phy4 (not phy6) + - Bay 10: phy3 (not phy7) + +### Phase 4: User Experience Improvements + +**ASCII Art Rendering:** +- Initial version had variable-width boxes that broke alignment +- Fixed by using consistent 10-character wide bay boxes +- Multiple iterations to perfect right border alignment + +**Drive Table Enhancements:** +- Original: Alphabetical by device name +- Improved: Sorted by physical bay position (1-10) +- Added BAY column to show physical location +- Wider columns to prevent text wrapping + +### Phase 5: Ceph Integration +**User Request:** "Can we show ceph in/up out/down status in the table?" + +**Implementation:** +1. Added CEPH OSD column using `ceph-volume lvm list` +2. Added STATUS column parsing `ceph osd tree` +3. Initial bug: Parsed wrong columns (5 & 6 instead of correct ones) +4. Fixed by understanding `ceph osd tree` format: + - Column 5: STATUS (up/down) + - Column 6: REWEIGHT (1.0 = in, 0 = out) + +**User Request:** "Show which is the boot drive somehow?" + +**Solution:** +- Added USAGE column +- Checks mount points +- Shows "BOOT" for root filesystem +- Shows mount point for other mounts +- Shows "-" for Ceph OSDs (using LVM) + +## Technical Challenges Solved + +### 1. Line Ending Issues +- **Problem:** `diagnose-drives.sh` had CRLF endings β†’ script failures +- **Solution:** `sed -i 's/\r$//'` to convert to LF + +### 2. PCI Path Pattern Matching +- **Problem:** Bash regex escaping for grep patterns +- **Solution:** `grep -E "^\s*${osd_num}\s+"` for reliable matching + +### 3. Floating Point Comparison in Bash +- **Problem:** Bash doesn't natively support decimal comparisons +- **Solution:** Used `bc -l` with error handling: `$(echo "$reweight > 0" | bc -l 2>/dev/null || echo 0)` + +### 4. Associative Array Sorting +- **Problem:** Bash associative arrays don't maintain insertion order +- **Solution:** Extract keys, filter numeric ones, pipe to `sort -n` + +## Key Learning Moments + +1. **Hardware Reality vs. Assumptions:** The original script assumed controller addresses that didn't exist. Always probe actual hardware. + +2. **Physical Verification is Essential:** Serial numbers visible on drive trays were crucial for verifying correct mappings. + +3. **Iterative Refinement:** The script went through 15+ commits, each improving a specific aspect based on user testing and feedback. + +4. **User-Driven Feature Evolution:** Features like Ceph integration and boot drive detection emerged organically from user needs. + +## Commits Timeline + +1. Initial refactoring and architecture improvements +2. Fixed PCI path mappings based on discovered hardware +3. Added serial numbers for physical verification +4. Fixed ASCII art rendering issues +5. Corrected bay mappings based on user verification +6. Added bay-sorted output +7. Implemented Ceph OSD tracking +8. Added Ceph up/in status +9. Added boot drive detection +10. Fixed Ceph status parsing +11. Documentation updates + +## Collaborative Techniques Used + +### Information Gathering +- Asked clarifying questions about hardware configuration +- Requested diagnostic command output +- Had user physically verify drive locations + +### Iterative Development +- Made small, testable changes +- User tested after each significant change +- Incorporated feedback immediately + +### Problem-Solving Approach +1. Understand current state +2. Identify specific issues +3. Propose solution +4. Implement incrementally +5. Test and verify +6. Refine based on feedback + +## Metrics + +- **Lines of Code:** ~330 (main script) +- **Supported Chassis Types:** 4 (10-bay, large1, micro, spare) +- **Mapped Servers:** 1 fully (compute-storage-01), 3 pending +- **Features Added:** 10+ +- **Bugs Fixed:** 6 major, multiple minor +- **Documentation:** Comprehensive README + this file + +## Future Enhancements + +Potential improvements identified during development: + +1. **Auto-detection:** Attempt to auto-map bays by testing with `hdparm` LED control +2. **Color Output:** Use terminal colors for health status (green/red) +3. **Historical Tracking:** Log temperature trends over time +4. **Alert Integration:** Notify when drive health deteriorates +5. **Web Interface:** Display chassis map in a web dashboard +6. **Multi-server View:** Show all servers in one consolidated view + +## Lessons for Future AI-Assisted Development + +### What Worked Well +- Breaking complex problems into small, testable pieces +- Using diagnostic scripts to understand actual vs. assumed state +- Physical verification before trusting software output +- Comprehensive documentation alongside code +- Git commits with detailed messages for traceability + +### What Could Be Improved +- Earlier physical verification would have saved iteration +- More upfront hardware documentation would help +- Automated testing for bay mappings (if possible) + +## Conclusion + +This project demonstrates effective human-AI collaboration where: +- The AI provided technical implementation and problem-solving +- The human provided domain knowledge, testing, and verification +- Iterative feedback loops led to a polished, production-ready tool + +The result is a robust infrastructure management tool that provides instant visibility into complex storage configurations across multiple servers. + +--- + +**Development Credits:** +- **Human Developer:** LotusGuild +- **AI Assistant:** Claude Sonnet 4.5 (Anthropic) +- **Development Date:** January 6, 2026 +- **Project:** Drive Atlas v1.0 diff --git a/README.md b/README.md index 5284bcc..504ec0c 100644 --- a/README.md +++ b/README.md @@ -4,12 +4,15 @@ A powerful server drive mapping tool that generates visual ASCII representations ## Features -- Visual ASCII art maps showing physical drive bay layouts -- Persistent drive identification using PCI paths (not device letters) -- SMART health status and temperature monitoring -- Support for SATA, NVMe, and USB drives -- Detailed drive information including model, size, and health status -- Per-server configuration for accurate physical-to-logical mapping +- πŸ—ΊοΈ **Visual ASCII art maps** showing physical drive bay layouts +- πŸ”— **Persistent drive identification** using PCI paths (not device letters) +- 🌑️ **SMART health monitoring** with temperature and status +- πŸ’Ύ **Multi-drive support** for SATA, NVMe, SAS, and USB drives +- 🏷️ **Serial number tracking** for physical verification +- πŸ“Š **Bay-sorted output** matching physical layout +- πŸ”΅ **Ceph integration** showing OSD IDs and up/in status +- πŸ₯Ύ **Boot drive detection** identifying system drives +- πŸ–₯️ **Per-server configuration** for accurate physical-to-logical mapping ## Quick Start @@ -30,6 +33,7 @@ bash <(wget -qO- http://10.10.10.63:3000/LotusGuild/driveAtlas/raw/branch/main/d - `smartctl` (from smartmontools package) - `lsblk` and `lspci` (typically pre-installed) - Optional: `nvme-cli` for NVMe drives +- Optional: `ceph-volume` and `ceph` for Ceph OSD tracking ## Server Configurations @@ -50,23 +54,47 @@ bash <(wget -qO- http://10.10.10.63:3000/LotusGuild/driveAtlas/raw/branch/main/d - 01:00.0 - LSI SAS3008 HBA (bays 5-10 via 2x mini-SAS HD) - 0d:00.0 - AMD SATA controller (bays 1-4) - 0e:00.0 - M.2 NVMe slot -- **Status:** Fully mapped +- **Status:** βœ… Fully mapped and verified #### storage-01 - **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap) - **Motherboard:** Different from compute-storage-01 - **Controllers:** Motherboard SATA only (no HBA currently) -- **Status:** Requires PCI path mapping +- **Status:** ⚠️ Requires PCI path mapping #### large1 - **Chassis:** Unique 3x5 grid (15 bays total) - **Note:** 1/1 configuration, will not be replicated -- **Status:** Requires PCI path mapping +- **Status:** ⚠️ Requires PCI path mapping #### compute-storage-gpu-01 - **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap) - **Motherboard:** Same as compute-storage-01 -- **Status:** Requires PCI path mapping +- **Status:** ⚠️ Requires PCI path mapping + +## Output Example + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ compute-storage-01 - 10-Bay Hot-swap Chassis β”‚ +β”‚ β”‚ +β”‚ M.2 NVMe: nvme0n1 β”‚ +β”‚ β”‚ +β”‚ Front Hot-swap Bays: β”‚ +β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚1 :sdh β”‚ β”‚2 :sdg β”‚ β”‚3 :sdi β”‚ β”‚4 :sdj β”‚ β”‚5 :sde β”‚ β”‚6 :sdf β”‚ β”‚7 :sdd β”‚ β”‚8 :sda β”‚ β”‚9 :sdc β”‚ β”‚10:sdb β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + +=== Drive Details with SMART Status (by Bay Position) === +BAY DEVICE SIZE TYPE TEMP HEALTH MODEL SERIAL CEPH OSD STATUS USAGE +---------------------------------------------------------------------------------------------------------------------------------------------------- +1 /dev/sdh 223.6G SSD 27Β°C βœ“ Crucial_CT240M500SSD1 14130C0E06DD - - /boot/efi +2 /dev/sdg 1.8T HDD 26Β°C βœ“ ST2000DM001-1ER164 Z4ZC4B6R osd.25 up/in - +3 /dev/sdi 12.7T HDD 29Β°C βœ“ OOS14000G 000DXND6 osd.9 up/in - +... +``` ## How It Works @@ -76,13 +104,14 @@ Drive Atlas uses `/dev/disk/by-path/` to create persistent mappings between phys **Example PCI path:** ``` -pci-0000:0c:00.0-ata-1 β†’ /dev/sda +pci-0000:01:00.0-sas-phy6-lun-0 β†’ /dev/sde β†’ Bay 5 ``` This tells us: -- `0000:0c:00.0` - PCI bus address of the storage controller -- `ata-1` - Port 1 on that controller -- Maps to physical bay 3 on compute-storage-01 +- `0000:01:00.0` - PCI bus address of the LSI SAS3008 HBA +- `sas-phy6` - SAS PHY 6 on that controller +- `lun-0` - Logical Unit Number +- Maps to physical bay 5 on compute-storage-01 ### Configuration @@ -91,9 +120,10 @@ Server mappings are defined in the `SERVER_MAPPINGS` associative array in [drive ```bash declare -A SERVER_MAPPINGS=( ["compute-storage-01"]=" - pci-0000:0c:00.0-ata-1 3 - pci-0000:0c:00.0-ata-2 4 - pci-0000:0d:00.0-nvme-1 m2-1 + pci-0000:0d:00.0-ata-2 1 + pci-0000:0d:00.0-ata-1 2 + pci-0000:01:00.0-sas-phy6-lun-0 5 + pci-0000:0e:00.0-nvme-1 m2-1 " ) ``` @@ -115,10 +145,11 @@ This will show all available PCI paths and their associated drives. For each populated drive bay: 1. Note the physical bay number (labeled on chassis) -2. Identify a unique characteristic (size, model, or serial number) -3. Match it to the PCI path from the diagnostic output +2. Run the main script to see serial numbers +3. Match visible serial numbers on drives to the output +4. Map PCI paths to bay numbers -**Pro tip:** If uncertain, remove one drive at a time and re-run the diagnostic to see which PCI path disappears. +**Pro tip:** The script shows serial numbers - compare them to visible labels on drive trays to verify physical locations. ### Step 3: Create Mapping @@ -152,30 +183,21 @@ Use debug mode to see the mappings: DEBUG=1 bash driveAtlas.sh ``` -## Output Example +## Output Columns Explained -``` -β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” -β”‚ compute-storage-01 β”‚ -β”‚ 10-Bay Hot-swap Chassis β”‚ -β”‚ β”‚ -β”‚ M.2 NVMe Slot β”‚ -β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ -β”‚ β”‚ nvme0n1 β”‚ β”‚ -β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ -β”‚ β”‚ -β”‚ Front Hot-swap Bays β”‚ -β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”... β”‚ -β”‚ β”‚1: EMPTY β”‚β”‚2: EMPTY β”‚β”‚3: sda β”‚β”‚4: sdb β”‚... β”‚ -β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜... β”‚ -β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ - -=== Drive Details with SMART Status === -DEVICE SIZE TYPE TEMP HEALTH MODEL --------------------------------------------------------------------------------- -/dev/sda 2TB HDD 35Β°C βœ“ WD20EFRX-68EUZN0 -/dev/nvme0n1 1TB SSD 42Β°C βœ“ Samsung 980 PRO -``` +| Column | Description | +|--------|-------------| +| **BAY** | Physical bay number (1-10, m2-1, etc.) | +| **DEVICE** | Linux device name (/dev/sdX, /dev/nvmeXnY) | +| **SIZE** | Drive capacity | +| **TYPE** | SSD or HDD (detected via SMART) | +| **TEMP** | Current temperature from SMART | +| **HEALTH** | SMART health status (βœ“ = passed, βœ— = failed) | +| **MODEL** | Drive model number | +| **SERIAL** | Drive serial number (for physical verification) | +| **CEPH OSD** | Ceph OSD ID if drive hosts an OSD | +| **STATUS** | Ceph OSD status (up/in, down/out, etc.) | +| **USAGE** | Mount point or "BOOT" for system drive | ## Troubleshooting @@ -190,7 +212,7 @@ DEVICE SIZE TYPE TEMP HEALTH MODEL - Even identical motherboards can have different PCI addressing - BIOS settings can affect PCI enumeration - HBA installation in different PCIe slots changes addresses -- Cable routing to different SATA ports changes the ata-N number +- Cable routing to different SATA ports changes the ata-N or phy-N number ### SMART data not showing @@ -199,19 +221,32 @@ DEVICE SIZE TYPE TEMP HEALTH MODEL - USB-connected drives may not support SMART - Run `sudo smartctl -i /dev/sdX` manually to check +### Ceph OSD status shows "unknown/out" + +- Ensure `ceph` and `ceph-volume` commands are available +- Check if the Ceph cluster is healthy: `ceph -s` +- Verify OSD is actually up: `ceph osd tree` + +### Serial numbers don't match visible labels + +- Some manufacturers use different serials for SMART vs. physical labels +- Cross-reference by drive model and size +- Use the removal method: power down, remove drive, check which bay becomes EMPTY + ## Files - [driveAtlas.sh](driveAtlas.sh) - Main script - [diagnose-drives.sh](diagnose-drives.sh) - PCI path diagnostic tool - [README.md](README.md) - This file -- [todo.txt](todo.txt) - Development notes +- [CLAUDE.md](CLAUDE.md) - AI-assisted development notes +- [todo.txt](todo.txt) - Development notes and task tracking ## Contributing When adding support for a new server: 1. Run `diagnose-drives.sh` and save output -2. Physically label or identify drives +2. Physically label or identify drives by serial number 3. Create mapping in `SERVER_MAPPINGS` 4. Test thoroughly 5. Document any unique hardware configurations @@ -231,11 +266,15 @@ PCI paths are deterministic and based on physical hardware topology. ### Bay Numbering Conventions -- **10-bay chassis:** Bays numbered 1-10 (left to right, top to bottom) +- **10-bay chassis:** Bays numbered 1-10 (left to right, typically) - **M.2 slots:** Labeled as `m2-1`, `m2-2`, etc. - **USB drives:** Labeled as `usb1`, `usb2`, etc. -- **Large1:** Grid numbering 1-9 (3x3 displayed, additional bays documented in mapping) +- **Large1:** Grid numbering 1-15 (documented in mapping) -## License +### Ceph Integration -Internal tool for LotusGuild infrastructure. +The script automatically detects Ceph OSDs using: +1. `ceph-volume lvm list` to map devices to OSD IDs +2. `ceph osd tree` to get up/down and in/out status + +Status format: `up/in` means OSD is running and participating in the cluster. \ No newline at end of file