Comprehensive documentation update and AI development notes

Updated README.md:
- Added feature list with emojis for visual clarity
- Documented all output columns with descriptions
- Added Ceph integration details
- Included troubleshooting for common issues
- Updated example output with current format
- Added status indicators (⚠️) for server mapping status

Created CLAUDE.md:
- Documented AI-assisted development process
- Chronicled evolution from basic script to comprehensive tool
- Detailed technical challenges and solutions
- Listed all phases of development
- Provided metrics and future enhancement ideas
- Recorded lessons learned for future AI collaboration

This documents the complete journey from broken PCI paths to a
production-ready storage infrastructure management tool.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 16:34:22 -05:00
parent 418d4d4170
commit 40ab528f40
2 changed files with 298 additions and 50 deletions

README.md

@@ -4,12 +4,15 @@ A powerful server drive mapping tool that generates visual ASCII representations
## Features
- Visual ASCII art maps showing physical drive bay layouts
- Persistent drive identification using PCI paths (not device letters)
- SMART health status and temperature monitoring
- Support for SATA, NVMe, and USB drives
- Detailed drive information including model, size, and health status
- Per-server configuration for accurate physical-to-logical mapping
- 🗺️ **Visual ASCII art maps** showing physical drive bay layouts
- 🔗 **Persistent drive identification** using PCI paths (not device letters)
- 🌡️ **SMART health monitoring** with temperature and status
- 💾 **Multi-drive support** for SATA, NVMe, SAS, and USB drives
- 🏷️ **Serial number tracking** for physical verification
- 📊 **Bay-sorted output** matching physical layout
- 🔵 **Ceph integration** showing OSD IDs and up/in status
- 🥾 **Boot drive detection** identifying system drives
- 🖥️ **Per-server configuration** for accurate physical-to-logical mapping
## Quick Start
@@ -30,6 +33,7 @@ bash <(wget -qO- http://10.10.10.63:3000/LotusGuild/driveAtlas/raw/branch/main/d
- `smartctl` (from smartmontools package)
- `lsblk` and `lspci` (typically pre-installed)
- Optional: `nvme-cli` for NVMe drives
- Optional: `ceph-volume` and `ceph` for Ceph OSD tracking
## Server Configurations
@@ -50,23 +54,47 @@ bash <(wget -qO- http://10.10.10.63:3000/LotusGuild/driveAtlas/raw/branch/main/d
- 01:00.0 - LSI SAS3008 HBA (bays 5-10 via 2x mini-SAS HD)
- 0d:00.0 - AMD SATA controller (bays 1-4)
- 0e:00.0 - M.2 NVMe slot
- **Status:** Fully mapped
- **Status:** Fully mapped and verified
#### storage-01
- **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap)
- **Motherboard:** Different from compute-storage-01
- **Controllers:** Motherboard SATA only (no HBA currently)
- **Status:** Requires PCI path mapping
- **Status:** ⚠️ Requires PCI path mapping
#### large1
- **Chassis:** Unique 3x5 grid (15 bays total)
- **Note:** One-of-a-kind configuration; will not be replicated
- **Status:** Requires PCI path mapping
- **Status:** ⚠️ Requires PCI path mapping
#### compute-storage-gpu-01
- **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap)
- **Motherboard:** Same as compute-storage-01
- **Status:** Requires PCI path mapping
- **Status:** ⚠️ Requires PCI path mapping
## Output Example
```
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ compute-storage-01 - 10-Bay Hot-swap Chassis │
│ │
│ M.2 NVMe: nvme0n1 │
│ │
│ Front Hot-swap Bays: │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │1 :sdh │ │2 :sdg │ │3 :sdi │ │4 :sdj │ │5 :sde │ │6 :sdf │ │7 :sdd │ │8 :sda │ │9 :sdc │ │10:sdb │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
=== Drive Details with SMART Status (by Bay Position) ===
BAY DEVICE SIZE TYPE TEMP HEALTH MODEL SERIAL CEPH OSD STATUS USAGE
----------------------------------------------------------------------------------------------------------------------------------------------------
1 /dev/sdh 223.6G SSD 27°C ✓ Crucial_CT240M500SSD1 14130C0E06DD - - /boot/efi
2 /dev/sdg 1.8T HDD 26°C ✓ ST2000DM001-1ER164 Z4ZC4B6R osd.25 up/in -
3 /dev/sdi 12.7T HDD 29°C ✓ OOS14000G 000DXND6 osd.9 up/in -
...
```
## How It Works
@@ -76,13 +104,14 @@ Drive Atlas uses `/dev/disk/by-path/` to create persistent mappings between phys
**Example PCI path:**
```
pci-0000:0c:00.0-ata-1 → /dev/sda
pci-0000:01:00.0-sas-phy6-lun-0 → /dev/sde → Bay 5
```
This tells us:
- `0000:0c:00.0` - PCI bus address of the storage controller
- `ata-1` - Port 1 on that controller
- Maps to physical bay 3 on compute-storage-01
- `0000:01:00.0` - PCI bus address of the LSI SAS3008 HBA
- `sas-phy6` - SAS PHY 6 on that controller
- `lun-0` - Logical Unit Number
- Maps to physical bay 5 on compute-storage-01
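The breakdown above can be sketched as a small parser. This is only an illustration: `parse_by_path` is a hypothetical helper, not a function taken from `driveAtlas.sh`, and it assumes every name has the form `pci-<address>-<port>`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of driveAtlas.sh):
# split a /dev/disk/by-path name into the controller's PCI address
# and the port portion that follows it.
parse_by_path() {
    local name="$1"
    # Expected shape: pci-<domain>:<bus>:<device>.<function>-<port...>
    if [[ "$name" =~ ^pci-([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])-(.+)$ ]]; then
        printf 'controller=%s port=%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        # Not a PCI-based by-path name
        return 1
    fi
}

parse_by_path "pci-0000:01:00.0-sas-phy6-lun-0"
# prints: controller=0000:01:00.0 port=sas-phy6-lun-0
```

The controller part identifies which HBA or onboard controller the drive hangs off; the port part is what changes when a cable moves to a different connector.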
### Configuration
@@ -91,9 +120,10 @@ Server mappings are defined in the `SERVER_MAPPINGS` associative array in [drive
```bash
declare -A SERVER_MAPPINGS=(
["compute-storage-01"]="
pci-0000:0c:00.0-ata-1 3
pci-0000:0c:00.0-ata-2 4
pci-0000:0d:00.0-nvme-1 m2-1
pci-0000:0d:00.0-ata-2 1
pci-0000:0d:00.0-ata-1 2
pci-0000:01:00.0-sas-phy6-lun-0 5
pci-0000:0e:00.0-nvme-1 m2-1
"
)
```
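As a rough sketch of how a mapping in this format could be consulted, the snippet below looks up the bay for a given PCI path. The array repeats the entries shown above; `lookup_bay` is a hypothetical name used for illustration and may not match how `driveAtlas.sh` actually reads the array. Associative arrays require Bash 4 or newer.

```bash
#!/usr/bin/env bash
# Same format as the SERVER_MAPPINGS example above: one "<pci-path> <bay>"
# pair per line inside a multi-line string keyed by hostname.
declare -A SERVER_MAPPINGS=(
    ["compute-storage-01"]="
    pci-0000:0d:00.0-ata-2 1
    pci-0000:0d:00.0-ata-1 2
    pci-0000:01:00.0-sas-phy6-lun-0 5
    pci-0000:0e:00.0-nvme-1 m2-1
    "
)

# Hypothetical helper (an assumption, not taken from driveAtlas.sh):
# print the bay mapped to a PCI path on a host, or return 1 if unmapped.
lookup_bay() {
    local host="$1" wanted="$2" path bay
    while read -r path bay; do
        [[ -z "$path" ]] && continue          # skip blank lines
        if [[ "$path" == "$wanted" ]]; then
            printf '%s\n' "$bay"
            return 0
        fi
    done <<< "${SERVER_MAPPINGS[$host]:-}"
    return 1
}

lookup_bay "compute-storage-01" "pci-0000:01:00.0-sas-phy6-lun-0"
# prints: 5
```

An unknown host or an unmapped path returns a non-zero status, which a caller could use to flag the drive as unmapped.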
@@ -115,10 +145,11 @@ This will show all available PCI paths and their associated drives.
For each populated drive bay:
1. Note the physical bay number (labeled on chassis)
2. Identify a unique characteristic (size, model, or serial number)
3. Match it to the PCI path from the diagnostic output
2. Run the main script to see serial numbers
3. Match visible serial numbers on drives to the output
4. Map PCI paths to bay numbers
**Pro tip:** If uncertain, remove one drive at a time and re-run the diagnostic to see which PCI path disappears.
**Pro tip:** The script shows serial numbers - compare them to visible labels on drive trays to verify physical locations.
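To line up PCI paths with serial numbers by hand, something like the following may help. It is a minimal sketch, not the project's diagnostic script: `list_by_path` is a hypothetical function, and it assumes `lsblk` can report a serial for each device (a `-` is printed when it cannot).

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of driveAtlas.sh or
# diagnose-drives.sh): print "<by-path name> <device> <serial>" for each
# whole-disk entry under a by-path directory.
list_by_path() {
    local dir="${1:-/dev/disk/by-path}" link dev serial
    for link in "$dir"/pci-*; do
        [[ -L "$link" ]] || continue                  # skip non-symlinks and unmatched globs
        [[ "$link" == *-part[0-9]* ]] && continue     # skip partition entries
        dev=$(readlink -f "$link")
        # lsblk may be missing or may not know the device; fall back to "-"
        serial=$(lsblk -dno SERIAL "$dev" 2>/dev/null) || serial=""
        printf '%s %s %s\n' "$(basename "$link")" "$dev" "${serial:--}"
    done
}

list_by_path
```

Each output line gives the left-hand side of a mapping entry (the PCI path) next to the serial printed on the drive label, so only the bay number needs to be filled in.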
### Step 3: Create Mapping
@@ -152,30 +183,21 @@ Use debug mode to see the mappings:
DEBUG=1 bash driveAtlas.sh
```
## Output Example
## Output Columns Explained
```
┌──────────────────────────────────────────────────────────────┐
│ compute-storage-01 │
│ 10-Bay Hot-swap Chassis │
│ │
│ M.2 NVMe Slot │
│ ┌──────────┐ │
│ │ nvme0n1 │ │
│ └──────────┘ │
│ │
│ Front Hot-swap Bays │
│ ┌──────────┐┌──────────┐┌──────────┐┌──────────┐... │
│ │1: EMPTY ││2: EMPTY ││3: sda ││4: sdb │... │
│ └──────────┘└──────────┘└──────────┘└──────────┘... │
└──────────────────────────────────────────────────────────────┘
=== Drive Details with SMART Status ===
DEVICE SIZE TYPE TEMP HEALTH MODEL
--------------------------------------------------------------------------------
/dev/sda 2TB HDD 35°C ✓ WD20EFRX-68EUZN0
/dev/nvme0n1 1TB SSD 42°C ✓ Samsung 980 PRO
```
| Column | Description |
|--------|-------------|
| **BAY** | Physical bay number (1-10, m2-1, etc.) |
| **DEVICE** | Linux device name (/dev/sdX, /dev/nvmeXnY) |
| **SIZE** | Drive capacity |
| **TYPE** | SSD or HDD (detected via SMART) |
| **TEMP** | Current temperature from SMART |
| **HEALTH** | SMART health status (✓ = passed, ✗ = failed) |
| **MODEL** | Drive model number |
| **SERIAL** | Drive serial number (for physical verification) |
| **CEPH OSD** | Ceph OSD ID if drive hosts an OSD |
| **STATUS** | Ceph OSD status (up/in, down/out, etc.) |
| **USAGE** | Mount point or "BOOT" for system drive |
## Troubleshooting
@@ -190,7 +212,7 @@ DEVICE SIZE TYPE TEMP HEALTH MODEL
- Even identical motherboards can have different PCI addressing
- BIOS settings can affect PCI enumeration
- HBA installation in different PCIe slots changes addresses
- Cable routing to different SATA ports changes the ata-N number
- Cable routing to different SATA or SAS ports changes the ata-N or phyN number
### SMART data not showing
@@ -199,19 +221,32 @@ DEVICE SIZE TYPE TEMP HEALTH MODEL
- USB-connected drives may not support SMART
- Run `sudo smartctl -i /dev/sdX` manually to check
### Ceph OSD status shows "unknown/out"
- Ensure `ceph` and `ceph-volume` commands are available
- Check if the Ceph cluster is healthy: `ceph -s`
- Verify OSD is actually up: `ceph osd tree`
### Serial numbers don't match visible labels
- Some manufacturers use different serials for SMART vs. physical labels
- Cross-reference by drive model and size
- Use the removal method: power down, remove drive, check which bay becomes EMPTY
## Files
- [driveAtlas.sh](driveAtlas.sh) - Main script
- [diagnose-drives.sh](diagnose-drives.sh) - PCI path diagnostic tool
- [README.md](README.md) - This file
- [todo.txt](todo.txt) - Development notes
- [CLAUDE.md](CLAUDE.md) - AI-assisted development notes
- [todo.txt](todo.txt) - Development notes and task tracking
## Contributing
When adding support for a new server:
1. Run `diagnose-drives.sh` and save output
2. Physically label or identify drives
2. Physically label or identify drives by serial number
3. Create mapping in `SERVER_MAPPINGS`
4. Test thoroughly
5. Document any unique hardware configurations
@@ -231,11 +266,15 @@ PCI paths are deterministic and based on physical hardware topology.
### Bay Numbering Conventions
- **10-bay chassis:** Bays numbered 1-10 (left to right, top to bottom)
- **10-bay chassis:** Bays numbered 1-10 (left to right, typically)
- **M.2 slots:** Labeled as `m2-1`, `m2-2`, etc.
- **USB drives:** Labeled as `usb1`, `usb2`, etc.
- **Large1:** Grid numbering 1-9 (3x3 displayed, additional bays documented in mapping)
- **Large1:** Grid numbering 1-15 (documented in mapping)
## License
### Ceph Integration
Internal tool for LotusGuild infrastructure.
The script automatically detects Ceph OSDs using:
1. `ceph-volume lvm list` to map devices to OSD IDs
2. `ceph osd tree` to get up/down and in/out status
Status format: `up/in` means the OSD is running (`up`) and participating in the cluster (`in`).