Comprehensive documentation update and AI development notes

Updated README.md:
- Added feature list with emojis for visual clarity
- Documented all output columns with descriptions
- Added Ceph integration details
- Included troubleshooting for common issues
- Updated example output with current format
- Added status indicators (✅ ⚠️) for server mapping status

Created CLAUDE.md:
- Documented AI-assisted development process
- Chronicled evolution from basic script to comprehensive tool
- Detailed technical challenges and solutions
- Listed all phases of development
- Provided metrics and future enhancement ideas
- Lessons learned for future AI collaboration

This documents the complete journey from broken PCI paths to a
production-ready storage infrastructure management tool.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
139
README.md
@@ -4,12 +4,15 @@ A powerful server drive mapping tool that generates visual ASCII representations

 ## Features

-- Visual ASCII art maps showing physical drive bay layouts
-- Persistent drive identification using PCI paths (not device letters)
-- SMART health status and temperature monitoring
-- Support for SATA, NVMe, and USB drives
-- Detailed drive information including model, size, and health status
-- Per-server configuration for accurate physical-to-logical mapping
+- 🗺️ **Visual ASCII art maps** showing physical drive bay layouts
+- 🔗 **Persistent drive identification** using PCI paths (not device letters)
+- 🌡️ **SMART health monitoring** with temperature and status
+- 💾 **Multi-drive support** for SATA, NVMe, SAS, and USB drives
+- 🏷️ **Serial number tracking** for physical verification
+- 📊 **Bay-sorted output** matching physical layout
+- 🔵 **Ceph integration** showing OSD IDs and up/in status
+- 🥾 **Boot drive detection** identifying system drives
+- 🖥️ **Per-server configuration** for accurate physical-to-logical mapping

 ## Quick Start

@@ -30,6 +33,7 @@ bash <(wget -qO- http://10.10.10.63:3000/LotusGuild/driveAtlas/raw/branch/main/d
 - `smartctl` (from smartmontools package)
 - `lsblk` and `lspci` (typically pre-installed)
 - Optional: `nvme-cli` for NVMe drives
+- Optional: `ceph-volume` and `ceph` for Ceph OSD tracking

 ## Server Configurations

@@ -50,23 +54,47 @@ bash <(wget -qO- http://10.10.10.63:3000/LotusGuild/driveAtlas/raw/branch/main/d
 - 01:00.0 - LSI SAS3008 HBA (bays 5-10 via 2x mini-SAS HD)
 - 0d:00.0 - AMD SATA controller (bays 1-4)
 - 0e:00.0 - M.2 NVMe slot
-- **Status:** Fully mapped
+- **Status:** ✅ Fully mapped and verified

 #### storage-01
 - **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap)
 - **Motherboard:** Different from compute-storage-01
 - **Controllers:** Motherboard SATA only (no HBA currently)
-- **Status:** Requires PCI path mapping
+- **Status:** ⚠️ Requires PCI path mapping

 #### large1
 - **Chassis:** Unique 3x5 grid (15 bays total)
 - **Note:** 1/1 configuration, will not be replicated
-- **Status:** Requires PCI path mapping
+- **Status:** ⚠️ Requires PCI path mapping

 #### compute-storage-gpu-01
 - **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap)
 - **Motherboard:** Same as compute-storage-01
-- **Status:** Requires PCI path mapping
+- **Status:** ⚠️ Requires PCI path mapping

+## Output Example
+
+```
+┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
+│ compute-storage-01 - 10-Bay Hot-swap Chassis │
+│ │
+│ M.2 NVMe: nvme0n1 │
+│ │
+│ Front Hot-swap Bays: │
+│ │
+│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
+│ │1 :sdh │ │2 :sdg │ │3 :sdi │ │4 :sdj │ │5 :sde │ │6 :sdf │ │7 :sdd │ │8 :sda │ │9 :sdc │ │10:sdb │ │
+│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
+└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
+
+=== Drive Details with SMART Status (by Bay Position) ===
+BAY DEVICE SIZE TYPE TEMP HEALTH MODEL SERIAL CEPH OSD STATUS USAGE
+----------------------------------------------------------------------------------------------------------------------------------------------------
+1 /dev/sdh 223.6G SSD 27°C ✓ Crucial_CT240M500SSD1 14130C0E06DD - - /boot/efi
+2 /dev/sdg 1.8T HDD 26°C ✓ ST2000DM001-1ER164 Z4ZC4B6R osd.25 up/in -
+3 /dev/sdi 12.7T HDD 29°C ✓ OOS14000G 000DXND6 osd.9 up/in -
+...
+```
+
 ## How It Works

@@ -76,13 +104,14 @@ Drive Atlas uses `/dev/disk/by-path/` to create persistent mappings between phys

 **Example PCI path:**
 ```
-pci-0000:0c:00.0-ata-1 → /dev/sda
+pci-0000:01:00.0-sas-phy6-lun-0 → /dev/sde → Bay 5
 ```

 This tells us:
-- `0000:0c:00.0` - PCI bus address of the storage controller
-- `ata-1` - Port 1 on that controller
-- Maps to physical bay 3 on compute-storage-01
+- `0000:01:00.0` - PCI bus address of the LSI SAS3008 HBA
+- `sas-phy6` - SAS PHY 6 on that controller
+- `lun-0` - Logical Unit Number
+- Maps to physical bay 5 on compute-storage-01

 ### Configuration

@@ -91,9 +120,10 @@ Server mappings are defined in the `SERVER_MAPPINGS` associative array in [drive
 ```bash
 declare -A SERVER_MAPPINGS=(
 ["compute-storage-01"]="
-pci-0000:0c:00.0-ata-1 3
-pci-0000:0c:00.0-ata-2 4
-pci-0000:0d:00.0-nvme-1 m2-1
+pci-0000:0d:00.0-ata-2 1
+pci-0000:0d:00.0-ata-1 2
+pci-0000:01:00.0-sas-phy6-lun-0 5
+pci-0000:0e:00.0-nvme-1 m2-1
 "
 )
 ```
@@ -115,10 +145,11 @@ This will show all available PCI paths and their associated drives.
 For each populated drive bay:

 1. Note the physical bay number (labeled on chassis)
-2. Identify a unique characteristic (size, model, or serial number)
-3. Match it to the PCI path from the diagnostic output
+2. Run the main script to see serial numbers
+3. Match visible serial numbers on drives to the output
+4. Map PCI paths to bay numbers

-**Pro tip:** If uncertain, remove one drive at a time and re-run the diagnostic to see which PCI path disappears.
+**Pro tip:** The script shows serial numbers - compare them to visible labels on drive trays to verify physical locations.

 ### Step 3: Create Mapping

@@ -152,30 +183,21 @@ Use debug mode to see the mappings:
 DEBUG=1 bash driveAtlas.sh
 ```

-## Output Example
+## Output Columns Explained

-```
-┌──────────────────────────────────────────────────────────────┐
-│ compute-storage-01 │
-│ 10-Bay Hot-swap Chassis │
-│ │
-│ M.2 NVMe Slot │
-│ ┌──────────┐ │
-│ │ nvme0n1 │ │
-│ └──────────┘ │
-│ │
-│ Front Hot-swap Bays │
-│ ┌──────────┐┌──────────┐┌──────────┐┌──────────┐... │
-│ │1: EMPTY ││2: EMPTY ││3: sda ││4: sdb │... │
-│ └──────────┘└──────────┘└──────────┘└──────────┘... │
-└──────────────────────────────────────────────────────────────┘
-
-=== Drive Details with SMART Status ===
-DEVICE SIZE TYPE TEMP HEALTH MODEL
---------------------------------------------------------------------------------
-/dev/sda 2TB HDD 35°C ✓ WD20EFRX-68EUZN0
-/dev/nvme0n1 1TB SSD 42°C ✓ Samsung 980 PRO
-```
+| Column | Description |
+|--------|-------------|
+| **BAY** | Physical bay number (1-10, m2-1, etc.) |
+| **DEVICE** | Linux device name (/dev/sdX, /dev/nvmeXnY) |
+| **SIZE** | Drive capacity |
+| **TYPE** | SSD or HDD (detected via SMART) |
+| **TEMP** | Current temperature from SMART |
+| **HEALTH** | SMART health status (✓ = passed, ✗ = failed) |
+| **MODEL** | Drive model number |
+| **SERIAL** | Drive serial number (for physical verification) |
+| **CEPH OSD** | Ceph OSD ID if drive hosts an OSD |
+| **STATUS** | Ceph OSD status (up/in, down/out, etc.) |
+| **USAGE** | Mount point or "BOOT" for system drive |

 ## Troubleshooting

@@ -190,7 +212,7 @@ DEVICE SIZE TYPE TEMP HEALTH MODEL
 - Even identical motherboards can have different PCI addressing
 - BIOS settings can affect PCI enumeration
 - HBA installation in different PCIe slots changes addresses
-- Cable routing to different SATA ports changes the ata-N number
+- Cable routing to different SATA ports changes the ata-N or phy-N number

 ### SMART data not showing

@@ -199,19 +221,32 @@ DEVICE SIZE TYPE TEMP HEALTH MODEL
 - USB-connected drives may not support SMART
 - Run `sudo smartctl -i /dev/sdX` manually to check

+### Ceph OSD status shows "unknown/out"
+
+- Ensure `ceph` and `ceph-volume` commands are available
+- Check if the Ceph cluster is healthy: `ceph -s`
+- Verify OSD is actually up: `ceph osd tree`
+
+### Serial numbers don't match visible labels
+
+- Some manufacturers use different serials for SMART vs. physical labels
+- Cross-reference by drive model and size
+- Use the removal method: power down, remove drive, check which bay becomes EMPTY
+
 ## Files

 - [driveAtlas.sh](driveAtlas.sh) - Main script
 - [diagnose-drives.sh](diagnose-drives.sh) - PCI path diagnostic tool
 - [README.md](README.md) - This file
-- [todo.txt](todo.txt) - Development notes
+- [CLAUDE.md](CLAUDE.md) - AI-assisted development notes
+- [todo.txt](todo.txt) - Development notes and task tracking

 ## Contributing

 When adding support for a new server:

 1. Run `diagnose-drives.sh` and save output
-2. Physically label or identify drives
+2. Physically label or identify drives by serial number
 3. Create mapping in `SERVER_MAPPINGS`
 4. Test thoroughly
 5. Document any unique hardware configurations
@@ -231,11 +266,15 @@ PCI paths are deterministic and based on physical hardware topology.

 ### Bay Numbering Conventions

-- **10-bay chassis:** Bays numbered 1-10 (left to right, top to bottom)
+- **10-bay chassis:** Bays numbered 1-10 (left to right, typically)
 - **M.2 slots:** Labeled as `m2-1`, `m2-2`, etc.
 - **USB drives:** Labeled as `usb1`, `usb2`, etc.
-- **Large1:** Grid numbering 1-9 (3x3 displayed, additional bays documented in mapping)
+- **Large1:** Grid numbering 1-15 (documented in mapping)

-## License
+### Ceph Integration

-Internal tool for LotusGuild infrastructure.
+The script automatically detects Ceph OSDs using:
+1. `ceph-volume lvm list` to map devices to OSD IDs
+2. `ceph osd tree` to get up/down and in/out status
+
+Status format: `up/in` means OSD is running and participating in the cluster.
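
The by-path mapping that this README change describes can be illustrated with a minimal sketch. It is not taken from `driveAtlas.sh`: the names `EXAMPLE_MAPPINGS`, `lookup_bay`, and `pci_address` are hypothetical, and the mapping entries are copied from the `SERVER_MAPPINGS` example in the diff. It assumes each mapping line holds a by-path name followed by a bay label, separated by whitespace.

```bash
#!/usr/bin/env bash
# Hypothetical mapping, laid out like the SERVER_MAPPINGS example in the README:
# one "<by-path name> <bay label>" pair per line.
declare -A EXAMPLE_MAPPINGS=(
    ["compute-storage-01"]="
        pci-0000:0d:00.0-ata-2 1
        pci-0000:0d:00.0-ata-1 2
        pci-0000:01:00.0-sas-phy6-lun-0 5
        pci-0000:0e:00.0-nvme-1 m2-1
    "
)

# lookup_bay HOST BY_PATH_NAME
# Prints the bay label mapped to a by-path name; returns 1 if it is unmapped.
lookup_bay() {
    local host=$1 wanted=$2 path bay
    while read -r path bay; do
        # read strips the leading indentation; blank lines give an empty path.
        if [[ -n $path && $path == "$wanted" ]]; then
            printf '%s\n' "$bay"
            return 0
        fi
    done <<< "${EXAMPLE_MAPPINGS[$host]:-}"
    return 1
}

# pci_address BY_PATH_NAME
# Prints the controller's PCI address, e.g. 0000:01:00.0.
pci_address() {
    local name=${1#pci-}          # drop the leading "pci-"
    printf '%s\n' "${name%%-*}"   # keep everything before the first "-"
}

lookup_bay compute-storage-01 pci-0000:01:00.0-sas-phy6-lun-0   # prints 5
pci_address pci-0000:01:00.0-sas-phy6-lun-0                     # prints 0000:01:00.0
```

On a live system, the by-path name would come from the entries under `/dev/disk/by-path/`, and `readlink -f` on such an entry resolves it to the current `/dev/sdX` or `/dev/nvmeXnY` device.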