# GANDALF (Global Advanced Network Detection And Link Facilitator)
> Because it shall not let problems pass.

Network monitoring dashboard for the LotusGuild Proxmox cluster.
Deployed on **LXC 157** (monitor-02 / 10.10.10.9), reachable at `gandalf.lotusguild.org`.
**Design System**: [web_template](https://code.lotusguild.org/LotusGuild/web_template) — shared CSS, JS, and layout patterns for all LotusGuild apps
## Styling & Layout
GANDALF uses the **LotusGuild Terminal Design System**. For all styling, component, and layout documentation see:
- [`web_template/README.md`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/README.md) — full component reference, CSS variables, JS API
- [`web_template/base.css`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/base.css) — unified CSS (`.lt-*` classes)
- [`web_template/base.js`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/base.js) — `window.lt` utilities (toast, modal, auto-refresh, fetch helpers)
- [`web_template/python/base.html`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/python/base.html) — Jinja2 base template
- [`web_template/python/auth.py`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/python/auth.py) — `@require_auth` decorator pattern
---
## Architecture
Two processes share a MariaDB database:
| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |
```
[Prometheus :9090] ──▶
[UniFi Controller] ──▶  monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[Pulse Worker]     ──▶
[SSH / ethtool]    ──▶
```
### Data Sources
| Source | What it provides |
|---|---|
| **Prometheus** (`10.10.10.48:9090`) | Physical NIC link state + traffic/error rates via `node_exporter` |
| **UniFi API** (`https://10.10.10.1`) | Switch port stats, device status, LLDP neighbor table, PoE data |
| **Pulse Worker** | SSH relay — runs `ethtool` + SFP DOM queries on each Proxmox host |
| **Ping** | Reachability for hosts without `node_exporter` (e.g. PBS) |
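The daemon's actual query handling isn't shown in this README, but the shape of the Prometheus side is simple: an instant query such as `node_network_up` returns one sample per (instance, device) pair, with the value as a string. A minimal sketch of reducing such a response to per-NIC link states (function and variable names are illustrative, not Gandalf's actual API):

```python
def link_states(prom_response: dict) -> dict:
    """Map (instance, device) -> up?/down? from a Prometheus instant-query result."""
    states = {}
    for sample in prom_response.get("data", {}).get("result", []):
        metric = sample["metric"]
        instance = metric.get("instance", "?")
        device = metric.get("device", "?")
        # Instant-query values arrive as [timestamp, "value-as-string"].
        states[(instance, device)] = sample["value"][1] == "1"
    return states


# Illustrative payload in the standard /api/v1/query response shape:
sample = {
    "data": {
        "result": [
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno1"},
             "value": [1700000000, "1"]},
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno2"},
             "value": [1700000000, "0"]},
        ]
    }
}
print(link_states(sample))
```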
### Monitored Hosts (Prometheus / node_exporter)
| Host | Prometheus Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |
Ping-only (no node_exporter): **pbs** (10.10.10.3)
---
## Pages
### Dashboard (`/`)
- Real-time host status grid with per-NIC link state (UP / DOWN / degraded)
- Network topology diagram (Internet → Gateway → Switches → Hosts)
- UniFi device table (switches, APs, gateway)
- Active alerts table with severity, target, consecutive failures, ticket link
- Quick-suppress modal: apply timed or manual suppression from any alert row
- Auto-refreshes every 30 seconds via `/api/status` + `/api/network`
### Link Debug (`/links`)
Per-interface statistics collected every poll cycle. All panels are collapsible
(click header or use Collapse All / Expand All). Collapse state persists across
page refreshes via `sessionStorage`.
**Server NICs** (via Prometheus + SSH/ethtool):
- Speed, duplex, auto-negotiation, link detected
- TX/RX rate bars (bandwidth utilisation % of link capacity)
- TX/RX error and drop rates per second
- Carrier changes (cumulative since boot — watch for flapping)
- **SFP / Optical panel** (when SFP module present): vendor/PN, temp, voltage,
bias current, TX power (dBm), RX power (dBm), RXTX delta, per-stat bars
**UniFi Switch Ports** (via UniFi API):
- Port number badge (`#N`), UPLINK badge, PoE draw badge
- LLDP neighbor line: `→ system_name (port_id)` when neighbor is detected
- PoE class and max wattage line
- Speed, duplex, auto-neg, TX/RX rates, errors, drops
### Inspector (`/inspector`)
Visual switch chassis diagrams. Each switch is rendered model-accurately using
layout config in the template (`SWITCH_LAYOUTS`).
**Port block colours:**
| Colour | State |
|---|---|
| Green | Up, no active PoE |
| Amber | Up with active PoE draw |
| Cyan | Uplink port (up) |
| Grey | Down |
| White outline | Currently selected |
**Clicking a port** opens the right-side detail panel showing:
- Link stats (status, speed, duplex, auto-neg, media type)
- PoE (class, max wattage, current draw, mode)
- Traffic (TX/RX rates)
- Errors/drops per second
- **LLDP Neighbor** section (system name, port ID, chassis ID, management IPs)
- **Path Debug** (auto-appears when LLDP `system_name` matches a known server):
two-column comparison of the switch port stats vs. the server NIC stats,
including SFP DOM data if the server side has an SFP module
**LLDP path debug requirements:**
1. Server must run `lldpd`: `apt install lldpd && systemctl enable --now lldpd`
2. `lldpd` hostname must match the key in `data.hosts` (set via `config.json → hosts`)
3. The switch must have LLDP enabled (UniFi default: on)
**Supported switch models** (set `SWITCH_LAYOUTS` keys to your UniFi model codes):
| Key | Model | Layout |
|---|---|---|
| `USF5P` | UniFi Switch Flex 5 PoE | 4×RJ45 + 1×SFP uplink |
| `USL8A` | UniFi Switch Lite 8 PoE | 8×SFP (2 rows of 4) |
| `US24PRO` | UniFi Switch Pro 24 | 24×RJ45 staggered + 2×SFP |
| `USPPDUP` | Custom/other | Single-port fallback |
| `USMINI` | UniFi Switch Mini | 5-port row |
Add new layouts by adding a key to `SWITCH_LAYOUTS` matching the `model` field
returned by the UniFi API for that device.
### Suppressions (`/suppressions`)
- Create timed (30 min / 1 hr / 4 hr / 8 hr) or manual suppressions
- Target types: host, interface, UniFi device, or global
- Active suppressions table with one-click removal
- Suppression history (last 50)
- Available targets reference grid (all known hosts + interfaces)
---
## Alert Logic
### Ticket Triggers
| Condition | Priority |
|---|---|
| UniFi device offline (≥2 consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (≥2 consecutive checks) | P2 High |
| Host unreachable via ping (≥2 consecutive checks) | P2 High |
| ≥3 hosts simultaneously reporting interface failures | P1 Critical |
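The consecutive-check thresholds above amount to a per-target counter that resets on success and fires once at the crossing. A sketch (illustrative, not `monitor.py`'s actual code; the constants mirror the `monitor.*` config defaults):

```python
from collections import defaultdict

FAILURE_THRESHOLD = 2   # config: monitor.failure_threshold
CLUSTER_THRESHOLD = 3   # config: monitor.cluster_threshold

failures = defaultdict(int)  # target ("host/iface") -> consecutive failures

def record_check(target, ok):
    """Return 'P2' exactly when a target crosses the failure threshold."""
    if ok:
        failures[target] = 0
        return None
    failures[target] += 1
    if failures[target] == FAILURE_THRESHOLD:  # fire once, on the crossing
        return "P2"
    return None

def cluster_alert():
    """True when enough distinct hosts are simultaneously failing for a P1."""
    failing_hosts = {t.split("/")[0] for t, n in failures.items()
                     if n >= FAILURE_THRESHOLD}
    return len(failing_hosts) >= CLUSTER_THRESHOLD
```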
### Baseline Tracking
Interfaces that are **down on first observation** (unused ports, unplugged cables)
are recorded as `initial_down` and never alerted. Only **UP→DOWN regressions**
generate tickets. Baseline is stored in MariaDB and survives daemon restarts.
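The baseline rule is a small per-interface state machine. A sketch of the transitions described above (illustrative; the daemon's real code persists `baseline` to the `monitor_state` table):

```python
def evaluate(baseline, iface, up):
    """Update the baseline dict in place; return True only on an UP->DOWN regression."""
    if iface not in baseline:
        baseline[iface] = "up" if up else "initial_down"
        return False                   # first observation never alerts
    if baseline[iface] == "initial_down":
        if up:
            baseline[iface] = "up"     # starts tracking once it comes up
        return False
    if baseline[iface] == "up" and not up:
        baseline[iface] = "down"
        return True                    # UP -> DOWN regression: alertable
    if up:
        baseline[iface] = "up"         # recovery
    return False
```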
### Suppression Targets
| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |
Suppressions can be manual (persist until removed) or timed (auto-expire).
Expired suppressions are checked at evaluation time — no background cleanup needed.
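The evaluation-time expiry check is a single comparison. A sketch using the `active`/`expires_at` fields from the `suppression_rules` table (the helper itself is illustrative):

```python
from datetime import datetime, timedelta

def is_active(rule, now):
    """A rule suppresses only if still flagged active and not yet expired."""
    if not rule.get("active", True):
        return False
    expires = rule.get("expires_at")   # None => manual, persists until removed
    return expires is None or expires > now

now = datetime(2025, 1, 1, 12, 0)
timed = {"active": True, "expires_at": now + timedelta(minutes=30)}
manual = {"active": True, "expires_at": None}
print(is_active(timed, now), is_active(timed, now + timedelta(hours=1)))
```

Because expired rules simply evaluate to inactive, rows never need to be deleted on a timer.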
---
## Configuration (`config.json`)
Shared by both processes. Located in the working directory (`/var/www/html/prod/`).
```json
{
  "database": {
    "host": "10.10.10.50",
    "port": 3306,
    "user": "gandalf",
    "password": "...",
    "name": "gandalf"
  },
  "prometheus": {
    "url": "http://10.10.10.48:9090"
  },
  "unifi": {
    "controller": "https://10.10.10.1",
    "api_key": "...",
    "site_id": "default"
  },
  "ticket_api": {
    "url": "https://t.lotusguild.org/api/tickets",
    "api_key": "..."
  },
  "pulse": {
    "url": "http://<pulse-host>:<port>",
    "api_key": "...",
    "worker_id": "...",
    "timeout": 45
  },
  "auth": {
    "allowed_groups": ["admin"]
  },
  "hosts": [
    { "name": "large1", "prometheus_instance": "10.10.10.2:9100" },
    { "name": "compute-storage-01", "prometheus_instance": "10.10.10.4:9100" },
    { "name": "micro1", "prometheus_instance": "10.10.10.8:9100" },
    { "name": "monitor-02", "prometheus_instance": "10.10.10.9:9100" },
    { "name": "compute-storage-gpu-01", "prometheus_instance": "10.10.10.10:9100" },
    { "name": "storage-01", "prometheus_instance": "10.10.10.11:9100" }
  ],
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [
      { "name": "pbs", "ip": "10.10.10.3" }
    ]
  }
}
```
### Key Config Fields
| Key | Description |
|---|---|
| `database.*` | MariaDB credentials (LXC 149 at 10.10.10.50) |
| `prometheus.url` | Prometheus base URL |
| `unifi.controller` | UniFi controller base URL (HTTPS, self-signed cert ignored) |
| `unifi.api_key` | UniFi API key from controller Settings → API |
| `unifi.site_id` | UniFi site ID (default: `default`) |
| `ticket_api.api_key` | Tinker Tickets bearer token |
| `pulse.url` | Pulse worker API base URL (for SSH relay) |
| `pulse.worker_id` | Which Pulse worker runs ethtool collection |
| `pulse.timeout` | Max seconds to wait for SSH collection per host |
| `auth.allowed_groups` | Authelia groups that may access Gandalf |
| `hosts` | Maps Prometheus instance labels → display hostnames |
| `monitor.poll_interval` | Seconds between full check cycles (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before creating ticket (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures to trigger cluster-wide P1 (default: 3) |
| `monitor.ping_hosts` | Hosts checked only by ping (no node_exporter) |
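Both daemons read the same file from the working directory. A minimal sketch of applying the defaults listed above when loading it (the helper name is illustrative, not Gandalf's actual API):

```python
import json

# Defaults mirror the Key Config Fields table above.
MONITOR_DEFAULTS = {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
}

def load_config(path="config.json"):
    """Load config.json and fill in monitor.* defaults for any missing keys."""
    with open(path) as fh:
        cfg = json.load(fh)
    monitor = cfg.setdefault("monitor", {})
    for key, value in MONITOR_DEFAULTS.items():
        monitor.setdefault(key, value)
    return cfg
```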
---
## Deployment (LXC 157)
### 1. Database — MariaDB LXC 149 (`10.10.10.50`)
```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```
Import schema:
```bash
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```
### 2. LXC 157 — Install dependencies
```bash
pip3 install -r requirements.txt
# Ensure sshpass is available (used by deploy scripts)
apt install sshpass
```
### 3. Deploy files
```bash
# From the dev machine, in /root/code/gandalf:
for f in app.py db.py monitor.py config.json schema.sql \
         static/style.css static/app.js \
         templates/*.html; do
  sshpass -p 'yourpass' scp -o StrictHostKeyChecking=no \
    "$f" "root@10.10.10.61:/var/www/html/prod/$f"
done
# Restart on the target LXC (not locally):
sshpass -p 'yourpass' ssh root@10.10.10.61 \
  'systemctl restart gandalf gandalf-monitor'
```
### 4. systemd services
**`gandalf.service`** (Flask/gunicorn web app):
```ini
[Unit]
Description=Gandalf Web Dashboard
After=network.target
[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
Restart=always
[Install]
WantedBy=multi-user.target
```
**`gandalf-monitor.service`** (background polling daemon):
```ini
[Unit]
Description=Gandalf Network Monitor Daemon
After=network.target
[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 monitor.py
Restart=always
[Install]
WantedBy=multi-user.target
```
### 5. Authelia rule (LXC 167)
```yaml
access_control:
  rules:
    - domain: gandalf.lotusguild.org
      policy: one_factor
      subject:
        - "group:admin"
```
```bash
systemctl restart authelia
```
### 6. NPM reverse proxy
- **Domain:** `gandalf.lotusguild.org`
- **Forward to:** `http://10.10.10.61:8000` (gunicorn direct, no nginx needed on LXC)
- **Forward Auth:** Authelia at `http://10.10.10.167:9091`
- **WebSockets:** Not required
---
## Service Management
```bash
# Status
systemctl status gandalf gandalf-monitor
# Logs (live)
journalctl -u gandalf -f
journalctl -u gandalf-monitor -f
# Restart after code or config changes
systemctl restart gandalf gandalf-monitor
```
---
## Troubleshooting
### Monitor not creating tickets
- Verify `config.json → ticket_api.api_key` is set and valid
- Check `journalctl -u gandalf-monitor` for `Ticket creation failed` lines
- Confirm the Tinker Tickets API is reachable from LXC 157
### Link Debug shows no data / "Loading…" forever
- Check `gandalf-monitor.service` is running and has completed at least one cycle
- Check `journalctl -u gandalf-monitor` for Prometheus or UniFi errors
- Verify Prometheus is reachable: `curl 'http://10.10.10.48:9090/api/v1/query?query=up'`
### Link Debug: SFP DOM panel missing
- SFP data requires Pulse worker + SSH access to hosts
- Verify `config.json → pulse.*` is configured and the Pulse worker is running
- Confirm `sshpass` + SSH access from the Pulse worker to each Proxmox host
- Only interfaces with physical SFP modules return DOM data (`ethtool -m`)
### Inspector: path debug section not appearing
- Requires LLDP: run `apt install lldpd && systemctl enable --now lldpd` on each server
- The LLDP `system_name` broadcast by `lldpd` must match the hostname in `config.json → hosts[].name`
- Override: `echo 'configure system hostname large1' > /etc/lldpd.d/hostname.conf && systemctl restart lldpd`
- Allow up to 2 poll cycles (240s at the default interval) after installing lldpd for the LLDP table to populate
### Inspector: switch chassis shows as flat list (no layout)
- The switch's `model` field from UniFi doesn't match any key in `SWITCH_LAYOUTS` in `inspector.html`
- Check the UniFi API: the model appears in the `link_stats` API response under `unifi_switches.<name>.model`
- Add the model key to `SWITCH_LAYOUTS` in `inspector.html` with the correct row/SFP layout
### Baseline re-initializing on every restart
- `interface_baseline` is stored in the `monitor_state` DB table; survives restarts
- If it appears to reset: check DB connectivity from the monitor daemon
### Interface stuck at "initial_down" forever
- This means the interface was down when the monitor first saw it
- It will begin tracking once it comes up; or manually clear it:
```sql
-- In MariaDB on 10.10.10.50:
UPDATE monitor_state SET value='{}' WHERE key_name='interface_baseline';
```
Then restart the monitor: `systemctl restart gandalf-monitor`
### Prometheus data missing for a host
```bash
# On the affected host:
systemctl status prometheus-node-exporter
# Verify it's scraped:
curl -s 'http://10.10.10.48:9090/api/v1/query?query=up' | jq '.data.result[] | select(.metric.job=="node")'
```
---
## Development Notes
### File Layout
```
gandalf/
├── app.py               # Flask web app (routes, auth, API endpoints)
├── monitor.py           # Background daemon (Prometheus, UniFi, Pulse, alert logic)
├── db.py                # Database operations (MariaDB via pymysql, thread-local conn reuse)
├── schema.sql           # Database schema (network_events, suppression_rules, monitor_state)
├── config.json          # Runtime configuration (not committed with secrets)
├── requirements.txt     # Python dependencies
├── static/
│   ├── style.css        # Terminal aesthetic CSS (CRT scanlines, green-on-black)
│   └── app.js           # Dashboard JS (auto-refresh, host grid, events, suppress modal)
└── templates/
    ├── base.html        # Shared layout (header, nav, footer)
    ├── index.html       # Dashboard page
    ├── links.html       # Link Debug page (server NICs + UniFi switch ports)
    ├── inspector.html   # Visual switch inspector + LLDP path debug
    └── suppressions.html  # Suppression management page
```
### Adding a New Monitored Host
1. Install `prometheus-node-exporter` on the host
2. Add a scrape target to Prometheus config
3. Add an entry to `config.json → hosts`:
```json
{ "name": "newhost", "prometheus_instance": "10.10.10.X:9100" }
```
4. Restart monitor: `systemctl restart gandalf-monitor`
5. For SFP DOM / ethtool: ensure the host is SSH-accessible from the Pulse worker
### Adding a New Switch Layout (Inspector)
Find the UniFi model code for the switch (it appears in the `/api/links` JSON response
under `unifi_switches.<switch_name>.model`), then add to `SWITCH_LAYOUTS` in
`templates/inspector.html`:
```javascript
'MYNEWMODEL': {
  rows: [[1,2,3,4,5,6,7,8], [9,10,11,12,13,14,15,16]], // port_idx by row
  sfp_section: [17, 18], // separate SFP cage ports (rendered below rows)
  sfp_ports: [],         // port_idx values that are SFP-type within rows
},
```
### Database Schema Notes
- `network_events`: one row per active event; `resolved_at` is set when recovered
- `suppression_rules`: `active=FALSE` when removed; `expires_at` checked at query time
- `monitor_state`: key/value store; `interface_baseline` and `link_stats` are JSON blobs
### Security Notes
- **XSS prevention**: all user-controlled data in dynamically generated HTML uses
`escHtml()` (JS) or Jinja2 auto-escaping (Python). Suppress buttons use `data-*`
attributes + a single delegated click listener rather than inline `onclick` with
interpolated strings.
- **Interface name validation**: `monitor.py` validates SSH interface names against
`^[a-zA-Z0-9_.@-]+$` before use, and additionally wraps them with `shlex.quote()`
for defense-in-depth.
- **DB parameters**: all SQL uses parameterised queries via pymysql — no string
concatenation into SQL.
- **Auth**: Authelia enforces admin-only access at the nginx/LXC 167 layer; the Flask
app additionally checks the `Remote-User` header via `@require_auth`.
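The interface-name check described above can be sketched as a two-layer guard (illustrative helper name; the regex matches the one quoted above):

```python
import re
import shlex

IFACE_RE = re.compile(r"^[a-zA-Z0-9_.@-]+$")

def safe_iface_arg(name):
    """Whitelist-validate an interface name, then shell-quote it for SSH use."""
    if not IFACE_RE.match(name):
        raise ValueError(f"invalid interface name: {name!r}")
    return shlex.quote(name)  # defence-in-depth; a no-op for whitelisted names
```

Because the whitelist already excludes every shell metacharacter, `shlex.quote()` is redundant for accepted names, which is exactly the point of defence-in-depth: either layer alone would block injection.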
### Known Limitations
- Single gunicorn worker (`--workers 1`) — `db.py` reuses one MariaDB connection per
  thread via thread-local storage. Multiple workers would each simply hold their own
  connections, which is safe, but the reuse optimisation only helps within one worker,
  so a single worker keeps things simple.
- No CSRF tokens on API endpoints — mitigated by Authelia session cookies being
`SameSite=Strict` and the site being admin-only.
- SSH collection via Pulse is synchronous — if Pulse is slow, the entire monitor cycle
is delayed. The `pulse.timeout` config controls the max wait.
- UniFi LLDP data is only as fresh as the last monitor poll (120s default).
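The thread-local pattern behind the single-worker note can be sketched as follows (illustrative; `db.py`'s real code wraps `pymysql.connect`):

```python
import threading

class ThreadLocalConns:
    """Lazily open one connection per thread, then reuse it on later calls."""

    def __init__(self, connect):
        self._connect = connect          # factory, e.g. pymysql.connect(**cfg)
        self._local = threading.local()  # per-thread storage

    def get(self):
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = self._local.conn = self._connect()
        return conn
```

Each OS thread sees its own `self._local.conn`, so repeated calls within one request thread reuse a single connection while separate threads (or worker processes) get their own.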