# GANDALF (Global Advanced Network Detection And Link Facilitator)

> Because it shall not let problems pass.

Network monitoring dashboard for the LotusGuild Proxmox cluster. Deployed on **LXC 157** (monitor-02 / 10.10.10.9), reachable at `gandalf.lotusguild.org`.

**Design System**: [web_template](https://code.lotusguild.org/LotusGuild/web_template) — shared CSS, JS, and layout patterns for all LotusGuild apps

## Styling & Layout

GANDALF uses the **LotusGuild Terminal Design System**. For all styling, component, and layout documentation see:

- [`web_template/README.md`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/README.md) — full component reference, CSS variables, JS API
- [`web_template/base.css`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/base.css) — unified CSS (`.lt-*` classes)
- [`web_template/base.js`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/base.js) — `window.lt` utilities (toast, modal, auto-refresh, fetch helpers)
- [`web_template/python/base.html`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/python/base.html) — Jinja2 base template
- [`web_template/python/auth.py`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/python/auth.py) — `@require_auth` decorator pattern

---

## Architecture

Two processes share a MariaDB database:

| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |

```
[Prometheus :9090] ──▶
[UniFi Controller] ──▶  monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[Pulse Worker]     ──▶
[SSH / ethtool]    ──▶
```

### Data Sources

| Source | What it provides |
|---|---|
| **Prometheus** (`10.10.10.48:9090`) | Physical NIC link state + traffic/error rates via `node_exporter` |
| **UniFi API** (`https://10.10.10.1`) | Switch port stats, device status, LLDP neighbor table, PoE data |
| **Pulse Worker** | SSH relay — runs `ethtool` + SFP DOM queries on each Proxmox host |
| **Ping** | Reachability for hosts without `node_exporter` (e.g. PBS) |

### Monitored Hosts (Prometheus / node_exporter)

| Host | Prometheus Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |

Ping-only (no node_exporter): **pbs** (10.10.10.3)

---

## Pages

### Dashboard (`/`)

- Real-time host status grid with per-NIC link state (UP / DOWN / degraded)
- Network topology diagram (Internet → Gateway → Switches → Hosts)
- UniFi device table (switches, APs, gateway)
- Active alerts table with severity, target, consecutive failures, ticket link
- Quick-suppress modal: apply timed or manual suppression from any alert row
- Auto-refreshes every 30 seconds via `/api/status` + `/api/network`

### Link Debug (`/links`)

Per-interface statistics collected every poll cycle. All panels are collapsible (click header or use Collapse All / Expand All). Collapse state persists across page refreshes via `sessionStorage`.

**Server NICs** (via Prometheus + SSH/ethtool):

- Speed, duplex, auto-negotiation, link detected
- TX/RX rate bars (bandwidth utilisation % of link capacity)
- TX/RX error and drop rates per second
- Carrier changes (cumulative since boot — watch for flapping)
- **SFP / Optical panel** (when SFP module present): vendor/PN, temp, voltage, bias current, TX power (dBm), RX power (dBm), RX−TX delta, per-stat bars

**UniFi Switch Ports** (via UniFi API):

- Port number badge (`#N`), UPLINK badge, PoE draw badge
- LLDP neighbor line: `→ system_name (port_id)` when neighbor is detected
- PoE class and max wattage line
- Speed, duplex, auto-neg, TX/RX rates, errors, drops

### Inspector (`/inspector`)

Visual switch chassis diagrams.
Each switch is rendered model-accurately using layout config in the template (`SWITCH_LAYOUTS`).

**Port block colours:**

| Colour | State |
|---|---|
| Green | Up, no active PoE |
| Amber | Up with active PoE draw |
| Cyan | Uplink port (up) |
| Grey | Down |
| White outline | Currently selected |

**Clicking a port** opens the right-side detail panel showing:

- Link stats (status, speed, duplex, auto-neg, media type)
- PoE (class, max wattage, current draw, mode)
- Traffic (TX/RX rates)
- Errors/drops per second
- **LLDP Neighbor** section (system name, port ID, chassis ID, management IPs)
- **Path Debug** (auto-appears when LLDP `system_name` matches a known server): two-column comparison of the switch port stats vs. the server NIC stats, including SFP DOM data if the server side has an SFP module

**LLDP path debug requirements:**

1. Server must run `lldpd`: `apt install lldpd && systemctl enable --now lldpd`
2. `lldpd` hostname must match the key in `data.hosts` (set via `config.json → hosts`)
3. Switch must have LLDP enabled (UniFi default: on)

**Supported switch models** (set `SWITCH_LAYOUTS` keys to your UniFi model codes):

| Key | Model | Layout |
|---|---|---|
| `USF5P` | UniFi Switch Flex 5 PoE | 4×RJ45 + 1×SFP uplink |
| `USL8A` | UniFi Switch Lite 8 PoE | 8×SFP (2 rows of 4) |
| `US24PRO` | UniFi Switch Pro 24 | 24×RJ45 staggered + 2×SFP |
| `USPPDUP` | Custom/other | Single-port fallback |
| `USMINI` | UniFi Switch Mini | 5-port row |

Add new layouts by adding a key to `SWITCH_LAYOUTS` matching the `model` field returned by the UniFi API for that device.
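The path-debug matching described above can be sketched in a few lines. This is a hypothetical illustration, not Gandalf's actual code — the function name `match_lldp_neighbor` and the exact shapes of `port` and `hosts` are assumptions; the only behaviour taken from this document is the exact-name match between the LLDP `system_name` and the configured `hosts[].name` keys:

```python
def match_lldp_neighbor(port, hosts):
    """Return the known host name whose entry matches the port's LLDP system_name."""
    name = (port.get("lldp") or {}).get("system_name", "").strip()
    # Exact match against keys derived from config.json -> hosts[].name
    return name if name and name in hosts else None

# A switch port whose neighbour (via lldpd) announces itself as "large1":
port = {"port_idx": 5, "lldp": {"system_name": "large1", "port_id": "eno1"}}
hosts = {"large1": {"prometheus_instance": "10.10.10.2:9100"}}
print(match_lldp_neighbor(port, hosts))  # -> large1
```

Because the match is exact, any hostname drift between `lldpd` and `config.json` (see the requirements list above) silently disables the Path Debug panel for that port.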
### Suppressions (`/suppressions`)

- Create timed (30 min / 1 hr / 4 hr / 8 hr) or manual suppressions
- Target types: host, interface, UniFi device, or global
- Active suppressions table with one-click removal
- Suppression history (last 50)
- Available targets reference grid (all known hosts + interfaces)

---

## Alert Logic

### Ticket Triggers

| Condition | Priority |
|---|---|
| UniFi device offline (≥2 consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (≥2 consecutive checks) | P2 High |
| Host unreachable via ping (≥2 consecutive checks) | P2 High |
| ≥3 hosts simultaneously reporting interface failures | P1 Critical |

### Baseline Tracking

Interfaces that are **down on first observation** (unused ports, unplugged cables) are recorded as `initial_down` and never alerted. Only **UP→DOWN regressions** generate tickets. Baseline is stored in MariaDB and survives daemon restarts.

### Suppression Targets

| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |

Suppressions can be manual (persist until removed) or timed (auto-expire). Expired suppressions are checked at evaluation time — no background cleanup needed.

---

## Configuration (`config.json`)

Shared by both processes. Located in the working directory (`/var/www/html/prod/`).

```json
{
  "database": { "host": "10.10.10.50", "port": 3306, "user": "gandalf", "password": "...", "name": "gandalf" },
  "prometheus": { "url": "http://10.10.10.48:9090" },
  "unifi": { "controller": "https://10.10.10.1", "api_key": "...", "site_id": "default" },
  "ticket_api": { "url": "https://t.lotusguild.org/api/tickets", "api_key": "..." },
  "pulse": { "url": "http://:", "api_key": "...", "worker_id": "...", "timeout": 45 },
  "auth": { "allowed_groups": ["admin"] },
  "hosts": [
    { "name": "large1", "prometheus_instance": "10.10.10.2:9100" },
    { "name": "compute-storage-01", "prometheus_instance": "10.10.10.4:9100" },
    { "name": "micro1", "prometheus_instance": "10.10.10.8:9100" },
    { "name": "monitor-02", "prometheus_instance": "10.10.10.9:9100" },
    { "name": "compute-storage-gpu-01", "prometheus_instance": "10.10.10.10:9100" },
    { "name": "storage-01", "prometheus_instance": "10.10.10.11:9100" }
  ],
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [ { "name": "pbs", "ip": "10.10.10.3" } ]
  }
}
```

### Key Config Fields

| Key | Description |
|---|---|
| `database.*` | MariaDB credentials (LXC 149 at 10.10.10.50) |
| `prometheus.url` | Prometheus base URL |
| `unifi.controller` | UniFi controller base URL (HTTPS, self-signed cert ignored) |
| `unifi.api_key` | UniFi API key from controller Settings → API |
| `unifi.site_id` | UniFi site ID (default: `default`) |
| `ticket_api.api_key` | Tinker Tickets bearer token |
| `pulse.url` | Pulse worker API base URL (for SSH relay) |
| `pulse.worker_id` | Which Pulse worker runs ethtool collection |
| `pulse.timeout` | Max seconds to wait for SSH collection per host |
| `auth.allowed_groups` | Authelia groups that may access Gandalf |
| `hosts` | Maps Prometheus instance labels → display hostnames |
| `monitor.poll_interval` | Seconds between full check cycles (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before creating ticket (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures to trigger cluster-wide P1 (default: 3) |
| `monitor.ping_hosts` | Hosts checked only by ping (no node_exporter) |

---

## Deployment (LXC 157)

### 1.
Database — MariaDB on LXC 149 (`10.10.10.50`)

```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```

Import schema:

```bash
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```

### 2. LXC 157 — Install dependencies

```bash
pip3 install -r requirements.txt

# Ensure sshpass is available (used by deploy scripts)
apt install sshpass
```

### 3. Deploy files

```bash
# From the dev machine, in /root/code/gandalf:
for f in app.py db.py monitor.py config.json schema.sql \
         static/style.css static/app.js \
         templates/*.html; do
  sshpass -p 'yourpass' scp -o StrictHostKeyChecking=no \
    "$f" "root@10.10.10.61:/var/www/html/prod/$f"
done
systemctl restart gandalf gandalf-monitor
```

### 4. systemd services

**`gandalf.service`** (Flask/gunicorn web app). Note the bind address: NPM forwards directly to port 8000 on this LXC, so gunicorn must listen on all interfaces rather than loopback:

```ini
[Unit]
Description=Gandalf Web Dashboard
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 0.0.0.0:8000 app:app
Restart=always

[Install]
WantedBy=multi-user.target
```

**`gandalf-monitor.service`** (background polling daemon):

```ini
[Unit]
Description=Gandalf Network Monitor Daemon
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 monitor.py
Restart=always

[Install]
WantedBy=multi-user.target
```

### 5. Authelia rule (LXC 167)

```yaml
access_control:
  rules:
    - domain: gandalf.lotusguild.org
      policy: one_factor
      subject:
        - group:admin
```

```bash
systemctl restart authelia
```

### 6.
NPM reverse proxy

- **Domain:** `gandalf.lotusguild.org`
- **Forward to:** `http://10.10.10.61:8000` (gunicorn direct, no nginx needed on LXC)
- **Forward Auth:** Authelia at `http://10.10.10.167:9091`
- **WebSockets:** Not required

---

## Service Management

```bash
# Status
systemctl status gandalf gandalf-monitor

# Logs (live)
journalctl -u gandalf -f
journalctl -u gandalf-monitor -f

# Restart after code or config changes
systemctl restart gandalf gandalf-monitor
```

---

## Troubleshooting

### Monitor not creating tickets

- Verify `config.json → ticket_api.api_key` is set and valid
- Check `journalctl -u gandalf-monitor` for `Ticket creation failed` lines
- Confirm the Tinker Tickets API is reachable from LXC 157

### Link Debug shows no data / "Loading…" forever

- Check `gandalf-monitor.service` is running and has completed at least one cycle
- Check `journalctl -u gandalf-monitor` for Prometheus or UniFi errors
- Verify Prometheus is reachable: `curl http://10.10.10.48:9090/api/v1/query?query=up`

### Link Debug: SFP DOM panel missing

- SFP data requires Pulse worker + SSH access to hosts
- Verify `config.json → pulse.*` is configured and the Pulse worker is running
- Confirm `sshpass` + SSH access from the Pulse worker to each Proxmox host
- Only interfaces with physical SFP modules return DOM data (`ethtool -m`)

### Inspector: path debug section not appearing

- Requires LLDP: run `apt install lldpd && systemctl enable --now lldpd` on each server
- The LLDP `system_name` broadcast by `lldpd` must match the hostname in `config.json → hosts[].name`
- Override: `echo 'configure system hostname large1' > /etc/lldpd.d/hostname.conf && systemctl restart lldpd`
- Allow up to 2 poll cycles (240s) after installing lldpd for the LLDP table to populate

### Inspector: switch chassis shows as flat list (no layout)

- The switch's `model` field from UniFi doesn't match any key in `SWITCH_LAYOUTS` in `inspector.html`
- Check the UniFi API: the model appears in the `link_stats` API
response under `unifi_switches..model`
- Add the model key to `SWITCH_LAYOUTS` in `inspector.html` with the correct row/SFP layout

### Baseline re-initializing on every restart

- `interface_baseline` is stored in the `monitor_state` DB table; survives restarts
- If it appears to reset: check DB connectivity from the monitor daemon

### Interface stuck at "initial_down" forever

- This means the interface was down when the monitor first saw it
- It will begin tracking once it comes up; or manually clear it:

```sql
-- In MariaDB on 10.10.10.50:
UPDATE monitor_state SET value='{}' WHERE key_name='interface_baseline';
```

Then restart the monitor: `systemctl restart gandalf-monitor`

### Prometheus data missing for a host

```bash
# On the affected host:
systemctl status prometheus-node-exporter

# Verify it's scraped:
curl http://10.10.10.48:9090/api/v1/query?query=up | jq '.data.result[] | select(.metric.job=="node")'
```

---

## Development Notes

### File Layout

```
gandalf/
├── app.py              # Flask web app (routes, auth, API endpoints)
├── monitor.py          # Background daemon (Prometheus, UniFi, Pulse, alert logic)
├── db.py               # Database operations (MariaDB via pymysql, thread-local conn reuse)
├── schema.sql          # Database schema (network_events, suppression_rules, monitor_state)
├── config.json         # Runtime configuration (not committed with secrets)
├── requirements.txt    # Python dependencies
├── static/
│   ├── style.css       # Terminal aesthetic CSS (CRT scanlines, green-on-black)
│   └── app.js          # Dashboard JS (auto-refresh, host grid, events, suppress modal)
└── templates/
    ├── base.html           # Shared layout (header, nav, footer)
    ├── index.html          # Dashboard page
    ├── links.html          # Link Debug page (server NICs + UniFi switch ports)
    ├── inspector.html      # Visual switch inspector + LLDP path debug
    └── suppressions.html   # Suppression management page
```

### Adding a New Monitored Host

1. Install `prometheus-node-exporter` on the host
2. Add a scrape target to Prometheus config
3.
Add an entry to `config.json → hosts`:

   ```json
   { "name": "newhost", "prometheus_instance": "10.10.10.X:9100" }
   ```

4. Restart monitor: `systemctl restart gandalf-monitor`
5. For SFP DOM / ethtool: ensure the host is SSH-accessible from the Pulse worker

### Adding a New Switch Layout (Inspector)

Find the UniFi model code for the switch (it appears in the `/api/links` JSON response under `unifi_switches..model`), then add to `SWITCH_LAYOUTS` in `templates/inspector.html`:

```javascript
'MYNEWMODEL': {
    rows: [[1,2,3,4,5,6,7,8], [9,10,11,12,13,14,15,16]],  // port_idx by row
    sfp_section: [17, 18],  // separate SFP cage ports (rendered below rows)
    sfp_ports: [],          // port_idx values that are SFP-type within rows
},
```

### Database Schema Notes

- `network_events`: one row per active event; `resolved_at` is set when recovered
- `suppression_rules`: `active=FALSE` when removed; `expires_at` checked at query time
- `monitor_state`: key/value store; `interface_baseline` and `link_stats` are JSON blobs

### Security Notes

- **XSS prevention**: all user-controlled data in dynamically generated HTML uses `escHtml()` (JS) or Jinja2 auto-escaping (Python). Suppress buttons use `data-*` attributes + a single delegated click listener rather than inline `onclick` with interpolated strings.
- **Interface name validation**: `monitor.py` validates SSH interface names against `^[a-zA-Z0-9_.@-]+$` before use, and additionally wraps them with `shlex.quote()` for defense-in-depth.
- **DB parameters**: all SQL uses parameterised queries via pymysql — no string concatenation into SQL.
- **Auth**: Authelia enforces admin-only access at the nginx/LXC 167 layer; the Flask app additionally checks the `Remote-User` header via `@require_auth`.

### Known Limitations

- Single gunicorn worker (`--workers 1`) — required because `db.py` uses thread-local connection reuse (one connection per thread).
Multiple workers would each have their own connection, which is fine, but the thread-local optimisation only helps within one worker.
- No CSRF tokens on API endpoints — mitigated by Authelia session cookies being `SameSite=Strict` and the site being admin-only.
- SSH collection via Pulse is synchronous — if Pulse is slow, the entire monitor cycle is delayed. The `pulse.timeout` config controls the max wait.
- UniFi LLDP data is only as fresh as the last monitor poll (120s default).
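The interface-name validation described under Security Notes can be sketched as follows. This is a minimal illustration, not the actual `monitor.py` code — the function name `safe_iface_arg` is hypothetical; the regex is the one quoted above, and `shlex.quote()` is the stated second layer:

```python
import re
import shlex

# Regex from the Security Notes: interface names may only contain these characters.
IFACE_RE = re.compile(r"^[a-zA-Z0-9_.@-]+$")

def safe_iface_arg(name):
    """Validate an interface name, then shell-quote it before use in an SSH command."""
    if not IFACE_RE.match(name):
        raise ValueError(f"invalid interface name: {name!r}")
    # Defense-in-depth: a name passing the regex contains no shell metacharacters,
    # so quoting is a no-op, but it guards against future regex loosening.
    return shlex.quote(name)

print(safe_iface_arg("enp5s0"))  # -> enp5s0
# safe_iface_arg("eth0; rm -rf /") raises ValueError
```

Validation-then-quoting means a malicious "interface name" arriving via LLDP or the UniFi API can never reach the remote `ethtool` invocation as shell syntax.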