
GANDALF (Global Advanced Network Detection And Link Facilitator)

Because it shall not let problems pass.

Network monitoring dashboard for the LotusGuild Proxmox cluster. Deployed on LXC 157 (monitor-02 / 10.10.10.9), reachable at gandalf.lotusguild.org.

Design System: web_template — shared CSS, JS, and layout patterns for all LotusGuild apps

Styling & Layout

GANDALF uses the LotusGuild Terminal Design System. For all styling, component, and layout documentation see:


Architecture

Two processes share a MariaDB database:

Process     Service                  Role
app.py      gandalf.service          Flask web dashboard (gunicorn, port 8000)
monitor.py  gandalf-monitor.service  Background polling daemon
[Prometheus :9090]  ──▶
[UniFi Controller]  ──▶  monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[Pulse Worker]      ──▶
[SSH / ethtool]     ──▶

Data Sources

Source                          What it provides
Prometheus (10.10.10.48:9090)   Physical NIC link state + traffic/error rates via node_exporter
UniFi API (https://10.10.10.1)  Switch port stats, device status, LLDP neighbor table, PoE data
Pulse Worker                    SSH relay — runs ethtool + SFP DOM queries on each Proxmox host
Ping                            Reachability for hosts without node_exporter (e.g. PBS)
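
As an illustration of how the Prometheus source is consumed, here is a minimal sketch. The endpoint and metric match the table above; the helper names are ours, not taken from monitor.py:

```python
import json
from urllib.request import urlopen

PROM_URL = "http://10.10.10.48:9090"  # prometheus.url in config.json

def parse_up_targets(api_json: dict) -> dict:
    """Map each instance label to reachability from a query=up response."""
    return {
        r["metric"]["instance"]: r["value"][1] == "1"
        for r in api_json.get("data", {}).get("result", [])
    }

def fetch_up(prom_url: str = PROM_URL) -> dict:
    """Query Prometheus for the `up` metric of every scraped target."""
    with urlopen(f"{prom_url}/api/v1/query?query=up") as resp:
        return parse_up_targets(json.load(resp))
```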

Monitored Hosts (Prometheus / node_exporter)

Host                    Prometheus Instance
large1                  10.10.10.2:9100
compute-storage-01      10.10.10.4:9100
micro1                  10.10.10.8:9100
monitor-02              10.10.10.9:9100
compute-storage-gpu-01  10.10.10.10:9100
storage-01              10.10.10.11:9100

Ping-only (no node_exporter): pbs (10.10.10.3)
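
The ping-only check reduces to a single ICMP echo with a short timeout. A sketch, assuming Linux iputils flags (function names are ours):

```python
import subprocess

def ping_cmd(ip: str, timeout_s: int = 2) -> list:
    """Command line for one ICMP echo with a short reply timeout."""
    return ["ping", "-c", "1", "-W", str(timeout_s), ip]

def ping_reachable(ip: str) -> bool:
    """True if the host answered a single ping."""
    res = subprocess.run(ping_cmd(ip), stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
    return res.returncode == 0
```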


Pages

Dashboard (/)

  • Real-time host status grid with per-NIC link state (UP / DOWN / degraded)
  • Network topology diagram (Internet → Gateway → Switches → Hosts)
  • UniFi device table (switches, APs, gateway)
  • Active alerts table with severity, target, consecutive failures, ticket link
  • Quick-suppress modal: apply timed or manual suppression from any alert row
  • Auto-refreshes every 30 seconds via /api/status + /api/network

Link Debug (/links)

Per-interface statistics, collected every poll cycle. All panels are collapsible (click a header, or use Collapse All / Expand All). Collapse state persists across page refreshes via sessionStorage.

Server NICs (via Prometheus + SSH/ethtool):

  • Speed, duplex, auto-negotiation, link detected
  • TX/RX rate bars (bandwidth utilisation % of link capacity)
  • TX/RX error and drop rates per second
  • Carrier changes (cumulative since boot — watch for flapping)
  • SFP / Optical panel (when SFP module present): vendor/PN, temp, voltage, bias current, TX power (dBm), RX power (dBm), RXTX delta, per-stat bars
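
The TX/RX rate bars reduce to a percentage-of-capacity computation; a sketch (helper name is ours):

```python
def utilisation_pct(rate_bytes_per_s: float, link_speed_mbps: float) -> float:
    """Bandwidth utilisation as a % of link capacity, clamped to 100."""
    capacity_bytes_per_s = link_speed_mbps * 1_000_000 / 8  # Mbit/s -> bytes/s
    if capacity_bytes_per_s <= 0:
        return 0.0
    return min(100.0, 100.0 * rate_bytes_per_s / capacity_bytes_per_s)
```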

UniFi Switch Ports (via UniFi API):

  • Port number badge (#N), UPLINK badge, PoE draw badge
  • LLDP neighbor line: → system_name (port_id) when neighbor is detected
  • PoE class and max wattage line
  • Speed, duplex, auto-neg, TX/RX rates, errors, drops

Inspector (/inspector)

Visual switch chassis diagrams. Each switch is rendered model-accurately using layout config in the template (SWITCH_LAYOUTS).

Port block colours:

Colour         State
Green          Up, no active PoE
Amber          Up with active PoE draw
Cyan           Uplink port (up)
Grey           Down
White outline  Currently selected

Clicking a port opens the right-side detail panel showing:

  • Link stats (status, speed, duplex, auto-neg, media type)
  • PoE (class, max wattage, current draw, mode)
  • Traffic (TX/RX rates)
  • Errors/drops per second
  • LLDP Neighbor section (system name, port ID, chassis ID, management IPs)
  • Path Debug (auto-appears when LLDP system_name matches a known server): two-column comparison of the switch port stats vs. the server NIC stats, including SFP DOM data if the server side has an SFP module
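
The trigger for the Path Debug panel is a plain name match against the config.json hosts list (shown later in this README); a sketch of that lookup (helper name is ours):

```python
def match_lldp_neighbor(system_name: str, hosts: list):
    """Return the config.json hosts[] entry whose name equals the
    LLDP system_name, or None (in which case no Path Debug panel appears)."""
    return next((h for h in hosts if h["name"] == system_name), None)
```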

LLDP path debug requirements:

  1. Server must run lldpd: apt install lldpd && systemctl enable --now lldpd
  2. lldpd hostname must match the key in data.hosts (set via config.json → hosts)
  3. Switch has LLDP enabled (UniFi default: on)

Supported switch models (set SWITCH_LAYOUTS keys to your UniFi model codes):

Key      Model                    Layout
USF5P    UniFi Switch Flex 5 PoE  4×RJ45 + 1×SFP uplink
USL8A    UniFi Switch Lite 8 PoE  8×SFP (2 rows of 4)
US24PRO  UniFi Switch Pro 24      24×RJ45 staggered + 2×SFP
USPPDUP  Custom/other             Single-port fallback
USMINI   UniFi Switch Mini        5-port row

Add new layouts by adding a key to SWITCH_LAYOUTS matching the model field returned by the UniFi API for that device.

Suppressions (/suppressions)

  • Create timed (30 min / 1 hr / 4 hr / 8 hr) or manual suppressions
  • Target types: host, interface, UniFi device, or global
  • Active suppressions table with one-click removal
  • Suppression history (last 50)
  • Available targets reference grid (all known hosts + interfaces)

Alert Logic

Ticket Triggers

Condition                                                      Priority
UniFi device offline (≥2 consecutive checks)                   P2 High
Proxmox host NIC link-down regression (≥2 consecutive checks)  P2 High
Host unreachable via ping (≥2 consecutive checks)              P2 High
≥3 hosts simultaneously reporting interface failures           P1 Critical
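
The trigger table amounts to two thresholds, both configurable (monitor.failure_threshold and monitor.cluster_threshold). A sketch, not the actual monitor.py logic:

```python
def ticket_priority(consecutive_failures: int, hosts_failing: int,
                    failure_threshold: int = 2, cluster_threshold: int = 3):
    """P1 when enough hosts fail at once, P2 after repeated failures
    on one target, otherwise no ticket."""
    if hosts_failing >= cluster_threshold:
        return "P1"
    if consecutive_failures >= failure_threshold:
        return "P2"
    return None
```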

Baseline Tracking

Interfaces that are down on first observation (unused ports, unplugged cables) are recorded as initial_down and never alerted. Only UP→DOWN regressions generate tickets. Baseline is stored in MariaDB and survives daemon restarts.
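
A sketch of the baseline rule just described (state names come from this README; the function is ours, operating on the baseline dict that the monitor persists to MariaDB):

```python
def classify_interface(name: str, is_up: bool, baseline: dict) -> str:
    """Apply the UP→DOWN regression rule against a persisted baseline."""
    if name not in baseline:
        # First observation: a down port is recorded, never alerted
        baseline[name] = "up" if is_up else "initial_down"
        return "ok" if is_up else "initial_down"
    if baseline[name] == "initial_down":
        if is_up:
            baseline[name] = "up"  # begins normal tracking from now on
            return "ok"
        return "initial_down"
    # Baseline says up: a down reading is an UP→DOWN regression
    return "ok" if is_up else "regression"
```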

Suppression Targets

Type          Suppresses
host          All interface alerts for a named host
interface     A specific NIC on a specific host
unifi_device  A specific UniFi device
all           Everything (global maintenance mode)

Suppressions can be manual (persist until removed) or timed (auto-expire). Expired suppressions are checked at evaluation time — no background cleanup needed.
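
The evaluate-at-check-time design is simple enough to show; a sketch over a suppression_rules row represented as a dict (field names match the schema notes later, but the exact row shape is an assumption):

```python
from datetime import datetime, timezone

def is_suppressed(rule: dict, now=None) -> bool:
    """Expiry is evaluated here, at check time; expired rows
    need no background cleanup job."""
    if not rule.get("active", False):
        return False
    expires_at = rule.get("expires_at")  # None means a manual suppression
    if expires_at is None:
        return True
    return expires_at > (now or datetime.now(timezone.utc))
```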


Configuration (config.json)

Shared by both processes. Located in the working directory (/var/www/html/prod/).

{
  "database": {
    "host": "10.10.10.50",
    "port": 3306,
    "user": "gandalf",
    "password": "...",
    "name": "gandalf"
  },
  "prometheus": {
    "url": "http://10.10.10.48:9090"
  },
  "unifi": {
    "controller": "https://10.10.10.1",
    "api_key": "...",
    "site_id": "default"
  },
  "ticket_api": {
    "url": "https://t.lotusguild.org/api/tickets",
    "api_key": "..."
  },
  "pulse": {
    "url": "http://<pulse-host>:<port>",
    "api_key": "...",
    "worker_id": "...",
    "timeout": 45
  },
  "auth": {
    "allowed_groups": ["admin"]
  },
  "hosts": [
    { "name": "large1",               "prometheus_instance": "10.10.10.2:9100" },
    { "name": "compute-storage-01",   "prometheus_instance": "10.10.10.4:9100" },
    { "name": "micro1",               "prometheus_instance": "10.10.10.8:9100" },
    { "name": "monitor-02",           "prometheus_instance": "10.10.10.9:9100" },
    { "name": "compute-storage-gpu-01", "prometheus_instance": "10.10.10.10:9100" },
    { "name": "storage-01",           "prometheus_instance": "10.10.10.11:9100" }
  ],
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [
      { "name": "pbs", "ip": "10.10.10.3" }
    ]
  }
}

Key Config Fields

Key                        Description
database.*                 MariaDB credentials (LXC 149 at 10.10.10.50)
prometheus.url             Prometheus base URL
unifi.controller           UniFi controller base URL (HTTPS, self-signed cert ignored)
unifi.api_key              UniFi API key from controller Settings → API
unifi.site_id              UniFi site ID (default: default)
ticket_api.api_key         Tinker Tickets bearer token
pulse.url                  Pulse worker API base URL (for SSH relay)
pulse.worker_id            Which Pulse worker runs ethtool collection
pulse.timeout              Max seconds to wait for SSH collection per host
auth.allowed_groups        Authelia groups that may access GANDALF
hosts                      Maps Prometheus instance labels → display hostnames
monitor.poll_interval      Seconds between full check cycles (default: 120)
monitor.failure_threshold  Consecutive failures before creating a ticket (default: 2)
monitor.cluster_threshold  Hosts with failures to trigger cluster-wide P1 (default: 3)
monitor.ping_hosts         Hosts checked only by ping (no node_exporter)
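
Both processes read the same file from their working directory. A loader that also applies the documented defaults might look like this (a sketch, not the project's actual loader):

```python
import json
from pathlib import Path

CONFIG_PATH = Path("/var/www/html/prod/config.json")  # working directory of both services

def load_config(path: Path = CONFIG_PATH) -> dict:
    cfg = json.loads(path.read_text())
    mon = cfg.setdefault("monitor", {})
    mon.setdefault("poll_interval", 120)    # seconds between check cycles
    mon.setdefault("failure_threshold", 2)  # consecutive failures before a ticket
    mon.setdefault("cluster_threshold", 3)  # failing hosts for a cluster-wide P1
    return cfg
```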

Deployment (LXC 157)

1. Database — MariaDB LXC 149 (10.10.10.50)

CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;

Import schema:

mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql

2. LXC 157 — Install dependencies

pip3 install -r requirements.txt
# Ensure sshpass is available (used by deploy scripts)
apt install sshpass

3. Deploy files

# From dev machine / root/code/gandalf:
for f in app.py db.py monitor.py config.json schema.sql \
          static/style.css static/app.js \
          templates/*.html; do
  sshpass -p 'yourpass' scp -o StrictHostKeyChecking=no \
    "$f" "root@10.10.10.61:/var/www/html/prod/$f"
done
systemctl restart gandalf gandalf-monitor

4. systemd services

gandalf.service (Flask/gunicorn web app):

[Unit]
Description=Gandalf Web Dashboard
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
Restart=always

[Install]
WantedBy=multi-user.target

gandalf-monitor.service (background polling daemon):

[Unit]
Description=Gandalf Network Monitor Daemon
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 monitor.py
Restart=always

[Install]
WantedBy=multi-user.target

5. Authelia rule (LXC 167)

access_control:
  rules:
    - domain: gandalf.lotusguild.org
      policy: one_factor
      subject:
        - group:admin

systemctl restart authelia

6. NPM reverse proxy

  • Domain: gandalf.lotusguild.org
  • Forward to: http://10.10.10.61:8000 (gunicorn direct, no nginx needed on LXC)
  • Forward Auth: Authelia at http://10.10.10.167:9091
  • WebSockets: Not required

Service Management

# Status
systemctl status gandalf gandalf-monitor

# Logs (live)
journalctl -u gandalf -f
journalctl -u gandalf-monitor -f

# Restart after code or config changes
systemctl restart gandalf gandalf-monitor

Troubleshooting

Monitor not creating tickets

  • Verify config.json → ticket_api.api_key is set and valid
  • Check journalctl -u gandalf-monitor for Ticket creation failed lines
  • Confirm the Tinker Tickets API is reachable from LXC 157
  • Check that gandalf-monitor.service is running and has completed at least one cycle
  • Check journalctl -u gandalf-monitor for Prometheus or UniFi errors
  • Verify Prometheus is reachable: curl http://10.10.10.48:9090/api/v1/query?query=up

No SFP / DOM data

  • SFP data requires the Pulse worker + SSH access to hosts
  • Verify config.json → pulse.* is configured and the Pulse worker is running
  • Confirm sshpass + SSH access from the Pulse worker to each Proxmox host
  • Only interfaces with physical SFP modules return DOM data (ethtool -m)

Inspector: path debug section not appearing

  • Requires LLDP: run apt install lldpd && systemctl enable --now lldpd on each server
  • The LLDP system_name broadcast by lldpd must match the hostname in config.json → hosts[].name
    • Override: echo 'configure system hostname large1' > /etc/lldpd.d/hostname.conf && systemctl restart lldpd
  • Allow up to 2 poll cycles (240s) after installing lldpd for LLDP table to populate

Inspector: switch chassis shows as flat list (no layout)

  • The switch's model field from UniFi doesn't match any key in SWITCH_LAYOUTS in inspector.html
  • Check the UniFi API: the model appears in the link_stats API response under unifi_switches.<name>.model
  • Add the model key to SWITCH_LAYOUTS in inspector.html with the correct row/SFP layout

Baseline re-initializing on every restart

  • interface_baseline is stored in the monitor_state DB table; survives restarts
  • If it appears to reset: check DB connectivity from the monitor daemon

Interface stuck at "initial_down" forever

  • This means the interface was down when the monitor first saw it
  • It will begin tracking once it comes up; or manually clear it:
    -- In MariaDB on 10.10.10.50:
    UPDATE monitor_state SET value='{}' WHERE key_name='interface_baseline';
    
    Then restart the monitor: systemctl restart gandalf-monitor

Prometheus data missing for a host

# On the affected host:
systemctl status prometheus-node-exporter
# Verify it's scraped:
curl http://10.10.10.48:9090/api/v1/query?query=up | jq '.data.result[] | select(.metric.job=="node")'

Development Notes

File Layout

gandalf/
├── app.py              # Flask web app (routes, auth, API endpoints)
├── monitor.py          # Background daemon (Prometheus, UniFi, Pulse, alert logic)
├── db.py               # Database operations (MariaDB via pymysql, thread-local conn reuse)
├── schema.sql          # Database schema (network_events, suppression_rules, monitor_state)
├── config.json         # Runtime configuration (not committed with secrets)
├── requirements.txt    # Python dependencies
├── static/
│   ├── style.css       # Terminal aesthetic CSS (CRT scanlines, green-on-black)
│   └── app.js          # Dashboard JS (auto-refresh, host grid, events, suppress modal)
└── templates/
    ├── base.html       # Shared layout (header, nav, footer)
    ├── index.html      # Dashboard page
    ├── links.html      # Link Debug page (server NICs + UniFi switch ports)
    ├── inspector.html  # Visual switch inspector + LLDP path debug
    └── suppressions.html # Suppression management page

Adding a New Monitored Host

  1. Install prometheus-node-exporter on the host
  2. Add a scrape target to Prometheus config
  3. Add an entry to config.json → hosts:
    { "name": "newhost", "prometheus_instance": "10.10.10.X:9100" }
    
  4. Restart monitor: systemctl restart gandalf-monitor
  5. For SFP DOM / ethtool: ensure the host is SSH-accessible from the Pulse worker

Adding a New Switch Layout (Inspector)

Find the UniFi model code for the switch (it appears in the /api/links JSON response under unifi_switches.<switch_name>.model), then add to SWITCH_LAYOUTS in templates/inspector.html:

'MYNEWMODEL': {
  rows: [[1,2,3,4,5,6,7,8], [9,10,11,12,13,14,15,16]],  // port_idx by row
  sfp_section: [17, 18],  // separate SFP cage ports (rendered below rows)
  sfp_ports: [],          // port_idx values that are SFP-type within rows
},

Database Schema Notes

  • network_events: one row per active event; resolved_at is set when recovered
  • suppression_rules: active=FALSE when removed; expires_at checked at query time
  • monitor_state: key/value store; interface_baseline and link_stats are JSON blobs
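
Writes to the monitor_state blobs are naturally an upsert with JSON-encoded values. A sketch that only builds the parameterised statement (column names come from the notes above; the exact schema details are assumed):

```python
import json

def state_upsert(key_name: str, value: dict):
    """Parameterised upsert for the monitor_state key/value table.
    Values always go through %s placeholders, never string
    concatenation (see Security Notes)."""
    sql = ("INSERT INTO monitor_state (key_name, value) VALUES (%s, %s) "
           "ON DUPLICATE KEY UPDATE value = VALUES(value)")
    return sql, (key_name, json.dumps(value))
```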

Security Notes

  • XSS prevention: all user-controlled data in dynamically generated HTML uses escHtml() (JS) or Jinja2 auto-escaping (Python). Suppress buttons use data-* attributes + a single delegated click listener rather than inline onclick with interpolated strings.
  • Interface name validation: monitor.py validates SSH interface names against ^[a-zA-Z0-9_.@-]+$ before use, and additionally wraps them with shlex.quote() for defense-in-depth.
  • DB parameters: all SQL uses parameterised queries via pymysql — no string concatenation into SQL.
  • Auth: Authelia enforces admin-only access at the nginx/LXC 167 layer; the Flask app additionally checks the Remote-User header via @require_auth.
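
The interface-name rule is small enough to show in full; a sketch equivalent to what monitor.py is described as doing (the regex is the one stated above; the function name is ours):

```python
import re
import shlex

IFACE_RE = re.compile(r"^[a-zA-Z0-9_.@-]+$")

def safe_iface(name: str) -> str:
    """Validate an interface name before it reaches an SSH command line,
    then quote it anyway (defense-in-depth)."""
    if not IFACE_RE.fullmatch(name):
        raise ValueError(f"invalid interface name: {name!r}")
    return shlex.quote(name)
```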

Known Limitations

  • Single gunicorn worker (--workers 1) — chosen because db.py uses thread-local connection reuse (one connection per thread). Multiple workers would each hold their own connection, which is safe, but the thread-local optimisation only helps within a single worker.
  • No CSRF tokens on API endpoints — mitigated by Authelia session cookies being SameSite=Strict and the site being admin-only.
  • SSH collection via Pulse is synchronous — if Pulse is slow, the entire monitor cycle is delayed. The pulse.timeout config controls the max wait.
  • UniFi LLDP data is only as fresh as the last monitor poll (120s default).