
GANDALF (Global Advanced Network Detection And Link Facilitator)

Because it shall not let problems pass.

Network monitoring dashboard for the LotusGuild Proxmox cluster. Deployed on LXC 157 (monitor-02 / 10.10.10.9), reachable at gandalf.lotusguild.org.

Design System: web_template — shared CSS, JS, and layout patterns for all LotusGuild apps

Styling & Layout

GANDALF uses the LotusGuild Terminal Design System. For all styling, component, and layout documentation, see the web_template design-system docs.


Architecture

Two processes share a MariaDB database:

Process     Service                  Role
app.py      gandalf.service          Flask web dashboard (gunicorn, port 8000)
monitor.py  gandalf-monitor.service  Background polling daemon
[Prometheus :9090]  ──▶
[UniFi Controller]  ──▶  monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[Pulse Worker]      ──▶
[SSH / ethtool]     ──▶

Data Sources

Source                          What it provides
Prometheus (10.10.10.48:9090)   Physical NIC link state + traffic/error rates via node_exporter
UniFi API (https://10.10.10.1)  Switch port stats, device status, LLDP neighbor table, PoE data
Pulse Worker                    SSH relay — runs ethtool + SFP DOM queries on each Proxmox host
Ping                            Reachability for hosts without node_exporter (e.g. PBS)
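As an illustration of the Prometheus source: link state for every physical NIC is available via node_exporter's node_network_up metric. A minimal sketch of the query-and-parse step (function names are hypothetical, not the actual monitor.py API):

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://10.10.10.48:9090"  # prometheus.url from config.json

def query_prometheus(promql: str, base_url: str = PROM_URL) -> list[dict]:
    """Run an instant query and return the result vector."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body}")
    return body["data"]["result"]

def link_states(result: list[dict]) -> dict[tuple[str, str], bool]:
    """Map (instance, device) -> link up? from a node_network_up vector."""
    return {
        (s["metric"].get("instance", "?"), s["metric"].get("device", "?")):
            s["value"][1] == "1"
        for s in result
    }
```

Usage would be along the lines of `link_states(query_prometheus('node_network_up{device!~"lo"}'))`, yielding an up/down flag per host interface.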

Monitored Hosts (Prometheus / node_exporter)

Host                    Prometheus Instance
large1                  10.10.10.2:9100
compute-storage-01      10.10.10.4:9100
micro1                  10.10.10.8:9100
monitor-02              10.10.10.9:9100
compute-storage-gpu-01  10.10.10.10:9100
storage-01              10.10.10.11:9100

Ping-only (no node_exporter): pbs (10.10.10.3)


Pages

Dashboard (/)

  • Real-time host status grid with per-NIC link state (UP / DOWN / degraded)
  • Network topology diagram (Internet → Gateway → Switches → Hosts)
  • UniFi device table (switches, APs, gateway)
  • Active alerts table with severity, target, consecutive failures, ticket link
  • Quick-suppress modal: apply timed or manual suppression from any alert row
  • Auto-refreshes every 30 seconds via /api/status + /api/network

Link Debug (/links)

Per-interface statistics collected every poll cycle. All panels are collapsible (click a header, or use Collapse All / Expand All). Collapse state persists across page refreshes via sessionStorage.

Server NICs (via Prometheus + SSH/ethtool):

  • Speed, duplex, auto-negotiation, link detected
  • TX/RX rate bars (bandwidth utilisation % of link capacity)
  • TX/RX error and drop rates per second
  • Carrier changes (cumulative since boot — watch for flapping)
  • SFP / Optical panel (when an SFP module is present): vendor/PN, temperature, voltage, bias current, TX power (dBm), RX power (dBm), TX/RX delta, per-stat bars
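ethtool reports optical power in milliwatts as well as dBm, so the dBm figures and the TX/RX delta shown on the panel can be derived with a couple of one-liners. A minimal sketch (helper names are hypothetical):

```python
import math

def mw_to_dbm(power_mw: float) -> float:
    """Convert optical power from milliwatts to dBm (0 dBm == 1 mW)."""
    if power_mw <= 0:
        return float("-inf")  # no light, or reading unavailable
    return 10.0 * math.log10(power_mw)

def tx_rx_delta(tx_dbm: float, rx_dbm: float) -> float:
    """Rough link-loss estimate: TX power minus RX power, in dB."""
    return tx_dbm - rx_dbm
```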

UniFi Switch Ports (via UniFi API):

  • Port number badge (#N), UPLINK badge, PoE draw badge
  • LLDP neighbor line: → system_name (port_id) when neighbor is detected
  • PoE class and max wattage line
  • Speed, duplex, auto-neg, TX/RX rates, errors, drops

Inspector (/inspector)

Visual switch chassis diagrams. Each switch is rendered model-accurately using layout config in the template (SWITCH_LAYOUTS).

Port block colours:

Colour         State
Green          Up, no active PoE
Amber          Up with active PoE draw
Cyan           Uplink port (up)
Grey           Down
White outline  Currently selected

Clicking a port opens the right-side detail panel showing:

  • Link stats (status, speed, duplex, auto-neg, media type)
  • PoE (class, max wattage, current draw, mode)
  • Traffic (TX/RX rates)
  • Errors/drops per second
  • LLDP Neighbor section (system name, port ID, chassis ID, management IPs)
  • Path Debug (auto-appears when LLDP system_name matches a known server): two-column comparison of the switch port stats vs. the server NIC stats, including SFP DOM data if the server side has an SFP module

LLDP path debug requirements:

  1. Server must run lldpd: apt install lldpd && systemctl enable --now lldpd
  2. lldpd hostname must match the key in data.hosts (set via config.json → hosts)
  3. Switch has LLDP enabled (UniFi default: on)

Supported switch models (set SWITCH_LAYOUTS keys to your UniFi model codes):

Key      Model                    Layout
USF5P    UniFi Switch Flex 5 PoE  4×RJ45 + 1×SFP uplink
USL8A    UniFi Switch Lite 8 PoE  8×SFP (2 rows of 4)
US24PRO  UniFi Switch Pro 24      24×RJ45 staggered + 2×SFP
USPPDUP  Custom/other             Single-port fallback
USMINI   UniFi Switch Mini        5-port row

Add new layouts by adding a key to SWITCH_LAYOUTS matching the model field returned by the UniFi API for that device.

Suppressions (/suppressions)

  • Create timed (30 min / 1 hr / 4 hr / 8 hr) or manual suppressions
  • Target types: host, interface, UniFi device, or global
  • Active suppressions table with one-click removal
  • Suppression history (last 50)
  • Available targets reference grid (all known hosts + interfaces)

Alert Logic

Ticket Triggers

Condition                                                      Priority
UniFi device offline (≥2 consecutive checks)                   P2 High
Proxmox host NIC link-down regression (≥2 consecutive checks)  P2 High
Host unreachable via ping (≥2 consecutive checks)              P2 High
≥3 hosts simultaneously reporting interface failures           P1 Critical
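The consecutive-check thresholds above can be sketched as a small streak counter (hypothetical names; the real logic lives in monitor.py and persists its state in MariaDB):

```python
from collections import defaultdict

FAILURE_THRESHOLD = 2   # monitor.failure_threshold
CLUSTER_THRESHOLD = 3   # monitor.cluster_threshold

class FailureTracker:
    """Count consecutive failures per target; fire once the threshold is met."""

    def __init__(self, threshold: int = FAILURE_THRESHOLD):
        self.threshold = threshold
        self.streak: dict[str, int] = defaultdict(int)

    def observe(self, target: str, ok: bool) -> bool:
        """Record one check; return True exactly when a ticket should open."""
        if ok:
            self.streak[target] = 0
            return False
        self.streak[target] += 1
        # Fire only on the cycle the threshold is first reached, not every cycle.
        return self.streak[target] == self.threshold

def cluster_alert(failing_hosts: set[str], threshold: int = CLUSTER_THRESHOLD) -> bool:
    """Escalate to P1 when enough hosts report failures simultaneously."""
    return len(failing_hosts) >= threshold
```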

Baseline Tracking

Interfaces that are down on first observation (unused ports, unplugged cables) are recorded as initial_down and never alerted. Only UP→DOWN regressions generate tickets. Baseline is stored in MariaDB and survives daemon restarts.

Suppression Targets

Type          Suppresses
host          All interface alerts for a named host
interface     A specific NIC on a specific host
unifi_device  A specific UniFi device
all           Everything (global maintenance mode)

Suppressions can be manual (persist until removed) or timed (auto-expire). Expired suppressions are checked at evaluation time — no background cleanup needed.


Configuration (config.json)

Shared by both processes. Located in the working directory (/var/www/html/prod/).

{
  "database": {
    "host": "10.10.10.50",
    "port": 3306,
    "user": "gandalf",
    "password": "...",
    "name": "gandalf"
  },
  "prometheus": {
    "url": "http://10.10.10.48:9090"
  },
  "unifi": {
    "controller": "https://10.10.10.1",
    "api_key": "...",
    "site_id": "default"
  },
  "ticket_api": {
    "url": "https://t.lotusguild.org/api/tickets",
    "api_key": "..."
  },
  "pulse": {
    "url": "http://<pulse-host>:<port>",
    "api_key": "...",
    "worker_id": "...",
    "timeout": 45
  },
  "auth": {
    "allowed_groups": ["admin"]
  },
  "hosts": [
    { "name": "large1",               "prometheus_instance": "10.10.10.2:9100" },
    { "name": "compute-storage-01",   "prometheus_instance": "10.10.10.4:9100" },
    { "name": "micro1",               "prometheus_instance": "10.10.10.8:9100" },
    { "name": "monitor-02",           "prometheus_instance": "10.10.10.9:9100" },
    { "name": "compute-storage-gpu-01", "prometheus_instance": "10.10.10.10:9100" },
    { "name": "storage-01",           "prometheus_instance": "10.10.10.11:9100" }
  ],
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [
      { "name": "pbs", "ip": "10.10.10.3" }
    ]
  }
}

Key Config Fields

Key                        Description
database.*                 MariaDB credentials (LXC 149 at 10.10.10.50)
prometheus.url             Prometheus base URL
unifi.controller           UniFi controller base URL (HTTPS, self-signed cert ignored)
unifi.api_key              UniFi API key from controller Settings → API
unifi.site_id              UniFi site ID (default: default)
ticket_api.api_key         Tinker Tickets bearer token
pulse.url                  Pulse worker API base URL (for SSH relay)
pulse.worker_id            Which Pulse worker runs ethtool collection
pulse.timeout              Max seconds to wait for SSH collection per host
auth.allowed_groups        Authelia groups that may access GANDALF
hosts                      Maps Prometheus instance labels → display hostnames
monitor.poll_interval      Seconds between full check cycles (default: 120)
monitor.failure_threshold  Consecutive failures before creating a ticket (default: 2)
monitor.cluster_threshold  Hosts with failures to trigger a cluster-wide P1 (default: 3)
monitor.ping_hosts         Hosts checked only by ping (no node_exporter)
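A minimal sketch of loading this file with the documented monitor defaults applied, plus the instance-label-to-hostname mapping implied by the hosts list (function names are hypothetical):

```python
import json
from pathlib import Path

# Defaults documented above; explicit values in config.json win.
MONITOR_DEFAULTS = {"poll_interval": 120, "failure_threshold": 2, "cluster_threshold": 3}

def load_config(path: str = "config.json") -> dict:
    """Read config.json and merge monitor.* defaults."""
    cfg = json.loads(Path(path).read_text())
    cfg["monitor"] = {**MONITOR_DEFAULTS, **cfg.get("monitor", {})}
    return cfg

def instance_map(cfg: dict) -> dict[str, str]:
    """Prometheus instance label -> display hostname (the hosts list, inverted)."""
    return {h["prometheus_instance"]: h["name"] for h in cfg.get("hosts", [])}
```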

Deployment (LXC 157)

1. Database — MariaDB LXC 149 (10.10.10.50)

CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;

Import schema:

mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql

2. LXC 157 — Install dependencies

pip3 install -r requirements.txt
# Ensure sshpass is available (used by deploy scripts)
apt install sshpass

3. Deploy files

# From the dev machine (/root/code/gandalf):
for f in app.py db.py monitor.py config.json schema.sql \
          static/style.css static/app.js \
          templates/*.html; do
  sshpass -p 'yourpass' scp -o StrictHostKeyChecking=no \
    "$f" "root@10.10.10.61:/var/www/html/prod/$f"
done
systemctl restart gandalf gandalf-monitor

4. systemd services

gandalf.service (Flask/gunicorn web app):

[Unit]
Description=Gandalf Web Dashboard
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
Restart=always

[Install]
WantedBy=multi-user.target

gandalf-monitor.service (background polling daemon):

[Unit]
Description=Gandalf Network Monitor Daemon
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 monitor.py
Restart=always

[Install]
WantedBy=multi-user.target

5. Authelia rule (LXC 167)

access_control:
  rules:
    - domain: gandalf.lotusguild.org
      policy: one_factor
      subject:
        - group:admin

systemctl restart authelia

6. NPM reverse proxy

  • Domain: gandalf.lotusguild.org
  • Forward to: http://10.10.10.61:8000 (gunicorn direct, no nginx needed on LXC)
  • Forward Auth: Authelia at http://10.10.10.167:9091
  • WebSockets: Not required

Service Management

# Status
systemctl status gandalf gandalf-monitor

# Logs (live)
journalctl -u gandalf -f
journalctl -u gandalf-monitor -f

# Restart after code or config changes
systemctl restart gandalf gandalf-monitor

Troubleshooting

Monitor not creating tickets

  • Verify config.json → ticket_api.api_key is set and valid
  • Check journalctl -u gandalf-monitor for Ticket creation failed lines
  • Confirm the Tinker Tickets API is reachable from LXC 157

Dashboard shows no data

  • Check gandalf-monitor.service is running and has completed at least one cycle
  • Check journalctl -u gandalf-monitor for Prometheus or UniFi errors
  • Verify Prometheus is reachable: curl http://10.10.10.48:9090/api/v1/query?query=up

SFP / Optical panel empty

  • SFP data requires the Pulse worker + SSH access to hosts
  • Verify config.json → pulse.* is configured and the Pulse worker is running
  • Confirm sshpass + SSH access from the Pulse worker to each Proxmox host
  • Only interfaces with physical SFP modules return DOM data (ethtool -m)

Inspector: path debug section not appearing

  • Requires LLDP: run apt install lldpd && systemctl enable --now lldpd on each server
  • The LLDP system_name broadcast by lldpd must match the hostname in config.json → hosts[].name
    • Override: echo 'configure system hostname large1' > /etc/lldpd.d/hostname.conf && systemctl restart lldpd
  • Allow up to 2 poll cycles (240s) after installing lldpd for LLDP table to populate

Inspector: switch chassis shows as flat list (no layout)

  • The switch's model field from UniFi doesn't match any key in SWITCH_LAYOUTS in inspector.html
  • Check the UniFi API: the model appears in the link_stats API response under unifi_switches.<name>.model
  • Add the model key to SWITCH_LAYOUTS in inspector.html with the correct row/SFP layout

Baseline re-initializing on every restart

  • interface_baseline is stored in the monitor_state DB table; survives restarts
  • If it appears to reset: check DB connectivity from the monitor daemon

Interface stuck at "initial_down" forever

  • This means the interface was down when the monitor first saw it
  • It will begin tracking once it comes up; or manually clear it:
    -- In MariaDB on 10.10.10.50:
    UPDATE monitor_state SET value='{}' WHERE key_name='interface_baseline';
    
    Then restart the monitor: systemctl restart gandalf-monitor

Prometheus data missing for a host

# On the affected host:
systemctl status prometheus-node-exporter
# Verify it's scraped:
curl http://10.10.10.48:9090/api/v1/query?query=up | jq '.data.result[] | select(.metric.job=="node")'

Development Notes

File Layout

gandalf/
├── app.py              # Flask web app (routes, auth, API endpoints)
├── monitor.py          # Background daemon (Prometheus, UniFi, Pulse, alert logic)
├── db.py               # Database operations (MariaDB via pymysql, thread-local conn reuse)
├── schema.sql          # Database schema (network_events, suppression_rules, monitor_state)
├── config.json         # Runtime configuration (not committed with secrets)
├── requirements.txt    # Python dependencies
├── static/
│   ├── style.css       # Terminal aesthetic CSS (CRT scanlines, green-on-black)
│   └── app.js          # Dashboard JS (auto-refresh, host grid, events, suppress modal)
└── templates/
    ├── base.html       # Shared layout (header, nav, footer)
    ├── index.html      # Dashboard page
    ├── links.html      # Link Debug page (server NICs + UniFi switch ports)
    ├── inspector.html  # Visual switch inspector + LLDP path debug
    └── suppressions.html # Suppression management page

Adding a New Monitored Host

  1. Install prometheus-node-exporter on the host
  2. Add a scrape target to Prometheus config
  3. Add an entry to config.json → hosts:
    { "name": "newhost", "prometheus_instance": "10.10.10.X:9100" }
    
  4. Restart monitor: systemctl restart gandalf-monitor
  5. For SFP DOM / ethtool: ensure the host is SSH-accessible from the Pulse worker

Adding a New Switch Layout (Inspector)

Find the UniFi model code for the switch (it appears in the /api/links JSON response under unifi_switches.<switch_name>.model), then add to SWITCH_LAYOUTS in templates/inspector.html:

'MYNEWMODEL': {
  rows: [[1,2,3,4,5,6,7,8], [9,10,11,12,13,14,15,16]],  // port_idx by row
  sfp_section: [17, 18],  // separate SFP cage ports (rendered below rows)
  sfp_ports: [],          // port_idx values that are SFP-type within rows
},

Database Schema Notes

  • network_events: one row per active event; resolved_at is set when recovered
  • suppression_rules: active=FALSE when removed; expires_at checked at query time
  • monitor_state: key/value store; interface_baseline and link_stats are JSON blobs
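The monitor_state key/value pattern looks roughly like this. Production uses MariaDB via pymysql; sqlite3 is used here purely as a self-contained stand-in (and its ON CONFLICT upsert stands in for MariaDB's ON DUPLICATE KEY UPDATE):

```python
import json
import sqlite3

# Stand-in for the MariaDB monitor_state table described above.
SCHEMA = "CREATE TABLE monitor_state (key_name TEXT PRIMARY KEY, value TEXT NOT NULL)"

def save_state(conn, key: str, obj) -> None:
    """Upsert a JSON blob (e.g. interface_baseline) under a key."""
    conn.execute(
        "INSERT INTO monitor_state (key_name, value) VALUES (?, ?) "
        "ON CONFLICT(key_name) DO UPDATE SET value = excluded.value",
        (key, json.dumps(obj)),
    )

def load_state(conn, key: str, default=None):
    """Load and decode a JSON blob; return default when the key is absent."""
    row = conn.execute(
        "SELECT value FROM monitor_state WHERE key_name = ?", (key,)
    ).fetchone()
    return json.loads(row[0]) if row else default
```

This is why the baseline survives daemon restarts: it lives in the table, not in process memory.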

Security Notes

  • XSS prevention: all user-controlled data in dynamically generated HTML uses escHtml() (JS) or Jinja2 auto-escaping (Python). Suppress buttons use data-* attributes + a single delegated click listener rather than inline onclick with interpolated strings.
  • Interface name validation: monitor.py validates SSH interface names against ^[a-zA-Z0-9_.@-]+$ before use, and additionally wraps them with shlex.quote() for defense-in-depth.
  • DB parameters: all SQL uses parameterised queries via pymysql — no string concatenation into SQL.
  • Auth: Authelia enforces admin-only access at the nginx/LXC 167 layer; the Flask app additionally checks the Remote-User header via @require_auth.
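The interface-name validation described above amounts to a whitelist regex plus shlex.quote. A minimal sketch (hypothetical helper name, not the actual monitor.py function):

```python
import re
import shlex

# Same pattern as documented: letters, digits, and _ . @ - only.
IFACE_RE = re.compile(r"^[a-zA-Z0-9_.@-]+$")

def safe_ethtool_cmd(iface: str) -> str:
    """Build the remote ethtool command only for validated interface names."""
    if not IFACE_RE.match(iface):
        raise ValueError(f"refusing suspicious interface name: {iface!r}")
    # shlex.quote is redundant after the whitelist check; kept for defence in depth.
    return f"ethtool {shlex.quote(iface)}"
```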

Known Limitations

  • Single gunicorn worker (--workers 1): db.py reuses one MariaDB connection per thread, so the service intentionally runs a single worker. Additional workers would each open their own connections, which works, but the thread-local reuse only pays off within one worker.
  • No CSRF tokens on API endpoints — mitigated by Authelia session cookies being SameSite=Strict and the site being admin-only.
  • SSH collection via Pulse is synchronous — if Pulse is slow, the entire monitor cycle is delayed. The pulse.timeout config controls the max wait.
  • UniFi LLDP data is only as fresh as the last monitor poll (120s default).