# GANDALF (Global Advanced Network Detection And Link Facilitator)

*Because it shall not let problems pass.*

Network monitoring dashboard for the LotusGuild Proxmox cluster.
Deployed on LXC 157 (monitor-02 / 10.10.10.9), reachable at gandalf.lotusguild.org.
**Design System:** `web_template` — shared CSS, JS, and layout patterns for all LotusGuild apps

## Styling & Layout

GANDALF uses the LotusGuild Terminal Design System. For all styling, component, and layout documentation see:

- `web_template/README.md` — full component reference, CSS variables, JS API
- `web_template/base.css` — unified CSS (`.lt-*` classes)
- `web_template/base.js` — `window.lt` utilities (toast, modal, auto-refresh, fetch helpers)
- `web_template/python/base.html` — Jinja2 base template
- `web_template/python/auth.py` — `@require_auth` decorator pattern
## Architecture

Two processes share a MariaDB database:

| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |

```
[Prometheus :9090] ──▶
[UniFi Controller] ──▶ monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[Pulse Worker]     ──▶
[SSH / ethtool]    ──▶
```
## Data Sources

| Source | What it provides |
|---|---|
| Prometheus (10.10.10.48:9090) | Physical NIC link state + traffic/error rates via node_exporter |
| UniFi API (https://10.10.10.1) | Switch port stats, device status, LLDP neighbor table, PoE data |
| Pulse Worker | SSH relay — runs ethtool + SFP DOM queries on each Proxmox host |
| Ping | Reachability for hosts without node_exporter (e.g. PBS) |
## Monitored Hosts (Prometheus / node_exporter)
| Host | Prometheus Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |
Ping-only (no node_exporter): pbs (10.10.10.3)
## Pages

### Dashboard (`/`)

- Real-time host status grid with per-NIC link state (UP / DOWN / degraded)
- Network topology diagram (Internet → Gateway → Switches → Hosts)
- UniFi device table (switches, APs, gateway)
- Active alerts table with severity, target, consecutive failures, ticket link
- Quick-suppress modal: apply timed or manual suppression from any alert row
- Auto-refreshes every 30 seconds via `/api/status` + `/api/network`
### Link Debug (`/links`)
Per-interface statistics collected every poll cycle. All panels are collapsible
(click header or use Collapse All / Expand All). Collapse state persists across
page refreshes via sessionStorage.
**Server NICs** (via Prometheus + SSH/ethtool):
- Speed, duplex, auto-negotiation, link detected
- TX/RX rate bars (bandwidth utilisation % of link capacity)
- TX/RX error and drop rates per second
- Carrier changes (cumulative since boot — watch for flapping)
- SFP / Optical panel (when SFP module present): vendor/PN, temp, voltage, bias current, TX power (dBm), RX power (dBm), RX−TX delta, per-stat bars
**UniFi Switch Ports** (via UniFi API):

- Port number badge (`#N`), UPLINK badge, PoE draw badge
- LLDP neighbor line: `→ system_name (port_id)` when a neighbor is detected
- PoE class and max wattage line
- Speed, duplex, auto-neg, TX/RX rates, errors, drops
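The TX/RX, error, and drop figures above are per-second rates derived from cumulative interface counters sampled on consecutive poll cycles. A minimal sketch of that derivation (the `counter_rate` helper is illustrative, not the actual `monitor.py` code; the counter-reset handling is an assumption):

```python
def counter_rate(prev: int, curr: int, interval_s: float) -> float:
    """Per-second rate from two cumulative counter samples.

    If the counter went backwards (NIC reset, host reboot), the delta is
    meaningless for this window, so report 0 rather than a huge negative rate.
    """
    if interval_s <= 0 or curr < prev:
        return 0.0
    return (curr - prev) / interval_s

# Example: tx_bytes went from 1_000_000 to 16_000_000 over a 120 s poll cycle
rate_bps = counter_rate(1_000_000, 16_000_000, 120) * 8   # bytes/s -> bits/s
utilisation = rate_bps / 1_000_000_000 * 100              # % of a 1 Gbit/s link
```

Clamping a backwards counter to zero trades one missed sample for avoiding a bogus spike in the rate bars after a reboot.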
### Inspector (`/inspector`)

Visual switch chassis diagrams. Each switch is rendered model-accurately using
layout config in the template (`SWITCH_LAYOUTS`).
Port block colours:
| Colour | State |
|---|---|
| Green | Up, no active PoE |
| Amber | Up with active PoE draw |
| Cyan | Uplink port (up) |
| Grey | Down |
| White outline | Currently selected |
Clicking a port opens the right-side detail panel showing:
- Link stats (status, speed, duplex, auto-neg, media type)
- PoE (class, max wattage, current draw, mode)
- Traffic (TX/RX rates)
- Errors/drops per second
- LLDP Neighbor section (system name, port ID, chassis ID, management IPs)
- Path Debug (auto-appears when the LLDP `system_name` matches a known server): two-column comparison of the switch port stats vs. the server NIC stats, including SFP DOM data if the server side has an SFP module
LLDP path debug requirements:
- Server must run `lldpd`: `apt install lldpd && systemctl enable --now lldpd`
- The `lldpd` hostname must match the key in `data.hosts` (set via `config.json → hosts`)
- Switch has LLDP enabled (UniFi default: on)
Supported switch models (set `SWITCH_LAYOUTS` keys to your UniFi model codes):

| Key | Model | Layout |
|---|---|---|
| `USF5P` | UniFi Switch Flex 5 PoE | 4×RJ45 + 1×SFP uplink |
| `USL8A` | UniFi Switch Lite 8 PoE | 8×SFP (2 rows of 4) |
| `US24PRO` | UniFi Switch Pro 24 | 24×RJ45 staggered + 2×SFP |
| `USPPDUP` | Custom/other | Single-port fallback |
| `USMINI` | UniFi Switch Mini | 5-port row |
Add new layouts by adding a key to `SWITCH_LAYOUTS` matching the `model` field
returned by the UniFi API for that device.
### Suppressions (`/suppressions`)
- Create timed (30 min / 1 hr / 4 hr / 8 hr) or manual suppressions
- Target types: host, interface, UniFi device, or global
- Active suppressions table with one-click removal
- Suppression history (last 50)
- Available targets reference grid (all known hosts + interfaces)
## Alert Logic

### Ticket Triggers
| Condition | Priority |
|---|---|
| UniFi device offline (≥2 consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (≥2 consecutive checks) | P2 High |
| Host unreachable via ping (≥2 consecutive checks) | P2 High |
| ≥3 hosts simultaneously reporting interface failures | P1 Critical |
### Baseline Tracking
Interfaces that are down on first observation (unused ports, unplugged cables)
are recorded as `initial_down` and never alerted. Only UP→DOWN regressions
generate tickets. The baseline is stored in MariaDB and survives daemon restarts.
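A sketch of that baseline rule, assuming a `host/iface → state` dict persisted in `monitor_state` (the `classify` helper and its return values are illustrative, not the actual `monitor.py` implementation):

```python
def classify(key: str, is_up: bool, baseline: dict) -> str:
    """Classify one observation; baseline maps 'host/iface' -> 'up' | 'initial_down'."""
    prev = baseline.get(key)
    if prev is None:
        # First observation: record state, never alert on a brand-new entry
        baseline[key] = "up" if is_up else "initial_down"
        return "ok" if is_up else "initial_down"
    if prev == "initial_down":
        if is_up:
            baseline[key] = "up"  # came up once: now tracked normally
        return "ok"               # initial_down never generates tickets
    # prev == "up": only an UP -> DOWN transition is an alertable regression
    return "ok" if is_up else "regression"
```

A "regression" result would still pass through the failure-threshold and suppression checks before a ticket is created.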
### Suppression Targets
| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |
Suppressions can be manual (persist until removed) or timed (auto-expire). Expired suppressions are checked at evaluation time — no background cleanup needed.
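A sketch of that evaluation-time check, assuming rule rows with `active`, `expires_at`, `type`, and `target` fields (the names follow the tables above, but the exact schema is an assumption):

```python
from datetime import datetime, timezone

def is_suppressed(rules: list[dict], target_type: str, target: str) -> bool:
    """True if any active, unexpired rule matches the alert target.

    A rule with expires_at=None is a manual suppression (until removed).
    Expiry is evaluated here, so no background cleanup job is needed.
    """
    now = datetime.now(timezone.utc)
    for r in rules:
        if not r["active"]:
            continue
        if r["expires_at"] is not None and r["expires_at"] <= now:
            continue  # timed suppression has lapsed
        if r["type"] == "all":
            return True
        if r["type"] == target_type and r["target"] == target:
            return True
    return False
```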
## Configuration (`config.json`)

Shared by both processes. Located in the working directory (`/var/www/html/prod/`).
```json
{
  "database": {
    "host": "10.10.10.50",
    "port": 3306,
    "user": "gandalf",
    "password": "...",
    "name": "gandalf"
  },
  "prometheus": {
    "url": "http://10.10.10.48:9090"
  },
  "unifi": {
    "controller": "https://10.10.10.1",
    "api_key": "...",
    "site_id": "default"
  },
  "ticket_api": {
    "url": "https://t.lotusguild.org/api/tickets",
    "api_key": "..."
  },
  "pulse": {
    "url": "http://<pulse-host>:<port>",
    "api_key": "...",
    "worker_id": "...",
    "timeout": 45
  },
  "auth": {
    "allowed_groups": ["admin"]
  },
  "hosts": [
    { "name": "large1", "prometheus_instance": "10.10.10.2:9100" },
    { "name": "compute-storage-01", "prometheus_instance": "10.10.10.4:9100" },
    { "name": "micro1", "prometheus_instance": "10.10.10.8:9100" },
    { "name": "monitor-02", "prometheus_instance": "10.10.10.9:9100" },
    { "name": "compute-storage-gpu-01", "prometheus_instance": "10.10.10.10:9100" },
    { "name": "storage-01", "prometheus_instance": "10.10.10.11:9100" }
  ],
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [
      { "name": "pbs", "ip": "10.10.10.3" }
    ]
  }
}
```
### Key Config Fields
| Key | Description |
|---|---|
| `database.*` | MariaDB credentials (LXC 149 at 10.10.10.50) |
| `prometheus.url` | Prometheus base URL |
| `unifi.controller` | UniFi controller base URL (HTTPS, self-signed cert ignored) |
| `unifi.api_key` | UniFi API key from controller Settings → API |
| `unifi.site_id` | UniFi site ID (default: `default`) |
| `ticket_api.api_key` | Tinker Tickets bearer token |
| `pulse.url` | Pulse worker API base URL (for SSH relay) |
| `pulse.worker_id` | Which Pulse worker runs ethtool collection |
| `pulse.timeout` | Max seconds to wait for SSH collection per host |
| `auth.allowed_groups` | Authelia groups that may access Gandalf |
| `hosts` | Maps Prometheus instance labels → display hostnames |
| `monitor.poll_interval` | Seconds between full check cycles (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before creating a ticket (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures needed to trigger a cluster-wide P1 (default: 3) |
| `monitor.ping_hosts` | Hosts checked only by ping (no node_exporter) |
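The monitor defaults in the table can be applied at load time so `config.json` only needs to override them. A minimal sketch (the `load_config` helper is illustrative; the real loading code may differ):

```python
import json

# Defaults mirror the Key Config Fields table; only the monitor section is shown.
MONITOR_DEFAULTS = {
    "poll_interval": 120,      # seconds between full check cycles
    "failure_threshold": 2,    # consecutive failures before a ticket
    "cluster_threshold": 3,    # failing hosts that escalate to a P1
}

def load_config(path: str = "config.json") -> dict:
    """Read config.json and layer documented defaults under the monitor section."""
    with open(path) as f:
        cfg = json.load(f)
    cfg["monitor"] = {**MONITOR_DEFAULTS, **cfg.get("monitor", {})}
    return cfg
```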
## Deployment (LXC 157)

### 1. Database — MariaDB LXC 149 (10.10.10.50)

```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```

Import schema:

```sh
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```
### 2. LXC 157 — Install dependencies

```sh
pip3 install -r requirements.txt
# Ensure sshpass is available (used by deploy scripts)
apt install sshpass
```
### 3. Deploy files

```sh
# From the dev machine, /root/code/gandalf:
for f in app.py db.py monitor.py config.json schema.sql \
         static/style.css static/app.js \
         templates/*.html; do
  sshpass -p 'yourpass' scp -o StrictHostKeyChecking=no \
    "$f" "root@10.10.10.61:/var/www/html/prod/$f"
done
systemctl restart gandalf gandalf-monitor
```
### 4. systemd services

`gandalf.service` (Flask/gunicorn web app):

```ini
[Unit]
Description=Gandalf Web Dashboard
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
Restart=always

[Install]
WantedBy=multi-user.target
```

`gandalf-monitor.service` (background polling daemon):

```ini
[Unit]
Description=Gandalf Network Monitor Daemon
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 monitor.py
Restart=always

[Install]
WantedBy=multi-user.target
```
### 5. Authelia rule (LXC 167)

```yaml
access_control:
  rules:
    - domain: gandalf.lotusguild.org
      policy: one_factor
      subject:
        - group:admin
```

```sh
systemctl restart authelia
```
### 6. NPM reverse proxy

- Domain: `gandalf.lotusguild.org`
- Forward to: `http://10.10.10.61:8000` (gunicorn direct, no nginx needed on the LXC)
- Forward Auth: Authelia at `http://10.10.10.167:9091`
- WebSockets: not required
## Service Management

```sh
# Status
systemctl status gandalf gandalf-monitor

# Logs (live)
journalctl -u gandalf -f
journalctl -u gandalf-monitor -f

# Restart after code or config changes
systemctl restart gandalf gandalf-monitor
```
## Troubleshooting

### Monitor not creating tickets

- Verify `config.json → ticket_api.api_key` is set and valid
- Check `journalctl -u gandalf-monitor` for `Ticket creation failed` lines
- Confirm the Tinker Tickets API is reachable from LXC 157

### Link Debug shows no data / "Loading…" forever

- Check `gandalf-monitor.service` is running and has completed at least one cycle
- Check `journalctl -u gandalf-monitor` for Prometheus or UniFi errors
- Verify Prometheus is reachable: `curl http://10.10.10.48:9090/api/v1/query?query=up`

### Link Debug: SFP DOM panel missing

- SFP data requires the Pulse worker + SSH access to hosts
- Verify `config.json → pulse.*` is configured and the Pulse worker is running
- Confirm `sshpass` + SSH access from the Pulse worker to each Proxmox host
- Only interfaces with physical SFP modules return DOM data (`ethtool -m`)

### Inspector: path debug section not appearing

- Requires LLDP: run `apt install lldpd && systemctl enable --now lldpd` on each server
- The LLDP `system_name` broadcast by `lldpd` must match the hostname in `config.json → hosts[].name`
  - Override: `echo 'configure system hostname large1' > /etc/lldpd.d/hostname.conf && systemctl restart lldpd`
- Allow up to 2 poll cycles (240 s) after installing lldpd for the LLDP table to populate

### Inspector: switch chassis shows as flat list (no layout)

- The switch's `model` field from UniFi doesn't match any key in `SWITCH_LAYOUTS` in `inspector.html`
- Check the UniFi API: the model appears in the `link_stats` API response under `unifi_switches.<name>.model`
- Add the model key to `SWITCH_LAYOUTS` in `inspector.html` with the correct row/SFP layout

### Baseline re-initializing on every restart

- `interface_baseline` is stored in the `monitor_state` DB table and survives restarts
- If it appears to reset: check DB connectivity from the monitor daemon

### Interface stuck at "initial_down" forever

- This means the interface was down when the monitor first saw it
- It will begin tracking once it comes up; or manually clear the baseline:

```sql
-- In MariaDB on 10.10.10.50:
UPDATE monitor_state SET value='{}' WHERE key_name='interface_baseline';
```

Then restart the monitor: `systemctl restart gandalf-monitor`

### Prometheus data missing for a host

```sh
# On the affected host:
systemctl status prometheus-node-exporter

# Verify it's scraped:
curl http://10.10.10.48:9090/api/v1/query?query=up | jq '.data.result[] | select(.metric.job=="node")'
```
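When triaging the `up` query above without `jq`, a small parser for the query API's JSON makes the down instances obvious. A sketch, assuming the standard `/api/v1/query` response shape (note the API returns sample values as strings):

```python
import json
import urllib.request  # used by the commented fetch example below

def down_instances(query_result: dict) -> list[str]:
    """Instances reported down (up == 0) in a /api/v1/query?query=up response."""
    return [
        r["metric"].get("instance", "?")
        for r in query_result["data"]["result"]
        if r["value"][1] == "0"   # value is [timestamp, "0" | "1"], value as string
    ]

# Fetching it (Prometheus address from the Data Sources table):
# with urllib.request.urlopen("http://10.10.10.48:9090/api/v1/query?query=up") as resp:
#     print(down_instances(json.load(resp)))
```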
## Development Notes

### File Layout

```
gandalf/
├── app.py                # Flask web app (routes, auth, API endpoints)
├── monitor.py            # Background daemon (Prometheus, UniFi, Pulse, alert logic)
├── db.py                 # Database operations (MariaDB via pymysql, thread-local conn reuse)
├── schema.sql            # Database schema (network_events, suppression_rules, monitor_state)
├── config.json           # Runtime configuration (not committed with secrets)
├── requirements.txt      # Python dependencies
├── static/
│   ├── style.css         # Terminal aesthetic CSS (CRT scanlines, green-on-black)
│   └── app.js            # Dashboard JS (auto-refresh, host grid, events, suppress modal)
└── templates/
    ├── base.html         # Shared layout (header, nav, footer)
    ├── index.html        # Dashboard page
    ├── links.html        # Link Debug page (server NICs + UniFi switch ports)
    ├── inspector.html    # Visual switch inspector + LLDP path debug
    └── suppressions.html # Suppression management page
```
### Adding a New Monitored Host

1. Install `prometheus-node-exporter` on the host
2. Add a scrape target to the Prometheus config
3. Add an entry to `config.json → hosts`: `{ "name": "newhost", "prometheus_instance": "10.10.10.X:9100" }`
4. Restart the monitor: `systemctl restart gandalf-monitor`
5. For SFP DOM / ethtool: ensure the host is SSH-accessible from the Pulse worker
### Adding a New Switch Layout (Inspector)

Find the UniFi model code for the switch (it appears in the `/api/links` JSON response
under `unifi_switches.<switch_name>.model`), then add it to `SWITCH_LAYOUTS` in
`templates/inspector.html`:

```js
'MYNEWMODEL': {
  rows: [[1,2,3,4,5,6,7,8], [9,10,11,12,13,14,15,16]], // port_idx by row
  sfp_section: [17, 18], // separate SFP cage ports (rendered below rows)
  sfp_ports: [],         // port_idx values that are SFP-type within rows
},
```
### Database Schema Notes

- `network_events`: one row per active event; `resolved_at` is set when recovered
- `suppression_rules`: `active=FALSE` when removed; `expires_at` checked at query time
- `monitor_state`: key/value store; `interface_baseline` and `link_stats` are JSON blobs
## Security Notes

- XSS prevention: all user-controlled data in dynamically generated HTML goes through `escHtml()` (JS) or Jinja2 auto-escaping (Python). Suppress buttons use `data-*` attributes plus a single delegated click listener rather than inline `onclick` with interpolated strings.
- Interface name validation: `monitor.py` validates SSH interface names against `^[a-zA-Z0-9_.@-]+$` before use, and additionally wraps them with `shlex.quote()` for defense in depth.
- DB parameters: all SQL uses parameterised queries via pymysql — no string concatenation into SQL.
- Auth: Authelia enforces admin-only access at the nginx/LXC 167 layer; the Flask app additionally checks the `Remote-User` header via `@require_auth`.
## Known Limitations

- Single gunicorn worker (`--workers 1`): `db.py` uses thread-local connection reuse (one connection per thread). Multiple workers would each hold their own connection, which works, but the thread-local optimisation only helps within a single worker.
- No CSRF tokens on API endpoints — mitigated by Authelia session cookies being `SameSite=Strict` and the site being admin-only.
- SSH collection via Pulse is synchronous — if Pulse is slow, the entire monitor cycle is delayed. The `pulse.timeout` config controls the max wait.
- UniFi LLDP data is only as fresh as the last monitor poll (120 s default).