GANDALF (Global Advanced Network Detection And Link Facilitator)
Because it shall not let problems pass.
Network monitoring dashboard for the LotusGuild Proxmox cluster.
Deployed on LXC 157 (monitor-02 / 10.10.10.9), reachable at gandalf.lotusguild.org.
Design System: web_template — shared CSS, JS, and layout patterns for all LotusGuild apps
Styling & Layout
GANDALF uses the LotusGuild Terminal Design System. For all styling, component, and layout documentation see:
- web_template/README.md: full component reference, CSS variables, JS API
- web_template/base.css: unified CSS (.lt-* classes)
- web_template/base.js: window.lt utilities (toast, modal, auto-refresh, fetch helpers)
- web_template/python/base.html: Jinja2 base template
- web_template/python/auth.py: @require_auth decorator pattern
Architecture
Two processes share a MariaDB database:
| Process | Service | Role |
|---|---|---|
| app.py | gandalf.service | Flask web dashboard (gunicorn, port 8000) |
| monitor.py | gandalf-monitor.service | Background polling daemon |
[Prometheus :9090] ──▶
[UniFi Controller] ──▶ monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[Pulse Worker]     ──▶
[SSH / ethtool]    ──▶
Data Sources
| Source | What it provides |
|---|---|
| Prometheus (10.10.10.48:9090) | Physical NIC link state + traffic/error rates via node_exporter |
| UniFi API (https://10.10.10.1) | Switch port stats, device status, LLDP neighbor table, PoE data |
| Pulse Worker | SSH relay — runs ethtool + SFP DOM queries on each Proxmox host |
| Ping | Reachability for hosts without node_exporter (e.g. PBS) |
Monitored Hosts (Prometheus / node_exporter)
| Host | Prometheus Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |
Ping-only (no node_exporter): pbs (10.10.10.3)
Pages
Dashboard (/)
- Real-time host status grid with per-NIC link state (UP / DOWN / degraded)
- Network topology diagram (Internet → Gateway → Switches → Hosts)
- UniFi device table (switches, APs, gateway)
- Active alerts table with severity, target, consecutive failures, ticket link
- Quick-suppress modal: apply timed or manual suppression from any alert row
- Auto-refreshes every 30 seconds via /api/status + /api/network
Link Debug (/links)
Per-interface statistics collected every poll cycle. All panels are collapsible
(click header or use Collapse All / Expand All). Collapse state persists across
page refreshes via sessionStorage.
Server NICs (via Prometheus + SSH/ethtool):
- Speed, duplex, auto-negotiation, link detected
- TX/RX rate bars (bandwidth utilisation % of link capacity)
- TX/RX error and drop rates per second
- Carrier changes (cumulative since boot — watch for flapping)
- SFP / Optical panel (when SFP module present): vendor/PN, temp, voltage, bias current, TX power (dBm), RX power (dBm), RX−TX delta, per-stat bars
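The utilisation bars and the RX−TX delta are simple derived values. The following is a minimal sketch of how they could be computed, assuming rates arrive in bytes per second and link speed in Mb/s; the function names and units are illustrative assumptions, not taken from monitor.py.

```python
def utilisation_pct(rate_bytes_per_s: float, link_speed_mbps: float) -> float:
    """Return bandwidth utilisation as a percentage of link capacity."""
    if link_speed_mbps <= 0:
        return 0.0  # link down or speed unknown: nothing to scale against
    capacity_bits_per_s = link_speed_mbps * 1_000_000
    pct = (rate_bytes_per_s * 8) / capacity_bits_per_s * 100
    return min(pct, 100.0)  # clamp so a rate spike cannot overflow the bar


def optical_delta_db(rx_power_dbm: float, tx_power_dbm: float) -> float:
    """Return RX power minus TX power, in dB (both inputs in dBm)."""
    return rx_power_dbm - tx_power_dbm
```

For example, 125 MB/s on a 10 Gb/s link is about 10 % utilisation, and an RX of -3.5 dBm against a TX of -2.0 dBm gives a delta of -1.5 dB.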
UniFi Switch Ports (via UniFi API):
- Port number badge (#N), UPLINK badge, PoE draw badge
- LLDP neighbor line: → system_name (port_id) when neighbor is detected
- PoE class and max wattage line
- Speed, duplex, auto-neg, TX/RX rates, errors, drops
Inspector (/inspector)
Visual switch chassis diagrams. Each switch is rendered model-accurately using
layout config in the template (SWITCH_LAYOUTS).
Port block colours:
| Colour | State |
|---|---|
| Green | Up, no active PoE |
| Amber | Up with active PoE draw |
| Cyan | Uplink port (up) |
| Grey | Down |
| White outline | Currently selected |
Clicking a port opens the right-side detail panel showing:
- Link stats (status, speed, duplex, auto-neg, media type)
- PoE (class, max wattage, current draw, mode)
- Traffic (TX/RX rates)
- Errors/drops per second
- LLDP Neighbor section (system name, port ID, chassis ID, management IPs)
- Path Debug (auto-appears when LLDP system_name matches a known server): two-column comparison of the switch port stats vs. the server NIC stats, including SFP DOM data if the server side has an SFP module
LLDP path debug requirements:
- Server must run lldpd: apt install lldpd && systemctl enable --now lldpd
- lldpd hostname must match the key in data.hosts (set via config.json → hosts)
- Switch has LLDP enabled (UniFi default: on)
Supported switch models (set SWITCH_LAYOUTS keys to your UniFi model codes):
| Key | Model | Layout |
|---|---|---|
| USF5P | UniFi Switch Flex 5 PoE | 4×RJ45 + 1×SFP uplink |
| USL8A | UniFi Switch Lite 8 PoE | 8×SFP (2 rows of 4) |
| US24PRO | UniFi Switch Pro 24 | 24×RJ45 staggered + 2×SFP |
| USPPDUP | Custom/other | Single-port fallback |
| USMINI | UniFi Switch Mini | 5-port row |
Add new layouts by adding a key to SWITCH_LAYOUTS matching the model field
returned by the UniFi API for that device.
Suppressions (/suppressions)
- Create timed (30 min / 1 hr / 4 hr / 8 hr) or manual suppressions
- Target types: host, interface, UniFi device, or global
- Active suppressions table with one-click removal
- Suppression history (last 50)
- Available targets reference grid (all known hosts + interfaces)
Alert Logic
Ticket Triggers
| Condition | Priority |
|---|---|
| UniFi device offline (≥2 consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (≥2 consecutive checks) | P2 High |
| Host unreachable via ping (≥2 consecutive checks) | P2 High |
| ≥3 hosts simultaneously reporting interface failures | P1 Critical |
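The trigger table can be read as two thresholds applied to per-target failure counters. Below is a rough sketch of that decision, assuming each check carries a consecutive-failure count; the `Check` type, `plan_tickets`, and the choice to count only hosts that already crossed the per-target threshold toward the cluster rule are assumptions for illustration, not the logic in monitor.py.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 2  # mirrors monitor.failure_threshold
CLUSTER_THRESHOLD = 3  # mirrors monitor.cluster_threshold


@dataclass
class Check:
    host: str                  # host the target belongs to
    target: str                # hypothetical label, e.g. "large1:eno1"
    consecutive_failures: int  # failed poll cycles in a row


def plan_tickets(checks):
    """Return (priority, target) pairs for checks that crossed a threshold."""
    failing = [c for c in checks if c.consecutive_failures >= FAILURE_THRESHOLD]
    tickets = [("P2", c.target) for c in failing]
    # Assumption: the cluster rule counts distinct hosts with a confirmed failure.
    if len({c.host for c in failing}) >= CLUSTER_THRESHOLD:
        tickets.append(("P1", "cluster"))
    return tickets
```

A single failed check never produces a ticket in this sketch, which is the point of the "≥2 consecutive checks" wording: one dropped poll is treated as noise.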
Baseline Tracking
Interfaces that are down on first observation (unused ports, unplugged cables)
are recorded as initial_down and never alerted. Only UP→DOWN regressions
generate tickets. Baseline is stored in MariaDB and survives daemon restarts.
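The baseline rule amounts to a small state machine per interface. Here is a minimal sketch, assuming the baseline is a dict keyed by a hypothetical "host:iface" string; only the initial_down state name comes from this README, and the "up" value and the function name are illustrative.

```python
def evaluate_link(baseline: dict, key: str, is_up: bool) -> bool:
    """Update the baseline for one interface.

    Returns True only when an interface previously seen up is now down.
    """
    state = baseline.get(key)
    if state is None:
        # First observation: record what we see and never alert on it.
        baseline[key] = "up" if is_up else "initial_down"
        return False
    if state == "initial_down":
        # Start tracking once the interface comes up for the first time.
        if is_up:
            baseline[key] = "up"
        return False
    # Interface has been seen up before: a down reading is a regression.
    return not is_up
```

Because the dict is what gets persisted, reloading it after a restart keeps unused ports in initial_down instead of treating them as new failures.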
Suppression Targets
| Type | Suppresses |
|---|---|
| host | All interface alerts for a named host |
| interface | A specific NIC on a specific host |
| unifi_device | A specific UniFi device |
| all | Everything (global maintenance mode) |
Suppressions can be manual (persist until removed) or timed (auto-expire). Expired suppressions are checked at evaluation time — no background cleanup needed.
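Checking expiry at evaluation time means a rule is simply ignored once its expires_at has passed. A minimal sketch of such a check follows, assuming rules are dicts; active and expires_at appear in the schema notes, while target_type, target_name, target_detail, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def rule_is_live(rule: dict, now: datetime) -> bool:
    """A rule applies if it is active and has not expired (None = manual)."""
    if not rule["active"]:
        return False
    expires_at = rule.get("expires_at")
    return expires_at is None or expires_at > now


def is_suppressed(rules, host=None, interface=None, unifi_device=None, now=None):
    """Return True if any live rule covers the given alert target."""
    now = now or datetime.now(timezone.utc)
    for rule in rules:
        if not rule_is_live(rule, now):
            continue
        kind, name = rule["target_type"], rule.get("target_name")
        if kind == "all":
            return True
        if kind == "host" and host is not None and name == host:
            return True
        if (kind == "interface" and host is not None and interface is not None
                and name == host and rule.get("target_detail") == interface):
            return True
        if kind == "unifi_device" and unifi_device is not None and name == unifi_device:
            return True
    return False
```

Expired rows stay in the table for the history view; they just stop matching.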
Configuration (config.json)
Shared by both processes. Located in the working directory (/var/www/html/prod/).
{
  "database": {
    "host": "10.10.10.50",
    "port": 3306,
    "user": "gandalf",
    "password": "...",
    "name": "gandalf"
  },
  "prometheus": {
    "url": "http://10.10.10.48:9090"
  },
  "unifi": {
    "controller": "https://10.10.10.1",
    "api_key": "...",
    "site_id": "default"
  },
  "ticket_api": {
    "url": "https://t.lotusguild.org/api/tickets",
    "api_key": "..."
  },
  "pulse": {
    "url": "http://<pulse-host>:<port>",
    "api_key": "...",
    "worker_id": "...",
    "timeout": 45
  },
  "auth": {
    "allowed_groups": ["admin"]
  },
  "hosts": [
    { "name": "large1", "prometheus_instance": "10.10.10.2:9100" },
    { "name": "compute-storage-01", "prometheus_instance": "10.10.10.4:9100" },
    { "name": "micro1", "prometheus_instance": "10.10.10.8:9100" },
    { "name": "monitor-02", "prometheus_instance": "10.10.10.9:9100" },
    { "name": "compute-storage-gpu-01", "prometheus_instance": "10.10.10.10:9100" },
    { "name": "storage-01", "prometheus_instance": "10.10.10.11:9100" }
  ],
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [
      { "name": "pbs", "ip": "10.10.10.3" }
    ]
  }
}
Key Config Fields
| Key | Description |
|---|---|
| database.* | MariaDB credentials (LXC 149 at 10.10.10.50) |
| prometheus.url | Prometheus base URL |
| unifi.controller | UniFi controller base URL (HTTPS, self-signed cert ignored) |
| unifi.api_key | UniFi API key from controller Settings → API |
| unifi.site_id | UniFi site ID (default: default) |
| ticket_api.api_key | Tinker Tickets bearer token |
| pulse.url | Pulse worker API base URL (for SSH relay) |
| pulse.worker_id | Which Pulse worker runs ethtool collection |
| pulse.timeout | Max seconds to wait for SSH collection per host |
| auth.allowed_groups | Authelia groups that may access Gandalf |
| hosts | Maps Prometheus instance labels → display hostnames |
| monitor.poll_interval | Seconds between full check cycles (default: 120) |
| monitor.failure_threshold | Consecutive failures before creating ticket (default: 2) |
| monitor.cluster_threshold | Hosts with failures to trigger cluster-wide P1 (default: 3) |
| monitor.ping_hosts | Hosts checked only by ping (no node_exporter) |
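To illustrate how the documented defaults and the hosts mapping fit together, here is a minimal sketch of a loader. The helper names and the empty default for ping_hosts are assumptions; the actual loading code in app.py and monitor.py may differ.

```python
import json

# Defaults as documented in the table above; ping_hosts default is assumed.
MONITOR_DEFAULTS = {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [],
}


def load_config_text(text: str) -> dict:
    """Parse config JSON and fill in documented defaults for missing keys."""
    cfg = json.loads(text)
    cfg["monitor"] = {**MONITOR_DEFAULTS, **cfg.get("monitor", {})}
    cfg.setdefault("unifi", {}).setdefault("site_id", "default")
    return cfg


def instance_to_name(cfg: dict) -> dict:
    """Map Prometheus instance labels to display hostnames."""
    return {h["prometheus_instance"]: h["name"] for h in cfg.get("hosts", [])}
```

With the example config, `instance_to_name(cfg)["10.10.10.2:9100"]` would yield `"large1"`, which is the lookup needed when turning a Prometheus series label into a row on the dashboard.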
Deployment (LXC 157)
1. Database — MariaDB LXC 149 (10.10.10.50)
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
Import schema:
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
2. LXC 157 — Install dependencies
pip3 install -r requirements.txt
# Ensure sshpass is available (used by deploy scripts)
apt install sshpass
3. Deploy files
# From dev machine / root/code/gandalf:
for f in app.py db.py monitor.py config.json schema.sql \
static/style.css static/app.js \
templates/*.html; do
sshpass -p 'yourpass' scp -o StrictHostKeyChecking=no \
"$f" "root@10.10.10.61:/var/www/html/prod/$f"
done
systemctl restart gandalf gandalf-monitor
4. systemd services
gandalf.service (Flask/gunicorn web app):
[Unit]
Description=Gandalf Web Dashboard
After=network.target
[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
Restart=always
[Install]
WantedBy=multi-user.target
gandalf-monitor.service (background polling daemon):
[Unit]
Description=Gandalf Network Monitor Daemon
After=network.target
[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 monitor.py
Restart=always
[Install]
WantedBy=multi-user.target
5. Authelia rule (LXC 167)
access_control:
  rules:
    - domain: gandalf.lotusguild.org
      policy: one_factor
      subject:
        - group:admin
systemctl restart authelia
6. NPM reverse proxy
- Domain: gandalf.lotusguild.org
- Forward to: http://10.10.10.61:8000 (gunicorn direct, no nginx needed on LXC)
- Forward Auth: Authelia at http://10.10.10.167:9091
- WebSockets: Not required
Service Management
# Status
systemctl status gandalf gandalf-monitor
# Logs (live)
journalctl -u gandalf -f
journalctl -u gandalf-monitor -f
# Restart after code or config changes
systemctl restart gandalf gandalf-monitor
Troubleshooting
Monitor not creating tickets
- Verify config.json → ticket_api.api_key is set and valid
- Check journalctl -u gandalf-monitor for "Ticket creation failed" lines
- Confirm the Tinker Tickets API is reachable from LXC 157
Link Debug shows no data / "Loading…" forever
- Check gandalf-monitor.service is running and has completed at least one cycle
- Check journalctl -u gandalf-monitor for Prometheus or UniFi errors
- Verify Prometheus is reachable: curl 'http://10.10.10.48:9090/api/v1/query?query=up'
Link Debug: SFP DOM panel missing
- SFP data requires Pulse worker + SSH access to hosts
- Verify config.json → pulse.* is configured and the Pulse worker is running
- Confirm sshpass + SSH access from the Pulse worker to each Proxmox host
- Only interfaces with physical SFP modules return DOM data (ethtool -m)
Inspector: path debug section not appearing
- Requires LLDP: run apt install lldpd && systemctl enable --now lldpd on each server
- The LLDP system_name broadcast by lldpd must match the hostname in config.json → hosts[].name
  - Override: echo 'configure system hostname large1' > /etc/lldpd.d/hostname.conf && systemctl restart lldpd
- Allow up to 2 poll cycles (240s) after installing lldpd for LLDP table to populate
Inspector: switch chassis shows as flat list (no layout)
- The switch's model field from UniFi doesn't match any key in SWITCH_LAYOUTS in inspector.html
- Check the UniFi API: the model appears in the link_stats API response under unifi_switches.<name>.model
- Add the model key to SWITCH_LAYOUTS in inspector.html with the correct row/SFP layout
Baseline re-initializing on every restart
- interface_baseline is stored in the monitor_state DB table; survives restarts
- If it appears to reset: check DB connectivity from the monitor daemon
Interface stuck at "initial_down" forever
- This means the interface was down when the monitor first saw it
- It will begin tracking once it comes up; or manually clear it (this resets the baseline for every interface, not just the stuck one):
-- In MariaDB on 10.10.10.50:
UPDATE monitor_state SET value='{}' WHERE key_name='interface_baseline';
Then restart the monitor:
systemctl restart gandalf-monitor
Prometheus data missing for a host
# On the affected host:
systemctl status prometheus-node-exporter
# Verify it's scraped:
curl -s 'http://10.10.10.48:9090/api/v1/query?query=up' | jq '.data.result[] | select(.metric.job=="node")'
Development Notes
File Layout
gandalf/
├── app.py # Flask web app (routes, auth, API endpoints)
├── monitor.py # Background daemon (Prometheus, UniFi, Pulse, alert logic)
├── db.py # Database operations (MariaDB via pymysql, thread-local conn reuse)
├── schema.sql # Database schema (network_events, suppression_rules, monitor_state)
├── config.json # Runtime configuration (not committed with secrets)
├── requirements.txt # Python dependencies
├── static/
│ ├── style.css # Terminal aesthetic CSS (CRT scanlines, green-on-black)
│ └── app.js # Dashboard JS (auto-refresh, host grid, events, suppress modal)
└── templates/
├── base.html # Shared layout (header, nav, footer)
├── index.html # Dashboard page
├── links.html # Link Debug page (server NICs + UniFi switch ports)
├── inspector.html # Visual switch inspector + LLDP path debug
└── suppressions.html # Suppression management page
Adding a New Monitored Host
- Install prometheus-node-exporter on the host
- Add a scrape target to Prometheus config
- Add an entry to config.json → hosts: { "name": "newhost", "prometheus_instance": "10.10.10.X:9100" }
- Restart monitor: systemctl restart gandalf-monitor
- For SFP DOM / ethtool: ensure the host is SSH-accessible from the Pulse worker
Adding a New Switch Layout (Inspector)
Find the UniFi model code for the switch (it appears in the /api/links JSON response
under unifi_switches.<switch_name>.model), then add to SWITCH_LAYOUTS in
templates/inspector.html:
'MYNEWMODEL': {
  rows: [[1,2,3,4,5,6,7,8], [9,10,11,12,13,14,15,16]], // port_idx by row
  sfp_section: [17, 18], // separate SFP cage ports (rendered below rows)
  sfp_ports: [], // port_idx values that are SFP-type within rows
},
Database Schema Notes
- network_events: one row per active event; resolved_at is set when recovered
- suppression_rules: active=FALSE when removed; expires_at checked at query time
- monitor_state: key/value store; interface_baseline and link_stats are JSON blobs
Security Notes
- XSS prevention: all user-controlled data in dynamically generated HTML uses escHtml() (JS) or Jinja2 auto-escaping (Python). Suppress buttons use data-* attributes + a single delegated click listener rather than inline onclick with interpolated strings.
- Interface name validation: monitor.py validates SSH interface names against ^[a-zA-Z0-9_.@-]+$ before use, and additionally wraps them with shlex.quote() for defense-in-depth.
- DB parameters: all SQL uses parameterised queries via pymysql, with no string concatenation into SQL.
- Auth: Authelia enforces admin-only access at the nginx/LXC 167 layer; the Flask app additionally checks the Remote-User header via @require_auth.
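The interface-name check described above can be sketched as follows. The regex is the one quoted in this README; the function name and the bare ethtool command string are illustrative assumptions rather than the code in monitor.py.

```python
import re
import shlex

# Pattern quoted in the security notes above.
_IFACE_RE = re.compile(r"^[a-zA-Z0-9_.@-]+$")


def ethtool_command(iface: str) -> str:
    """Build an ethtool command line for a validated interface name."""
    # fullmatch also rejects a trailing newline, which "$" alone would allow.
    if not _IFACE_RE.fullmatch(iface):
        raise ValueError(f"invalid interface name: {iface!r}")
    # Quoting is redundant after validation; it is kept as a second layer.
    return f"ethtool {shlex.quote(iface)}"
```

A name such as `eno1` or `vmbr0.100` passes unchanged, while something like `eth0; reboot` is rejected before it can reach the SSH relay.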
Known Limitations
- Single gunicorn worker (--workers 1): db.py reuses one connection per thread via thread-local storage. Multiple workers would still work, since each would hold its own connections, but the reuse only helps within a single worker.
- No CSRF tokens on API endpoints: mitigated by Authelia session cookies being SameSite=Strict and the site being admin-only.
- SSH collection via Pulse is synchronous: if Pulse is slow, the entire monitor cycle is delayed. The pulse.timeout config controls the max wait.
- UniFi LLDP data is only as fresh as the last monitor poll (120s default).
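For readers unfamiliar with thread-local connection reuse, the pattern looks roughly like this. It is a minimal sketch with a hypothetical `get_conn` helper that takes a connection factory (for instance a lambda wrapping `pymysql.connect(...)`); db.py may differ, and liveness checks or reconnects are not shown.

```python
import threading

_local = threading.local()  # each thread sees its own attributes on this object


def get_conn(connect):
    """Return this thread's cached connection, creating it on first use.

    `connect` is any zero-argument callable that opens a new connection.
    """
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = connect()
        _local.conn = conn
    return conn
```

Two calls on the same thread return the same object, while a second thread gets its own connection, so no connection is ever shared across threads.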
CI / CD
| Workflow | Purpose | Triggers |
|---|---|---|
| lint.yml (python-lint) | flake8 on all .py files | Every push and PR |
| lint.yml (js-lint) | ESLint on static/ | Every push and PR |
| test.yml | pytest: 33 tests for diagnose.py static methods | Every push and PR |
| security.yml | bandit -ll (medium+ severity) | Every push, PR, and weekly Monday 6am |
| deploy job in lint.yml | Calls the gandalf-deploy webhook on CT157 (10.10.10.61) | Push to main only, after both lint jobs pass |
Branch protection is enabled on main — both lint jobs must pass before any PR can merge.
Tests live in tests/test_diagnose.py and cover DiagnosticsRunner static methods:
build_ssh_command, parse_output, parse_sysfs_stats, parse_ethtool, and variants.