# GANDALF (Global Advanced Network Detection And Link Facilitator)
> Because it shall not let problems pass.
Network monitoring dashboard for the LotusGuild Proxmox cluster.
Deployed on **LXC 157** (monitor-02 / 10.10.10.9), reachable at `gandalf.lotusguild.org`.
**Design System**: [web_template](https://code.lotusguild.org/LotusGuild/web_template) — shared CSS, JS, and layout patterns for all LotusGuild apps

## Styling & Layout
GANDALF uses the **LotusGuild Terminal Design System**. For all styling, component, and layout documentation, see:
- [`web_template/README.md`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/README.md) — full component reference, CSS variables, JS API
- [`web_template/base.css`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/base.css) — unified CSS (`.lt-*` classes)
- [`web_template/base.js`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/base.js) — `window.lt` utilities (toast, modal, auto-refresh, fetch helpers)
- [`web_template/python/base.html`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/python/base.html) — Jinja2 base template
- [`web_template/python/auth.py`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/python/auth.py) — `@require_auth` decorator pattern
---
## Architecture
Two processes share a MariaDB database:
| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |
```
[Prometheus :9090] ──▶
[UniFi Controller] ──▶ monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[Pulse Worker] ──▶
[SSH / ethtool] ──▶
```
### Data Sources
| Source | What it provides |
|---|---|
| **Prometheus** (`10.10.10.48:9090`) | Physical NIC link state + traffic/error rates via `node_exporter` |
| **UniFi API** (`https://10.10.10.1`) | Switch port stats, device status, LLDP neighbor table, PoE data |
| **Pulse Worker** | SSH relay — runs `ethtool` + SFP DOM queries on each Proxmox host |
| **Ping** | Reachability for hosts without `node_exporter` (e.g. PBS) |
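
The Prometheus source is plain HTTP API calls. A minimal sketch of building an instant-query URL — the helper name is illustrative, and which exact metrics Gandalf queries is an assumption, though `node_network_up` is a standard `node_exporter` metric:

```python
from urllib.parse import urlencode

def prom_query_url(base: str, promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL (illustrative helper)."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

# Link state for one NIC on one host:
url = prom_query_url(
    "http://10.10.10.48:9090",
    'node_network_up{instance="10.10.10.2:9100",device="eno1"}',
)
```
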
### Monitored Hosts (Prometheus / node_exporter)
| Host | Prometheus Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |
Ping-only (no node_exporter): **pbs** (10.10.10.3)
---
## Pages
### Dashboard (`/`)
- Real-time host status grid with per-NIC link state (UP / DOWN / degraded)
- Network topology diagram (Internet → Gateway → Switches → Hosts)
- UniFi device table (switches, APs, gateway)
- Active alerts table with severity, target, consecutive failures, ticket link
- Quick-suppress modal: apply timed or manual suppression from any alert row
- Auto-refreshes every 30 seconds via `/api/status` + `/api/network`
### Link Debug (`/links`)
Per-interface statistics collected every poll cycle. All panels are collapsible
(click header or use Collapse All / Expand All). Collapse state persists across
page refreshes via `sessionStorage`.
**Server NICs** (via Prometheus + SSH/ethtool):
- Speed, duplex, auto-negotiation, link detected
- TX/RX rate bars (bandwidth utilisation % of link capacity)
- TX/RX error and drop rates per second
- Carrier changes (cumulative since boot — watch for flapping)
- **SFP / Optical panel** (when an SFP module is present): vendor/PN, temp, voltage,
  bias current, TX power (dBm), RX power (dBm), RX−TX delta, per-stat bars
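
The utilisation bars reduce to rate-over-capacity arithmetic. A minimal sketch of the calculation (the function name is illustrative, not Gandalf's actual code):

```python
def utilisation_pct(rate_bytes_per_sec: float, link_speed_mbps: float) -> float:
    """Percentage of link capacity a TX/RX byte-rate sample represents."""
    if link_speed_mbps <= 0:
        return 0.0  # link down or speed unknown: no meaningful capacity
    capacity_bytes_per_sec = link_speed_mbps * 1_000_000 / 8
    return min(100.0, 100.0 * rate_bytes_per_sec / capacity_bytes_per_sec)
```

For example, 125 MB/s on a 10 Gbit/s link works out to 10% utilisation.
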
**UniFi Switch Ports** (via UniFi API):
- Port number badge (`#N`), UPLINK badge, PoE draw badge
- LLDP neighbor line: `→ system_name (port_id)` when neighbor is detected
- PoE class and max wattage line
- Speed, duplex, auto-neg, TX/RX rates, errors, drops
### Inspector (`/inspector`)
Visual switch chassis diagrams. Each switch is rendered model-accurately using
layout config in the template (`SWITCH_LAYOUTS`).
**Port block colours:**
| Colour | State |
|---|---|
| Green | Up, no active PoE |
| Amber | Up with active PoE draw |
| Cyan | Uplink port (up) |
| Grey | Down |
| White outline | Currently selected |
**Clicking a port** opens the right-side detail panel showing:
- Link stats (status, speed, duplex, auto-neg, media type)
- PoE (class, max wattage, current draw, mode)
- Traffic (TX/RX rates)
- Errors/drops per second
- **LLDP Neighbor** section (system name, port ID, chassis ID, management IPs)
- **Path Debug** (auto-appears when LLDP `system_name` matches a known server):
two-column comparison of the switch port stats vs. the server NIC stats,
including SFP DOM data if the server side has an SFP module
**LLDP path debug requirements:**
1. The server must run `lldpd`: `apt install lldpd && systemctl enable --now lldpd`
2. The `lldpd` hostname must match the key in `data.hosts` (set via `config.json → hosts`)
3. The switch must have LLDP enabled (UniFi default: on)
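
Requirement 2 is a straight name comparison. A sketch of that matching — whether Gandalf tolerates hosts advertising an FQDN is an assumption, not confirmed by this README:

```python
def match_lldp_neighbor(system_name, known_hosts):
    """Map an LLDP system_name to a config.json host name, or None.

    Compares case-insensitively on the part before the first dot so a
    host advertising an FQDN still matches its short config name.
    """
    short = system_name.split(".", 1)[0].lower()
    for host in known_hosts:
        if host.lower() == short:
            return host
    return None
```
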
**Supported switch models** (set `SWITCH_LAYOUTS` keys to your UniFi model codes):
| Key | Model | Layout |
|---|---|---|
| `USF5P` | UniFi Switch Flex 5 PoE | 4× RJ45 + 1× SFP uplink |
| `USL8A` | UniFi Switch Lite 8 PoE | 8× SFP (2 rows of 4) |
| `US24PRO` | UniFi Switch Pro 24 | 24× RJ45 staggered + 2× SFP |
| `USPPDUP` | Custom/other | Single-port fallback |
| `USMINI` | UniFi Switch Mini | 5-port row |
Add new layouts by adding a key to `SWITCH_LAYOUTS` matching the `model` field
returned by the UniFi API for that device.
### Suppressions (`/suppressions`)
- Create timed (30 min / 1 hr / 4 hr / 8 hr) or manual suppressions
- Target types: host, interface, UniFi device, or global
- Active suppressions table with one-click removal
- Suppression history (last 50)
- Available targets reference grid (all known hosts + interfaces)
---
## Alert Logic
### Ticket Triggers
| Condition | Priority |
|---|---|
| UniFi device offline (≥2 consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (≥2 consecutive checks) | P2 High |
| Host unreachable via ping (≥2 consecutive checks) | P2 High |
| ≥3 hosts simultaneously reporting interface failures | P1 Critical |
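
The table's thresholds condense to a pure function over per-target consecutive-failure counts. A sketch (the function name and the `host/iface` target format are illustrative):

```python
def evaluate_alerts(failures, failure_threshold=2, cluster_threshold=3):
    """Return (targets over threshold, cluster-wide P1 flag).

    `failures` maps a target like "large1/eno1" to its consecutive
    failure count; the number of distinct failing hosts drives the
    P1 Critical escalation.
    """
    failing = [t for t, n in failures.items() if n >= failure_threshold]
    hosts = {t.split("/")[0] for t in failing}
    return failing, len(hosts) >= cluster_threshold
```
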
### Baseline Tracking
Interfaces that are **down on first observation** (unused ports, unplugged cables)
are recorded as `initial_down` and never alerted on. Only **UP→DOWN regressions**
generate tickets. The baseline is stored in MariaDB and survives daemon restarts.
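
The baseline rule amounts to a small state machine. A sketch of the classification step (illustrative names, not `monitor.py`'s actual code):

```python
def classify_link(iface, is_up, baseline):
    """Return "alert" on an UP->DOWN regression, else None.

    `baseline` mirrors the persisted interface_baseline blob: the first
    sighting seeds it, initial_down ports stay silent until they come
    up, and only tracked (previously up) ports alert on going down.
    """
    if iface not in baseline:
        baseline[iface] = "up" if is_up else "initial_down"
        return None                     # never alert on first observation
    if baseline[iface] == "initial_down":
        if is_up:
            baseline[iface] = "up"      # port came alive: start tracking it
        return None
    return None if is_up else "alert"
```
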
### Suppression Targets
| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |
Suppressions can be manual (persist until removed) or timed (auto-expire).
Expired suppressions are checked at evaluation time — no background cleanup needed.
---
## Configuration (`config.json`)
Shared by both processes. Located in the working directory (`/var/www/html/prod/`).
```json
{
  "database": {
    "host": "10.10.10.50",
    "port": 3306,
    "user": "gandalf",
    "password": "...",
    "name": "gandalf"
  },
  "prometheus": {
    "url": "http://10.10.10.48:9090"
  },
  "unifi": {
    "controller": "https://10.10.10.1",
    "api_key": "...",
    "site_id": "default"
  },
  "ticket_api": {
    "url": "https://t.lotusguild.org/api/tickets",
    "api_key": "..."
  },
  "pulse": {
    "url": "http://<pulse-host>:<port>",
    "api_key": "...",
    "worker_id": "...",
    "timeout": 45
  },
  "auth": {
    "allowed_groups": ["admin"]
  },
  "hosts": [
    { "name": "large1", "prometheus_instance": "10.10.10.2:9100" },
    { "name": "compute-storage-01", "prometheus_instance": "10.10.10.4:9100" },
    { "name": "micro1", "prometheus_instance": "10.10.10.8:9100" },
    { "name": "monitor-02", "prometheus_instance": "10.10.10.9:9100" },
    { "name": "compute-storage-gpu-01", "prometheus_instance": "10.10.10.10:9100" },
    { "name": "storage-01", "prometheus_instance": "10.10.10.11:9100" }
  ],
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [
      { "name": "pbs", "ip": "10.10.10.3" }
    ]
  }
}
```
### Key Config Fields
| Key | Description |
|---|---|
| `database.*` | MariaDB credentials (LXC 149 at 10.10.10.50) |
| `prometheus.url` | Prometheus base URL |
| `unifi.controller` | UniFi controller base URL (HTTPS; self-signed cert ignored) |
| `unifi.api_key` | UniFi API key from controller Settings → API |
| `unifi.site_id` | UniFi site ID (default: `default`) |
| `ticket_api.api_key` | Tinker Tickets bearer token |
| `pulse.url` | Pulse worker API base URL (for SSH relay) |
| `pulse.worker_id` | Which Pulse worker runs ethtool collection |
| `pulse.timeout` | Max seconds to wait for SSH collection per host |
| `auth.allowed_groups` | Authelia groups that may access Gandalf |
| `hosts` | Maps Prometheus instance labels → display hostnames |
| `monitor.poll_interval` | Seconds between full check cycles (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before creating ticket (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures to trigger cluster-wide P1 (default: 3) |
| `monitor.ping_hosts` | Hosts checked only by ping (no node_exporter) |
---
## Deployment (LXC 157)
### 1. Database — MariaDB LXC 149 (`10.10.10.50`)
```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```
Import schema:
``` bash
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```
### 2. LXC 157 — Install dependencies
```bash
pip3 install -r requirements.txt

# Ensure sshpass is available (used by deploy scripts)
apt install sshpass
```
### 3. Deploy files
```bash
# From the dev machine, in /root/code/gandalf:
for f in app.py db.py monitor.py config.json schema.sql \
         static/style.css static/app.js \
         templates/*.html; do
  sshpass -p 'yourpass' scp -o StrictHostKeyChecking=no \
    "$f" "root@10.10.10.61:/var/www/html/prod/$f"
done
# Then restart the services on the LXC:
sshpass -p 'yourpass' ssh root@10.10.10.61 systemctl restart gandalf gandalf-monitor
```
### 4. systemd services
**`gandalf.service`** (Flask/gunicorn web app):
```ini
[Unit]
Description=Gandalf Web Dashboard
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 0.0.0.0:8000 app:app
Restart=always

[Install]
WantedBy=multi-user.target
```
**`gandalf-monitor.service`** (background polling daemon):
```ini
[Unit]
Description=Gandalf Network Monitor Daemon
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 monitor.py
Restart=always

[Install]
WantedBy=multi-user.target
```
### 5. Authelia rule (LXC 167)
```yaml
access_control:
  rules:
    - domain: gandalf.lotusguild.org
      policy: one_factor
      subject:
        - "group:admin"
```
``` bash
systemctl restart authelia
```
### 6. NPM reverse proxy
- **Domain:** `gandalf.lotusguild.org`
- **Forward to:** `http://10.10.10.61:8000` (gunicorn direct, no nginx needed on LXC)
- **Forward Auth:** Authelia at `http://10.10.10.167:9091`
- **WebSockets:** Not required
---
## Service Management
```bash
# Status
systemctl status gandalf gandalf-monitor

# Logs (live)
journalctl -u gandalf -f
journalctl -u gandalf-monitor -f

# Restart after code or config changes
systemctl restart gandalf gandalf-monitor
```
---
## Troubleshooting
### Monitor not creating tickets
- Verify `config.json → ticket_api.api_key` is set and valid
- Check `journalctl -u gandalf-monitor` for `Ticket creation failed` lines
- Confirm the Tinker Tickets API is reachable from LXC 157
### Link Debug shows no data / "Loading…" forever
- Check `gandalf-monitor.service` is running and has completed at least one cycle
- Check `journalctl -u gandalf-monitor` for Prometheus or UniFi errors
- Verify Prometheus is reachable: `curl http://10.10.10.48:9090/api/v1/query?query=up`
### Link Debug: SFP DOM panel missing
- SFP data requires Pulse worker + SSH access to hosts
- Verify `config.json → pulse.*` is configured and the Pulse worker is running
- Confirm `sshpass` + SSH access from the Pulse worker to each Proxmox host
- Only interfaces with physical SFP modules return DOM data (`ethtool -m` )
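
When debugging DOM collection by hand, the raw `ethtool -m` text can be reduced to numbers in a few lines. A best-effort sketch — the exact field labels vary by driver and module type, so the two labels below are assumptions, and the mW→dBm conversion is the standard `10 * log10(mW)`:

```python
import math
import re

def parse_dom(ethtool_m_output):
    """Extract TX/RX optical power (dBm) from `ethtool -m` text output.

    Matches generic `Label : value` lines; field labels differ across
    drivers/modules, so the two labels below are examples, not a spec.
    """
    fields = {}
    for line in ethtool_m_output.splitlines():
        m = re.match(r"\s*([^:]+?)\s*:\s*(.+)", line)
        if m:
            fields[m.group(1)] = m.group(2)
    dom = {}
    for label, key in (("Laser output power", "tx_dbm"),
                       ("Receiver signal average optical power", "rx_dbm")):
        mw = re.search(r"([\d.]+)\s*mW", fields.get(label, ""))
        if mw and float(mw.group(1)) > 0:
            dom[key] = round(10 * math.log10(float(mw.group(1))), 2)
    return dom
```
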
### Inspector: path debug section not appearing
- Requires LLDP: run `apt install lldpd && systemctl enable --now lldpd` on each server
- The LLDP `system_name` broadcast by `lldpd` must match the hostname in `config.json → hosts[].name`
- Override: `echo 'configure system hostname large1' > /etc/lldpd.d/hostname.conf && systemctl restart lldpd`
- Allow up to 2 poll cycles (240s) after installing lldpd for LLDP table to populate
### Inspector: switch chassis shows as flat list (no layout)
- The switch's `model` field from UniFi doesn't match any key in `SWITCH_LAYOUTS` in `inspector.html`
- Check the UniFi API: the model appears in the `link_stats` API response under `unifi_switches.<name>.model`
- Add the model key to `SWITCH_LAYOUTS` in `inspector.html` with the correct row/SFP layout
### Baseline re-initializing on every restart
- `interface_baseline` is stored in the `monitor_state` DB table; survives restarts
- If it appears to reset: check DB connectivity from the monitor daemon
### Interface stuck at "initial_down" forever
- This means the interface was down when the monitor first saw it
- It will begin tracking once it comes up; or manually clear it:
```sql
-- In MariaDB on 10.10.10.50:
UPDATE monitor_state SET value='{}' WHERE key_name='interface_baseline';
```
Then restart the monitor: `systemctl restart gandalf-monitor`
### Prometheus data missing for a host
```bash
# On the affected host:
systemctl status prometheus-node-exporter
# Verify it's scraped:
curl 'http://10.10.10.48:9090/api/v1/query?query=up' | jq '.data.result[] | select(.metric.job=="node")'
```
---
## Development Notes
### File Layout
```
gandalf/
├── app.py # Flask web app (routes, auth, API endpoints)
├── monitor.py # Background daemon (Prometheus, UniFi, Pulse, alert logic)
├── db.py # Database operations (MariaDB via pymysql, thread-local conn reuse)
├── schema.sql # Database schema (network_events, suppression_rules, monitor_state)
├── config.json # Runtime configuration (not committed with secrets)
├── requirements.txt # Python dependencies
├── static/
│ ├── style.css # Terminal aesthetic CSS (CRT scanlines, green-on-black)
│ └── app.js # Dashboard JS (auto-refresh, host grid, events, suppress modal)
└── templates/
├── base.html # Shared layout (header, nav, footer)
├── index.html # Dashboard page
├── links.html # Link Debug page (server NICs + UniFi switch ports)
├── inspector.html # Visual switch inspector + LLDP path debug
└── suppressions.html # Suppression management page
```
### Adding a New Monitored Host
1. Install `prometheus-node-exporter` on the host
2. Add a scrape target to the Prometheus config
3. Add an entry to `config.json → hosts`:
   ```json
   { "name": "newhost", "prometheus_instance": "10.10.10.X:9100" }
   ```
4. Restart the monitor: `systemctl restart gandalf-monitor`
5. For SFP DOM / ethtool: ensure the host is SSH-accessible from the Pulse worker
### Adding a New Switch Layout (Inspector)
Find the UniFi model code for the switch (it appears in the `/api/links` JSON response
under `unifi_switches.<switch_name>.model`), then add it to `SWITCH_LAYOUTS` in
`templates/inspector.html`:
```javascript
'MYNEWMODEL': {
  rows: [[1,2,3,4,5,6,7,8], [9,10,11,12,13,14,15,16]], // port_idx by row
  sfp_section: [17, 18], // separate SFP cage ports (rendered below rows)
  sfp_ports: [],         // port_idx values that are SFP-type within rows
},
```
### Database Schema Notes
- `network_events`: one row per active event; `resolved_at` is set when recovered
- `suppression_rules`: `active=FALSE` when removed; `expires_at` checked at query time
- `monitor_state`: key/value store; `interface_baseline` and `link_stats` are JSON blobs
### Security Notes
- **XSS prevention**: all user-controlled data in dynamically generated HTML goes through
  `escHtml()` (JS) or Jinja2 auto-escaping (Python). Suppress buttons use `data-*`
  attributes plus a single delegated click listener rather than inline `onclick` with
  interpolated strings.
- **Interface name validation**: `monitor.py` validates SSH interface names against
  `^[a-zA-Z0-9_.@-]+$` before use, and additionally wraps them with `shlex.quote()`
  for defense-in-depth.
- **DB parameters**: all SQL uses parameterised queries via pymysql — no string
  concatenation into SQL.
- **Auth**: Authelia enforces admin-only access at the nginx/LXC 167 layer; the Flask
  app additionally checks the `Remote-User` header via `@require_auth`.
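
A framework-agnostic sketch of the `@require_auth` idea: trust the identity headers the proxy injects and reject requests whose groups don't intersect `auth.allowed_groups`. The real decorator lives in `web_template/python/auth.py`; the header names follow what Authelia forwards (`Remote-User`, `Remote-Groups`), but the shape here is illustrative:

```python
from functools import wraps

def require_auth(allowed_groups):
    """Decorator factory: gate a view on proxy-supplied identity headers."""
    allowed = set(allowed_groups)

    def decorator(view):
        @wraps(view)
        def wrapped(headers, *args, **kwargs):
            user = headers.get("Remote-User")
            groups = {g for g in headers.get("Remote-Groups", "").split(",") if g}
            if not user or not (groups & allowed):
                return 403, "Forbidden"   # no identity or no allowed group
            return view(headers, *args, **kwargs)
        return wrapped
    return decorator
```
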
### Known Limitations
- Single gunicorn worker (`--workers 1`): `db.py` relies on thread-local connection
  reuse (one MariaDB connection per thread). Multiple workers would each open their
  own connections, which works, but the thread-local optimisation only pays off within
  a single worker.
- No CSRF tokens on API endpoints — mitigated by Authelia session cookies being
  `SameSite=Strict` and the site being admin-only.
- SSH collection via Pulse is synchronous — if Pulse is slow, the entire monitor cycle
  is delayed; `pulse.timeout` caps the wait per host.
- UniFi LLDP data is only as fresh as the last monitor poll (120 s default).
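
The thread-local pattern behind the single-worker note, sketched with a stand-in factory (the real `db.py` presumably calls `pymysql.connect`; only the caching shape is shown here):

```python
import threading

class ConnectionCache:
    """Lazily create one connection per thread and reuse it thereafter."""

    def __init__(self, factory):
        self._local = threading.local()   # per-thread attribute storage
        self._factory = factory           # stands in for pymysql.connect

    def get(self):
        if not hasattr(self._local, "conn"):
            self._local.conn = self._factory()  # first use in this thread
        return self._local.conn
```

Each gunicorn worker process holds its own cache, so the reuse only helps within one worker.
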