2025-01-04 00:07:15 -05:00
# GANDALF (Global Advanced Network Detection And Link Facilitator)
feat: inspector page, link debug enhancements, security hardening
- Add /inspector page: visual model-accurate switch chassis diagrams
(USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks
with color coding (green=up, amber=PoE, cyan=uplink, grey=down),
detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side
- Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max,
collapsible host/switch panels with sessionStorage persistence
- monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch
port; PulseClient uses requests.Session() for HTTP keep-alive; add
shlex.quote() around interface names (defense-in-depth)
- Security: suppress buttons use data-* attrs + delegated click handler
instead of inline onclick with Jinja2 variable interpolation; remove
| safe filter from user-controlled fields in suppressions.html;
setDuration() takes explicit el param instead of implicit event global
- db.py: thread-local connection reuse with ping(reconnect=True) to
avoid a new TCP handshake per query
- .gitignore: add config.json (contains credentials), __pycache__
- README: full rewrite covering architecture, all 4 pages, alert logic,
config reference, deployment, troubleshooting, security notes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 15:39:48 -05:00
> Because it shall not let problems pass.

Network monitoring dashboard for the LotusGuild Proxmox cluster.

Deployed on **LXC 157** (monitor-02 / 10.10.10.9), reachable at `gandalf.lotusguild.org`.

---
## Architecture
Two processes share a MariaDB database:

| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |

```
[Prometheus :9090]  ──▶
[UniFi Controller]  ──▶  monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[Pulse Worker]      ──▶
[SSH / ethtool]     ──▶
```
### Data Sources
| Source | What it provides |
|---|---|
| **Prometheus** (`10.10.10.48:9090`) | Physical NIC link state + traffic/error rates via `node_exporter` |
| **UniFi API** (`https://10.10.10.1`) | Switch port stats, device status, LLDP neighbor table, PoE data |
| **Pulse Worker** | SSH relay that runs `ethtool` + SFP DOM queries on each Proxmox host |
| **Ping** | Reachability for hosts without `node_exporter` (e.g. PBS) |
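
As a rough illustration of the Prometheus source, the sketch below builds the instant-query URL a poller could request for one host's NIC link state. The `/api/v1/query` endpoint and the `node_network_up` metric are standard Prometheus and `node_exporter` names. The helper names and the device-name filter are assumptions made for this example, not taken from `monitor.py`.

```python
from urllib.parse import urlencode

# Prometheus base URL as listed in the table above.
PROMETHEUS_URL = "http://10.10.10.48:9090"


def link_state_query(instance: str) -> str:
    """PromQL for per-device link state on one node_exporter instance.

    Assumption: virtual devices are excluded by name; the real filter may differ.
    """
    return f'node_network_up{{instance="{instance}",device!~"lo|veth.*|tap.*"}}'


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL of a Prometheus instant query (GET /api/v1/query)."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


# Example: URL for large1 (10.10.10.2:9100)
url = build_instant_query_url(PROMETHEUS_URL, link_state_query("10.10.10.2:9100"))
```

Fetching `url` with any HTTP client returns JSON whose `data.result` list holds one sample per network device.
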
### Monitored Hosts (Prometheus / node_exporter)
| Host | Prometheus Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |

Ping-only (no `node_exporter`): **pbs** (10.10.10.3)

---
## Pages
### Dashboard (`/`)
- Real-time host status grid with per-NIC link state (UP / DOWN / degraded)
- Network topology diagram (Internet → Gateway → Switches → Hosts)
- UniFi device table (switches, APs, gateway)
- Active alerts table with severity, target, consecutive failures, ticket link
- Quick-suppress modal: apply timed or manual suppression from any alert row
- Auto-refreshes every 30 seconds via `/api/status` + `/api/network`
### Link Debug (`/links`)
Per-interface statistics collected every poll cycle. All panels are collapsible
(click a header or use Collapse All / Expand All). Collapse state persists across
page refreshes via `sessionStorage`.

**Server NICs** (via Prometheus + SSH/ethtool):
- Speed, duplex, auto-negotiation, link detected
- TX/RX rate bars (bandwidth utilisation % of link capacity)
- TX/RX error and drop rates per second
- Carrier changes (cumulative since boot; watch for flapping)
- **SFP / Optical panel** (when SFP module present): vendor/PN, temp, voltage,
  bias current, TX power (dBm), RX power (dBm), RX−TX delta, per-stat bars

**UniFi Switch Ports** (via UniFi API):
- Port number badge (`#N`), UPLINK badge, PoE draw badge
- LLDP neighbor line: `→ system_name (port_id)` when neighbor is detected
- PoE class and max wattage line
- Speed, duplex, auto-negotiation, TX/RX rates, errors, drops
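
The two derived values shown for server NICs, utilisation as a percentage of link capacity and the RX−TX optical delta, can be sketched as below. The function and parameter names are hypothetical. The sketch assumes rates arrive in bytes per second, link speed in Mb/s (as `ethtool` reports it), and optical power in dBm.

```python
def utilisation_pct(rate_bytes_per_s: float, link_speed_mbps: int) -> float:
    """Bandwidth utilisation as a percentage of link capacity, capped at 100."""
    if link_speed_mbps <= 0:
        return 0.0  # unknown or down link: nothing meaningful to show
    capacity_bits_per_s = link_speed_mbps * 1_000_000
    return min(100.0, rate_bytes_per_s * 8 / capacity_bits_per_s * 100)


def optical_delta_db(rx_power_dbm: float, tx_power_dbm: float) -> float:
    """RX minus TX power in dB; a difference of two dBm values is a plain dB figure."""
    return rx_power_dbm - tx_power_dbm


# 62.5 MB/s on a 1 Gb/s link is half of the link capacity.
half_loaded = utilisation_pct(62_500_000, 1000)
delta = optical_delta_db(-5.5, -2.0)
```

A strongly negative delta means the module receives far less light than it transmits, which is usually worth a look at the fibre or the far-end optic.
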
### Inspector (`/inspector`)
Visual switch chassis diagrams. Each switch is rendered model-accurately using
layout config in the template (`SWITCH_LAYOUTS`).

**Port block colours:**

| Colour | State |
|---|---|
| Green | Up, no active PoE |
| Amber | Up with active PoE draw |
| Cyan | Uplink port (up) |
| Grey | Down |
| White outline | Currently selected |

**Clicking a port** opens the right-side detail panel showing:
- Link stats (status, speed, duplex, auto-neg, media type)
- PoE (class, max wattage, current draw, mode)
- Traffic (TX/RX rates)
- Errors/drops per second
- **LLDP Neighbor** section (system name, port ID, chassis ID, management IPs)
- **Path Debug** (auto-appears when LLDP `system_name` matches a known server):
  two-column comparison of the switch port stats vs. the server NIC stats,
  including SFP DOM data if the server side has an SFP module

**LLDP path debug requirements:**
1. Server must run `lldpd`: `apt install lldpd && systemctl enable --now lldpd`
2. `lldpd` hostname must match the key in `data.hosts` (set via `config.json → hosts`)
3. Switch has LLDP enabled (UniFi default: on)

**Supported switch models** (set `SWITCH_LAYOUTS` keys to your UniFi model codes):

| Key | Model | Layout |
|---|---|---|
| `USF5P` | UniFi Switch Flex 5 PoE | 4×RJ45 + 1×SFP uplink |
| `USL8A` | UniFi Switch Lite 8 PoE | 8×SFP (2 rows of 4) |
| `US24PRO` | UniFi Switch Pro 24 | 24×RJ45 staggered + 2×SFP |
| `USPPDUP` | Custom/other | Single-port fallback |
| `USMINI` | UniFi Switch Mini | 5-port row |

Add new layouts by adding a key to `SWITCH_LAYOUTS` matching the `model` field
returned by the UniFi API for that device.
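
Two pieces of the Inspector behaviour described above, the port colour rule and the LLDP match that triggers Path Debug, might look roughly like this. The actual page logic lives in the template and is not shown in this README. The dictionary keys (`up`, `is_uplink`, `poe_power`), the function names, and the precedence order are all assumptions for illustration.

```python
from typing import Iterable, Optional


def port_colour(port: dict) -> str:
    """Map a port record to a colour name from the table above.

    Assumed precedence: down, then uplink, then PoE draw, then plain up.
    """
    if not port.get("up"):
        return "grey"
    if port.get("is_uplink"):
        return "cyan"
    if (port.get("poe_power") or 0) > 0:
        return "amber"
    return "green"


def match_lldp_host(system_name: Optional[str], known_hosts: Iterable[str]) -> Optional[str]:
    """Return the configured host whose name equals the LLDP system name, if any."""
    if not system_name:
        return None
    name = system_name.strip()
    return name if name in set(known_hosts) else None


hosts = ["large1", "micro1", "storage-01"]
colour = port_colour({"up": True, "is_uplink": False, "poe_power": 4.2})
matched = match_lldp_host("micro1", hosts)
```

The exact-match rule mirrors requirement 2 above: if `lldpd` advertises a different hostname, `match_lldp_host` returns `None` and no Path Debug section would appear.
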
### Suppressions (`/suppressions`)
- Create timed (30 min / 1 hr / 4 hr / 8 hr) or manual suppressions
- Target types: host, interface, UniFi device, or global
- Active suppressions table with one-click removal
- Suppression history (last 50)
- Available targets reference grid (all known hosts + interfaces)
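
One way to model timed and manual suppressions is to store an optional expiry and compare it at lookup time, so nothing has to delete expired rows in the background. The sketch below is an assumption about how such a lookup could work, not the code in `app.py` or `monitor.py`. The `Suppression` class, its field names, and both helpers are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass
class Suppression:
    target_type: str                       # "host", "interface", "unifi_device" or "all"
    target_name: str = ""                  # host or device name (unused for "all")
    target_detail: str = ""                # interface name (only for "interface")
    expires_at: Optional[datetime] = None  # None means manual: active until removed


def expiry_after(minutes: int, now: datetime) -> datetime:
    """Expiry timestamp for a timed suppression, e.g. 30 or 240 minutes."""
    return now + timedelta(minutes=minutes)


def is_suppressed(
    suppressions: Iterable[Suppression],
    target_type: str,
    name: str,
    detail: str = "",
    now: Optional[datetime] = None,
) -> bool:
    """True if any unexpired suppression covers the given alert target."""
    now = now or datetime.now(timezone.utc)
    for s in suppressions:
        if s.expires_at is not None and s.expires_at <= now:
            continue  # expired timed suppression: skipped here, never deleted
        if s.target_type == "all":
            return True
        if s.target_name != name:
            continue
        # A host suppression also covers every interface on that host.
        if s.target_type == "host" and target_type in ("host", "interface"):
            return True
        if s.target_type == target_type and (
            target_type != "interface" or s.target_detail == detail
        ):
            return True
    return False
```
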
---
## Alert Logic
### Ticket Triggers
| Condition | Priority |
|---|---|
| UniFi device offline (≥2 consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (≥2 consecutive checks) | P2 High |
| Host unreachable via ping (≥2 consecutive checks) | P2 High |
| ≥3 hosts simultaneously reporting interface failures | P1 Critical |
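
The thresholds in this table can be read as a small decision function. The sketch below is a simplified model under stated assumptions: the function name and the shape of its inputs are hypothetical, and it assumes the cluster-wide P1 is raised in addition to the per-target P2 tickets, which this README does not specify.

```python
from typing import Dict, List, Set, Tuple

FAILURE_THRESHOLD = 2  # consecutive failed checks before a P2 ticket
CLUSTER_THRESHOLD = 3  # hosts with interface failures before a P1 ticket


def tickets_for_cycle(
    consecutive_failures: Dict[str, int],
    hosts_with_interface_failures: Set[str],
) -> List[Tuple[str, str]]:
    """Return (priority, target) pairs to open after one poll cycle."""
    tickets = [
        ("P2 High", target)
        for target, count in sorted(consecutive_failures.items())
        if count >= FAILURE_THRESHOLD
    ]
    if len(hosts_with_interface_failures) >= CLUSTER_THRESHOLD:
        # Assumption: the cluster ticket does not replace the P2 tickets.
        tickets.append(("P1 Critical", "cluster"))
    return tickets


single = tickets_for_cycle({"large1/eno1": 2, "pbs": 1}, {"large1"})
```

Here `pbs` has failed only once, so it stays below the threshold and only `large1/eno1` produces a ticket.
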
### Baseline Tracking
Interfaces that are **down on first observation** (unused ports, unplugged cables)
are recorded as `initial_down` and never alerted. Only **UP→DOWN regressions**
generate tickets. Baseline is stored in MariaDB and survives daemon restarts.
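
The baseline rule can be sketched as a small state update per interface. This is a minimal model, not the daemon's code: a plain dict stands in for the MariaDB table, and the function name, key format, and return values are hypothetical. It also assumes that a port first seen down becomes alertable once it has been seen up, which this README does not state.

```python
def evaluate_link(baseline: dict, key: str, is_up: bool) -> str:
    """Update the stored baseline for one interface.

    Returns "ok", "ignored" (down since first observation) or "regression".
    """
    previous = baseline.get(key)
    if previous is None:
        # First observation decides whether the port is ever expected to be up.
        baseline[key] = "up" if is_up else "initial_down"
        return "ok" if is_up else "ignored"
    if is_up:
        baseline[key] = "up"  # assumption: a later UP promotes an initial_down port
        return "ok"
    if previous == "initial_down":
        return "ignored"
    return "regression"  # was up before, is down now


baseline = {}
first = evaluate_link(baseline, "large1/eno2", False)   # unused port
seen_up = evaluate_link(baseline, "large1/eno1", True)  # healthy port
dropped = evaluate_link(baseline, "large1/eno1", False) # UP→DOWN
```

A "regression" result would still need to repeat for the configured number of consecutive checks before a ticket is opened.
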
### Suppression Targets
| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |

Suppressions can be manual (persist until removed) or timed (auto-expire).
Expired suppressions are checked at evaluation time, so no background cleanup is needed.

---
## Configuration (`config.json`)
Shared by both processes. Located in the working directory (`/var/www/html/prod/`).

```json
{
  "database": {
    "host": "10.10.10.50",
    "port": 3306,
    "user": "gandalf",
    "password": "...",
    "name": "gandalf"
  },
  "prometheus": {
    "url": "http://10.10.10.48:9090"
  },
  "unifi": {
    "controller": "https://10.10.10.1",
    "api_key": "...",
    "site_id": "default"
  },
  "ticket_api": {
    "url": "https://t.lotusguild.org/api/tickets",
    "api_key": "..."
  },
  "pulse": {
    "url": "http://<pulse-host>:<port>",
    "api_key": "...",
    "worker_id": "...",
    "timeout": 45
  },
  "auth": {
    "allowed_groups": ["admin"]
  },
  "hosts": [
    { "name": "large1", "prometheus_instance": "10.10.10.2:9100" },
    { "name": "compute-storage-01", "prometheus_instance": "10.10.10.4:9100" },
    { "name": "micro1", "prometheus_instance": "10.10.10.8:9100" },
    { "name": "monitor-02", "prometheus_instance": "10.10.10.9:9100" },
    { "name": "compute-storage-gpu-01", "prometheus_instance": "10.10.10.10:9100" },
    { "name": "storage-01", "prometheus_instance": "10.10.10.11:9100" }
  ],
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [
      { "name": "pbs", "ip": "10.10.10.3" }
    ]
  }
}
```
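
A minimal sketch of reading this file, assuming only the structure shown above. The helper names are hypothetical, and the fallback values simply repeat the numbers from the example config rather than anything confirmed from the source code.

```python
import json


def load_config(path: str = "config.json") -> dict:
    """Read the shared JSON config from the working directory."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def instance_to_hostname(config: dict) -> dict:
    """Map Prometheus instance labels to display hostnames."""
    return {h["prometheus_instance"]: h["name"] for h in config.get("hosts", [])}


def monitor_settings(config: dict) -> dict:
    """Return the monitor section with fallbacks for missing keys."""
    monitor = config.get("monitor", {})
    return {
        "poll_interval": monitor.get("poll_interval", 120),
        "failure_threshold": monitor.get("failure_threshold", 2),
        "cluster_threshold": monitor.get("cluster_threshold", 3),
        "ping_hosts": monitor.get("ping_hosts", []),
    }


# Works on any dict with the same shape as config.json.
example = {"hosts": [{"name": "large1", "prometheus_instance": "10.10.10.2:9100"}]}
names = instance_to_hostname(example)
settings = monitor_settings(example)
```

Since `config.json` holds credentials, it belongs outside version control and should be readable only by the service user.
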
### Key Config Fields
| Key | Description |
|---|---|
| `database.*` | MariaDB credentials (LXC 149 at 10.10.10.50) |
| `prometheus.url` | Prometheus base URL |
| `unifi.controller` | UniFi controller base URL (HTTPS, self-signed cert ignored) |
| `unifi.api_key` | UniFi API key from controller Settings → API |
| `unifi.site_id` | UniFi site ID (default: `default`) |
| `ticket_api.api_key` | Tinker Tickets bearer token |
| `pulse.url` | Pulse worker API base URL (for SSH relay) |
| `pulse.worker_id` | Which Pulse worker runs ethtool collection |
| `pulse.timeout` | Max seconds to wait for SSH collection per host |
| `auth.allowed_groups` | Authelia groups that may access Gandalf |
| `hosts` | Maps Prometheus instance labels → display hostnames |
| `monitor.poll_interval` | Seconds between full check cycles (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before creating ticket (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures to trigger cluster-wide P1 (default: 3) |
| `monitor.ping_hosts` | Hosts checked only by ping (no node_exporter) |

---
## Deployment (LXC 157)
### 1. Database — MariaDB LXC 149 (`10.10.10.50`)
|
2026-03-01 23:03:18 -05:00
|
|
|
|
|
|
|
|
|
|
```sql
|
|
|
|
|
|
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
|
|
|
|
|
|
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
|
|
|
|
|
|
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
|
|
|
|
|
|
FLUSH PRIVILEGES;
|
|
|
|
|
|
```
|
|
|
|
|
|
|
feat: inspector page, link debug enhancements, security hardening
- Add /inspector page: visual model-accurate switch chassis diagrams
(USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks
with color coding (green=up, amber=PoE, cyan=uplink, grey=down),
detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side
- Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max,
collapsible host/switch panels with sessionStorage persistence
- monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch
port; PulseClient uses requests.Session() for HTTP keep-alive; add
shlex.quote() around interface names (defense-in-depth)
- Security: suppress buttons use data-* attrs + delegated click handler
instead of inline onclick with Jinja2 variable interpolation; remove
| safe filter from user-controlled fields in suppressions.html;
setDuration() takes explicit el param instead of implicit event global
- db.py: thread-local connection reuse with ping(reconnect=True) to
avoid a new TCP handshake per query
- .gitignore: add config.json (contains credentials), __pycache__
- README: full rewrite covering architecture, all 4 pages, alert logic,
config reference, deployment, troubleshooting, security notes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 15:39:48 -05:00
|
|
|
|
Import schema:
|
2026-03-01 23:03:18 -05:00
|
|
|
|
```bash
|
|
|
|
|
|
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
|
|
|
|
|
|
```

### 2. LXC 157 — Install dependencies

```bash
pip3 install -r requirements.txt
# Ensure sshpass is available (used by deploy scripts)
apt install sshpass
```

### 3. Deploy files

```bash
# From dev machine / root/code/gandalf:
for f in app.py db.py monitor.py config.json schema.sql \
         static/style.css static/app.js \
         templates/*.html; do
  sshpass -p 'yourpass' scp -o StrictHostKeyChecking=no \
    "$f" "root@10.10.10.61:/var/www/html/prod/$f"
done
# Then restart the services on the LXC:
systemctl restart gandalf gandalf-monitor
```

### 4. systemd services

**`gandalf.service`** (Flask/gunicorn web app):
```ini
[Unit]
Description=Gandalf Web Dashboard
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
Restart=always

[Install]
WantedBy=multi-user.target
```

**`gandalf-monitor.service`** (background polling daemon):
```ini
[Unit]
Description=Gandalf Network Monitor Daemon
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 monitor.py
Restart=always

[Install]
WantedBy=multi-user.target
```

### 5. Authelia rule (LXC 167)

```yaml
access_control:
  rules:
    - domain: gandalf.lotusguild.org
      policy: one_factor
      subject:
        - group:admin
```

```bash
systemctl restart authelia
```

### 6. NPM reverse proxy

- **Domain:** `gandalf.lotusguild.org`
- **Forward to:** `http://10.10.10.61:8000` (gunicorn direct, no nginx needed on LXC)
- **Forward Auth:** Authelia at `http://10.10.10.167:9091`
- **WebSockets:** Not required

---

## Service Management

```bash
# Status
systemctl status gandalf gandalf-monitor

# Logs (live)
journalctl -u gandalf -f
journalctl -u gandalf-monitor -f

# Restart after code or config changes
systemctl restart gandalf gandalf-monitor
```
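
The status check can also be scripted, for example from a cron job. The following is a minimal Python sketch, assuming it runs on the LXC where `systemctl` is available; `unit_states` and `inactive_units` are hypothetical helper names and not part of Gandalf.

```python
import subprocess
from typing import Callable, Dict, Iterable, List

UNITS = ("gandalf", "gandalf-monitor")


def systemctl_is_active(unit: str) -> str:
    """Return the state printed by `systemctl is-active <unit>`, e.g. "active" or "failed"."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit only means the unit is not active
    )
    return result.stdout.strip() or "unknown"


def unit_states(units: Iterable[str] = UNITS,
                probe: Callable[[str], str] = systemctl_is_active) -> Dict[str, str]:
    """Map each unit name to its reported state; probe can be swapped out in tests."""
    return {unit: probe(unit) for unit in units}


def inactive_units(states: Dict[str, str]) -> List[str]:
    """Return the units whose state is anything other than "active", sorted by name."""
    return sorted(unit for unit, state in states.items() if state != "active")


# Example on the LXC:
# states = unit_states()
# print(states, inactive_units(states))
```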

---

## Troubleshooting

### Monitor not creating tickets
- Verify `config.json → ticket_api.api_key` is set and valid
- Check `journalctl -u gandalf-monitor` for `Ticket creation failed` lines
- Confirm the Tinker Tickets API is reachable from LXC 157

### Link Debug shows no data / "Loading…" forever
- Check `gandalf-monitor.service` is running and has completed at least one cycle
- Check `journalctl -u gandalf-monitor` for Prometheus or UniFi errors
- Verify Prometheus is reachable: `curl 'http://10.10.10.48:9090/api/v1/query?query=up'`

### Link Debug: SFP DOM panel missing
- SFP data requires Pulse worker + SSH access to hosts
- Verify `config.json → pulse.*` is configured and the Pulse worker is running
- Confirm `sshpass` + SSH access from the Pulse worker to each Proxmox host
- Only interfaces with physical SFP modules return DOM data (`ethtool -m`)

### Inspector: path debug section not appearing
- Requires LLDP: run `apt install lldpd && systemctl enable --now lldpd` on each server
- The LLDP `system_name` broadcast by `lldpd` must match the hostname in `config.json → hosts[].name`
- Override: `echo 'configure system hostname large1' > /etc/lldpd.d/hostname.conf && systemctl restart lldpd`
- Allow up to 2 poll cycles (240 s at the default 120 s `monitor.poll_interval`) after installing lldpd for the LLDP table to populate
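
The name-matching requirement above can be checked offline. This Python sketch assumes that `hosts` entries expose a `name` field as referenced above and that the comparison is an exact, case-sensitive string match; `unmatched_lldp_names` is a hypothetical helper, not a function from Gandalf.

```python
from typing import Dict, Iterable, List


def unmatched_lldp_names(lldp_system_names: Iterable[str],
                         configured_hosts: Iterable[Dict[str, str]]) -> List[str]:
    """Return LLDP system names that match no configured host name exactly."""
    known = {host["name"] for host in configured_hosts}
    return sorted(name for name in set(lldp_system_names) if name not in known)


# Example: a host advertising its FQDN would not match the short name in config.json.
# unmatched_lldp_names(["large1", "large1.lan"], [{"name": "large1"}])
```

Any name the function returns is a candidate for the `lldpd` hostname override shown in the list.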

### Inspector: switch chassis shows as flat list (no layout)
- The switch's `model` field from UniFi doesn't match any key in `SWITCH_LAYOUTS` in `inspector.html`
- Check the UniFi API: the model appears in the `link_stats` API response under `unifi_switches.<name>.model`
- Add the model key to `SWITCH_LAYOUTS` in `inspector.html` with the correct row/SFP layout

### Baseline re-initializing on every restart

- `interface_baseline` is stored in the `monitor_state` DB table, so it should survive restarts
- If it appears to reset, check DB connectivity from the monitor daemon

### Interface stuck at "initial_down" forever

- This means the interface was down when the monitor first saw it
- It will begin tracking once it comes up; or manually clear it:
  ```sql
  -- In MariaDB on 10.10.10.50:
  UPDATE monitor_state SET value='{}' WHERE key_name='interface_baseline';
  ```
  Then restart the monitor: `systemctl restart gandalf-monitor`

### Prometheus data missing for a host

```bash
# On the affected host:
systemctl status prometheus-node-exporter
# Verify it's scraped (quote the URL so the shell does not treat '?' as a glob):
curl -s 'http://10.10.10.48:9090/api/v1/query?query=up' | jq '.data.result[] | select(.metric.job=="node")'
```
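
The same check can be done in Python once the JSON has been fetched. The sketch below parses the standard Prometheus instant-query response for `up` and reports which expected instances are absent from the result (not scraped at all) and which are present but down. The function name, the `expected` list, and the sample IP addresses are assumptions for illustration; the `node` job label mirrors the `jq` filter above.

```python
def scrape_status(response: dict, expected: list, job: str = "node") -> dict:
    """Classify expected instances as missing (not in result) or down (up != 1)."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    seen = {}
    for sample in response["data"]["result"]:
        metric = sample["metric"]
        if metric.get("job") != job:
            continue
        # Each instant-vector sample is [timestamp, "value-as-string"].
        _timestamp, value = sample["value"]
        seen[metric.get("instance")] = float(value) == 1.0
    return {
        "missing": sorted(i for i in expected if i not in seen),
        "down": sorted(i for i in expected if seen.get(i) is False),
    }


# Hand-written sample response; the instances are made up.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "10.10.10.2:9100", "job": "node"},
             "value": [1700000000, "1"]},
            {"metric": {"__name__": "up", "instance": "10.10.10.3:9100", "job": "node"},
             "value": [1700000000, "0"]},
            {"metric": {"__name__": "up", "instance": "10.10.10.48:9090", "job": "prometheus"},
             "value": [1700000000, "1"]},
        ],
    },
}
status = scrape_status(
    sample_response,
    ["10.10.10.2:9100", "10.10.10.3:9100", "10.10.10.4:9100"],
)
```

An instance listed under `missing` usually points at the Prometheus scrape config, while one under `down` points at the exporter on the host.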
---
## Development Notes

### File Layout

```
gandalf/
├── app.py                  # Flask web app (routes, auth, API endpoints)
├── monitor.py              # Background daemon (Prometheus, UniFi, Pulse, alert logic)
├── db.py                   # Database operations (MariaDB via pymysql, thread-local conn reuse)
├── schema.sql              # Database schema (network_events, suppression_rules, monitor_state)
├── config.json             # Runtime configuration (not committed with secrets)
├── requirements.txt        # Python dependencies
├── static/
│   ├── style.css           # Terminal aesthetic CSS (CRT scanlines, green-on-black)
│   └── app.js              # Dashboard JS (auto-refresh, host grid, events, suppress modal)
└── templates/
    ├── base.html           # Shared layout (header, nav, footer)
    ├── index.html          # Dashboard page
    ├── links.html          # Link Debug page (server NICs + UniFi switch ports)
    ├── inspector.html      # Visual switch inspector + LLDP path debug
    └── suppressions.html   # Suppression management page
```

### Adding a New Monitored Host

1. Install `prometheus-node-exporter` on the host
2. Add a scrape target to the Prometheus config
3. Add an entry to `config.json → hosts`:
   ```json
   { "name": "newhost", "prometheus_instance": "10.10.10.X:9100" }
   ```
4. Restart the monitor: `systemctl restart gandalf-monitor`
5. For SFP DOM / ethtool: ensure the host is SSH-accessible from the Pulse worker
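
Before restarting the monitor, the new entry from step 3 can be sanity-checked. This is only a sketch: the two field names come from the JSON example above, but the rules applied here (unique `name`, `prometheus_instance` shaped like `host:port`) are assumptions, and `host_entry_problems` is a hypothetical helper rather than something GANDALF ships.

```python
import re

# Loose host:port shape; deliberately does not validate the host part.
_INSTANCE_RE = re.compile(r"[^:\s]+:\d{1,5}")


def host_entry_problems(entry: dict, existing_names: set) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    name = entry.get("name")
    if not isinstance(name, str) or not name:
        problems.append("'name' must be a non-empty string")
    elif name in existing_names:
        problems.append(f"duplicate host name: {name}")
    instance = entry.get("prometheus_instance")
    if not isinstance(instance, str) or not _INSTANCE_RE.fullmatch(instance):
        problems.append("'prometheus_instance' must look like host:port")
    return problems
```

Passing the names already present in `config.json → hosts` as `existing_names` catches accidental duplicates.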

### Adding a New Switch Layout (Inspector)

Find the UniFi model code for the switch (it appears in the `/api/links` JSON response
under `unifi_switches.<switch_name>.model`), then add an entry to `SWITCH_LAYOUTS` in
`templates/inspector.html`:

```javascript
'MYNEWMODEL': {
  rows: [[1,2,3,4,5,6,7,8], [9,10,11,12,13,14,15,16]],  // port_idx by row
  sfp_section: [17, 18],  // separate SFP cage ports (rendered below rows)
  sfp_ports: [],          // port_idx values that are SFP-type within rows
},
```
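
A new layout is easy to get subtly wrong, for example by listing a port twice. The sketch below checks a layout, written as a Python dict with the same three keys, for two consistency rules inferred from the comments above: each `port_idx` appears once across `rows` and `sfp_section`, and every entry in `sfp_ports` also appears in `rows`. These rules and the `layout_problems` helper are assumptions, not checks the inspector itself performs.

```python
def layout_problems(layout: dict) -> list:
    """Return consistency problems for a SWITCH_LAYOUTS-style entry."""
    problems = []
    row_ports = [p for row in layout.get("rows", []) for p in row]
    cage_ports = list(layout.get("sfp_section", []))
    all_ports = row_ports + cage_ports
    duplicates = sorted({p for p in all_ports if all_ports.count(p) > 1})
    if duplicates:
        problems.append(f"port_idx listed more than once: {duplicates}")
    stray = sorted(set(layout.get("sfp_ports", [])) - set(row_ports))
    if stray:
        problems.append(f"sfp_ports not present in rows: {stray}")
    return problems


# The example layout from above, transcribed to Python.
my_new_model = {
    "rows": [[1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16]],
    "sfp_section": [17, 18],
    "sfp_ports": [],
}
```

Calling `layout_problems(my_new_model)` returns an empty list for the example layout.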

### Database Schema Notes

- `network_events`: one row per active event; `resolved_at` is set when the event recovers
- `suppression_rules`: `active=FALSE` when removed; `expires_at` is checked at query time
- `monitor_state`: key/value store; `interface_baseline` and `link_stats` are JSON blobs
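
The "checked at query time" behavior for suppression rules can be illustrated in a few lines. This sketch assumes that a rule with no `expires_at` (NULL in the database) never expires; that is an assumption about the schema, and `rule_is_active` is a hypothetical helper, not a function from `db.py`.

```python
from datetime import datetime, timedelta
from typing import Optional


def rule_is_active(active: bool, expires_at: Optional[datetime], now: datetime) -> bool:
    """A rule applies if it has not been removed and has not yet expired."""
    if not active:
        return False
    # Assumption: a missing expiry means the rule is permanent.
    return expires_at is None or expires_at > now


now = datetime(2026, 3, 3, 12, 0, 0)
still_valid = rule_is_active(True, now + timedelta(hours=1), now)
expired = rule_is_active(True, now - timedelta(hours=1), now)
removed = rule_is_active(False, None, now)
```

Because expiry is evaluated when rules are read, nothing needs to flip `active` to `FALSE` when a rule times out.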

### Security Notes

- **XSS prevention**: all user-controlled data in dynamically generated HTML uses
  `escHtml()` (JS) or Jinja2 auto-escaping (Python). Suppress buttons use `data-*`
  attributes + a single delegated click listener rather than inline `onclick` with
  interpolated strings.
- **Interface name validation**: `monitor.py` validates SSH interface names against
  `^[a-zA-Z0-9_.@-]+$` before use, and additionally wraps them with `shlex.quote()`
  for defense-in-depth.
- **DB parameters**: all SQL uses parameterised queries via pymysql, with no string
  concatenation into SQL.
- **Auth**: Authelia enforces admin-only access at the nginx/LXC 167 layer; the Flask
  app additionally checks the `Remote-User` header via `@require_auth`.
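
The interface-name check described above can be sketched as follows. This is not the code from `monitor.py`; the function names are hypothetical. The character class is the one quoted in the bullet, applied with `re.fullmatch`, which is slightly stricter than `^...$` with `re.match` because `$` would also accept a trailing newline.

```python
import re
import shlex

# Same character class as the documented pattern.
_IFACE_RE = re.compile(r"[a-zA-Z0-9_.@-]+")


def quoted_interface(name: str) -> str:
    """Validate an interface name, then shell-quote it for defense-in-depth."""
    if not _IFACE_RE.fullmatch(name):
        raise ValueError(f"invalid interface name: {name!r}")
    return shlex.quote(name)


def ethtool_dom_command(name: str) -> str:
    """Build the `ethtool -m` command line for a validated interface."""
    return f"ethtool -m {quoted_interface(name)}"
```

For names that pass validation, `shlex.quote()` returns the string unchanged, since every allowed character is already shell-safe; the quoting only matters if the validation is ever loosened.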

### Known Limitations

- Single gunicorn worker (`--workers 1`). `db.py` uses thread-local connection reuse
  (one connection per thread). Multiple workers would each have their own connections,
  which is fine, but the thread-local optimisation only helps within one worker.
- No CSRF tokens on API endpoints. This is mitigated by Authelia session cookies being
  `SameSite=Strict` and the site being admin-only.
- SSH collection via Pulse is synchronous: if Pulse is slow, the entire monitor cycle
  is delayed. The `pulse.timeout` config controls the max wait.
- UniFi LLDP data is only as fresh as the last monitor poll (120s default).
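
The thread-local connection reuse mentioned in the first limitation follows a common pattern, sketched here. This is an illustration and not the contents of `db.py`: `FakeConnection` is a stand-in so the example runs without a database, and `get_connection` is a hypothetical name. A real implementation would create a pymysql connection instead and rely on its `ping(reconnect=True)` method to revive a dropped connection.

```python
import threading


class FakeConnection:
    """Stand-in for a pymysql connection so the sketch needs no database."""

    def __init__(self):
        self.pings = 0

    def ping(self, reconnect: bool = True) -> None:
        # A real connection would reconnect here if the link had dropped.
        self.pings += 1


_local = threading.local()


def get_connection(connect=FakeConnection):
    """Return this thread's connection, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = connect()
        _local.conn = conn
    else:
        # Reused connection: check liveness before handing it out.
        conn.ping(reconnect=True)
    return conn
```

Each thread sees its own `_local.conn`, so repeated calls within a thread skip the TCP handshake, while separate threads (or separate worker processes) never share a connection.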