LotusGuild/gandalf

jared 9c5a88fbce

Lint / Python (flake8) (push) Successful in 41s
Lint / JS (eslint) (push) Successful in 7s
Security / Python Security (bandit) (push) Successful in 40s
Test / Python Tests (pytest) (push) Successful in 1m18s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 4s

Guard ticket creation against duplicates using event's existing ticket_id

upsert_event now returns ticket_id (4th element) so callers can skip
ticket creation when one already exists. This prevents calling the ticket
API every poll cycle for ongoing issues while still retrying if the
previous creation attempt failed (ticket_id stays NULL until success).

Cluster events use (is_new or not ticket_id) so they too get retried
on failure rather than relying solely on is_new.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-14 11:09:50 -04:00

.gitea/workflows

ci: add notify-failure, deploy tagging, and coverage reporting

2026-04-14 15:16:02 -04:00

static

fix: escape ticket_id text content in dynamic events table

2026-05-11 23:02:09 -04:00

templates

Fix inspector auto-refresh ignoring 'Off' setting on page load

2026-05-13 13:20:42 -04:00

tests

Fix monitor loop double-sleep on error; add grep -F regression test

2026-05-13 13:16:43 -04:00

.coveragerc

ci: add notify-failure, deploy tagging, and coverage reporting

2026-04-14 15:16:02 -04:00

.eslintrc.json

Add ESLint config enforcing no-undef and eqeqeq

2026-05-13 15:33:26 -04:00

.flake8

fix: resolve bandit B324/B104 and flake8 E302/E303/E501 in app.py

2026-04-25 20:51:41 -04:00

.gitignore

Add JS linting and deploy gating to CI pipeline

2026-04-14 10:14:33 -04:00

app.py

Fix misleading docstring on _purge_old_jobs_loop

2026-05-14 11:06:28 -04:00

config.json

Switch LDAP bind to dedicated gandalf service account

2026-04-30 21:21:04 -04:00

db.py

Guard ticket creation against duplicates using event's existing ticket_id

2026-05-14 11:09:50 -04:00

diagnose.py

Use grep -F in dmesg filter to prevent interface name treated as regex

2026-05-13 11:12:02 -04:00

gandalf-monitor.service

Complete rewrite: full-featured network monitoring dashboard

2026-03-01 23:03:18 -05:00

monitor.py

Guard ticket creation against duplicates using event's existing ticket_id

2026-05-14 11:09:50 -04:00

package-lock.json

Add JS linting and deploy gating to CI pipeline

2026-04-14 10:14:33 -04:00

package.json

Add JS linting and deploy gating to CI pipeline

2026-04-14 10:14:33 -04:00

README.md

Add CI badges and CI/CD section to README

2026-04-14 12:53:47 -04:00

requirements.txt

Add LDAP avatar photos, UX polish, and TDS component upgrades

2026-04-30 21:09:56 -04:00

schema.sql

Add compound DB indexes for hot query paths

2026-03-14 14:24:40 -04:00

README.md

GANDALF (Global Advanced Network Detection And Link Facilitator)

Because it shall not let problems pass.

Network monitoring dashboard for the LotusGuild Proxmox cluster. Deployed on LXC 157 (monitor-02 / 10.10.10.9), reachable at gandalf.lotusguild.org.

Design System: web_template — shared CSS, JS, and layout patterns for all LotusGuild apps

Styling & Layout

GANDALF uses the LotusGuild Terminal Design System. For all styling, component, and layout documentation see:

web_template/README.md — full component reference, CSS variables, JS API
web_template/base.css — unified CSS (.lt-* classes)
web_template/base.js — window.lt utilities (toast, modal, auto-refresh, fetch helpers)
web_template/python/base.html — Jinja2 base template
web_template/python/auth.py — @require_auth decorator pattern

Architecture

Two processes share a MariaDB database:

Process	Service	Role
`app.py`	`gandalf.service`	Flask web dashboard (gunicorn, port 8000)
`monitor.py`	`gandalf-monitor.service`	Background polling daemon

[Prometheus :9090]  ──▶
[UniFi Controller]  ──▶  monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[Pulse Worker]      ──▶
[SSH / ethtool]     ──▶

Data Sources

Source	What it provides
Prometheus (`10.10.10.48:9090`)	Physical NIC link state + traffic/error rates via `node_exporter`
UniFi API (`https://10.10.10.1`)	Switch port stats, device status, LLDP neighbor table, PoE data
Pulse Worker	SSH relay — runs `ethtool` + SFP DOM queries on each Proxmox host
Ping	Reachability for hosts without `node_exporter` (e.g. PBS)

Monitored Hosts (Prometheus / node_exporter)

Host	Prometheus Instance
large1	10.10.10.2:9100
compute-storage-01	10.10.10.4:9100
micro1	10.10.10.8:9100
monitor-02	10.10.10.9:9100
compute-storage-gpu-01	10.10.10.10:9100
storage-01	10.10.10.11:9100

Ping-only (no node_exporter): pbs (10.10.10.3)
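
As a rough illustration of how link state for one of the hosts above could be requested from Prometheus, the sketch below builds an instant-query URL for the standard HTTP API. It assumes link state comes from node_exporter's `node_network_up` metric; the queries monitor.py actually issues are not shown in this README, and `build_link_state_query` is a hypothetical helper name.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://10.10.10.48:9090"  # prometheus.url from config.json


def build_link_state_query(instance: str) -> str:
    """Return the instant-query URL for NIC link state on one host.

    Assumption: link state is read from node_exporter's node_network_up
    metric; the real query used by monitor.py may differ.
    """
    promql = f'node_network_up{{instance="{instance}"}}'
    # urlencode percent-encodes the braces, quotes and colon in the PromQL.
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


print(build_link_state_query("10.10.10.2:9100"))
```

Encoding the query with `urlencode` avoids hand-escaping the label matcher, which matters once instance labels contain `:` or quotes.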

Pages

Dashboard (`/`)

Real-time host status grid with per-NIC link state (UP / DOWN / degraded)
Network topology diagram (Internet → Gateway → Switches → Hosts)
UniFi device table (switches, APs, gateway)
Active alerts table with severity, target, consecutive failures, ticket link
Quick-suppress modal: apply timed or manual suppression from any alert row
Auto-refreshes every 30 seconds via /api/status + /api/network

Link Debug (`/links`)

Per-interface statistics collected every poll cycle. All panels are collapsible (click header or use Collapse All / Expand All). Collapse state persists across page refreshes via sessionStorage.

Server NICs (via Prometheus + SSH/ethtool):

Speed, duplex, auto-negotiation, link detected
TX/RX rate bars (bandwidth utilisation % of link capacity)
TX/RX error and drop rates per second
Carrier changes (cumulative since boot — watch for flapping)
SFP / Optical panel (when SFP module present): vendor/PN, temp, voltage, bias current, TX power (dBm), RX power (dBm), RX−TX delta, per-stat bars
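
The utilisation bars and the RX−TX delta listed above reduce to simple arithmetic. A minimal sketch, assuming rates are available in bytes per second and link speed in Mbit/s (the units and function names here are illustrative, not taken from the codebase):

```python
def utilisation_pct(rate_bytes_per_s: float, link_speed_mbps: int) -> float:
    """Bandwidth utilisation as a percentage of link capacity, capped at 100."""
    if link_speed_mbps <= 0:
        return 0.0  # unknown or zero speed: nothing meaningful to show
    capacity_bits_per_s = link_speed_mbps * 1_000_000
    return min(100.0, rate_bytes_per_s * 8 / capacity_bits_per_s * 100)


def rx_tx_delta_db(rx_power_dbm: float, tx_power_dbm: float) -> float:
    """Difference between received and transmitted optical power, in dB."""
    return rx_power_dbm - tx_power_dbm


# 12.5 MB/s on a 1 Gbit/s link
print(utilisation_pct(12_500_000, 1000))
# RX at -5.5 dBm against TX at -2.0 dBm
print(rx_tx_delta_db(-5.5, -2.0))
```

Note that the delta compares the local receiver with the local transmitter, so it is only a rough indicator of path loss when both ends use similar optics.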

UniFi Switch Ports (via UniFi API):

Port number badge (#N), UPLINK badge, PoE draw badge
LLDP neighbor line: → system_name (port_id) when neighbor is detected
PoE class and max wattage line
Speed, duplex, auto-neg, TX/RX rates, errors, drops

Inspector (`/inspector`)

Visual switch chassis diagrams. Each switch is rendered model-accurately using layout config in the template (SWITCH_LAYOUTS).

Port block colours:

Colour	State
Green	Up, no active PoE
Amber	Up with active PoE draw
Cyan	Uplink port (up)
Grey	Down
White outline	Currently selected

Clicking a port opens the right-side detail panel showing:

Link stats (status, speed, duplex, auto-neg, media type)
PoE (class, max wattage, current draw, mode)
Traffic (TX/RX rates)
Errors/drops per second
LLDP Neighbor section (system name, port ID, chassis ID, management IPs)
Path Debug (auto-appears when LLDP system_name matches a known server): two-column comparison of the switch port stats vs. the server NIC stats, including SFP DOM data if the server side has an SFP module

LLDP path debug requirements:

Server must run lldpd: apt install lldpd && systemctl enable --now lldpd
lldpd hostname must match the key in data.hosts (set via config.json → hosts)
Switch has LLDP enabled (UniFi default: on)
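
The hostname requirement above amounts to a lookup of the LLDP system_name against the configured host names. A minimal sketch, assuming an exact (whitespace-trimmed) comparison; `match_lldp_neighbor` is a hypothetical name and the real matching rules may be looser:

```python
from typing import Iterable, Optional


def match_lldp_neighbor(system_name: Optional[str],
                        known_hosts: Iterable[str]) -> Optional[str]:
    """Return the configured host whose name equals the LLDP system_name."""
    if not system_name:
        return None  # port has no LLDP neighbor
    candidate = system_name.strip()
    for host in known_hosts:
        if host == candidate:
            return host
    return None


HOSTS = ["large1", "compute-storage-01", "micro1"]
print(match_lldp_neighbor("large1", HOSTS))            # matches a known host
print(match_lldp_neighbor("large1.localdomain", HOSTS))  # FQDN does not match
```

Under this assumption a server announcing its FQDN would not trigger Path Debug, which is why overriding the lldpd hostname can be necessary.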

Supported switch models (set SWITCH_LAYOUTS keys to your UniFi model codes):

Key	Model	Layout
`USF5P`	UniFi Switch Flex 5 PoE	4×RJ45 + 1×SFP uplink
`USL8A`	UniFi Switch Aggregation	8×SFP (2 rows of 4)
`US24PRO`	UniFi Switch Pro 24	24×RJ45 staggered + 2×SFP
`USPPDUP`	Custom/other	Single-port fallback
`USMINI`	UniFi Switch Mini	5-port row

Add new layouts by adding a key to SWITCH_LAYOUTS matching the model field returned by the UniFi API for that device.

Suppressions (`/suppressions`)

Create timed (30 min / 1 hr / 4 hr / 8 hr) or manual suppressions
Target types: host, interface, UniFi device, or global
Active suppressions table with one-click removal
Suppression history (last 50)
Available targets reference grid (all known hosts + interfaces)

Alert Logic

Ticket Triggers

Condition	Priority
UniFi device offline (≥2 consecutive checks)	P2 High
Proxmox host NIC link-down regression (≥2 consecutive checks)	P2 High
Host unreachable via ping (≥2 consecutive checks)	P2 High
≥3 hosts simultaneously reporting interface failures	P1 Critical
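
The trigger table can be read as two threshold checks. The sketch below is an illustration under stated assumptions, not the code in monitor.py: the function names are hypothetical, and the defaults mirror `monitor.failure_threshold` and `monitor.cluster_threshold` from config.json.

```python
from typing import Collection, Optional

FAILURE_THRESHOLD = 2   # monitor.failure_threshold default
CLUSTER_THRESHOLD = 3   # monitor.cluster_threshold default


def should_ticket(consecutive_failures: int,
                  threshold: int = FAILURE_THRESHOLD) -> bool:
    """A single target qualifies for a P2 ticket after enough failed checks."""
    return consecutive_failures >= threshold


def cluster_priority(hosts_with_failures: Collection[str],
                     threshold: int = CLUSTER_THRESHOLD) -> Optional[str]:
    """Escalate to P1 when enough hosts report interface failures at once."""
    if len(hosts_with_failures) >= threshold:
        return "P1 Critical"
    return None


print(should_ticket(1), should_ticket(2))
print(cluster_priority({"large1", "micro1"}))
print(cluster_priority({"large1", "micro1", "storage-01"}))
```

Requiring two consecutive failures means a single missed poll never opens a ticket, at the cost of one extra poll interval before a real outage is reported.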

Baseline Tracking

Interfaces that are down on first observation (unused ports, unplugged cables) are recorded as initial_down and never alerted. Only UP→DOWN regressions generate tickets. Baseline is stored in MariaDB and survives daemon restarts.
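
One way to picture the baseline rule is as a small state machine per interface. This is a sketch only: the `initial_down` value comes from the text above, while the `"up"` state, the dictionary layout, and the `classify` helper are assumptions made for illustration.

```python
def classify(baseline: dict, key: str, is_up: bool) -> str:
    """Update the baseline for one interface and say how to treat this check."""
    state = baseline.get(key)
    if state is None:
        # First observation: remember whether the link started up or down.
        baseline[key] = "up" if is_up else "initial_down"
        return "baseline_recorded"
    if state == "initial_down":
        if is_up:
            baseline[key] = "up"  # start tracking now that it has come up
            return "ok"
        return "ignored"          # never alert on a port that was always down
    # state == "up": a down reading is an UP -> DOWN regression
    return "ok" if is_up else "regression"


baseline = {}
for reading in (False, False, True, False):
    print(classify(baseline, "large1/eth1", reading))
```

In the real daemon the dictionary would be loaded from and saved to MariaDB so the classification survives restarts.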

Suppression Targets

Type	Suppresses
`host`	All interface alerts for a named host
`interface`	A specific NIC on a specific host
`unifi_device`	A specific UniFi device
`all`	Everything (global maintenance mode)

Suppressions can be manual (persist until removed) or timed (auto-expire). Expired suppressions are checked at evaluation time — no background cleanup needed.
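
Checking expiry at evaluation time can be sketched as below. The `active` and `expires_at` fields appear in the schema notes; the other field names (`target_type`, `target_name`, `target_detail`) and both helper functions are hypothetical and chosen only to make the example self-contained.

```python
from datetime import datetime, timedelta, timezone


def rule_applies(rule, host=None, interface=None, device=None):
    """Does this rule's target cover the alert being evaluated?"""
    kind = rule["target_type"]
    if kind == "all":
        return True
    if kind == "host":
        return host is not None and rule["target_name"] == host
    if kind == "interface":
        return (host is not None and interface is not None
                and rule["target_name"] == host
                and rule.get("target_detail") == interface)
    if kind == "unifi_device":
        return device is not None and rule["target_name"] == device
    return False


def is_suppressed(rules, now, **target):
    """True if any active, unexpired rule covers the target."""
    for rule in rules:
        if not rule["active"]:
            continue
        expires = rule.get("expires_at")
        if expires is not None and expires <= now:
            continue  # timed rule has lapsed; no cleanup job required
        if rule_applies(rule, **target):
            return True
    return False


NOW = datetime(2026, 5, 14, 12, 0, tzinfo=timezone.utc)
RULES = [
    {"target_type": "host", "target_name": "large1", "target_detail": None,
     "active": True, "expires_at": NOW - timedelta(minutes=5)},   # expired
    {"target_type": "interface", "target_name": "micro1", "target_detail": "eth0",
     "active": True, "expires_at": None},                          # manual
]
print(is_suppressed(RULES, NOW, host="large1", interface="eth0"))
print(is_suppressed(RULES, NOW, host="micro1", interface="eth0"))
```

Because expired rules are simply skipped, a stale row in the table is harmless and the history view can keep showing it.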

Configuration (`config.json`)

Shared by both processes. Located in the working directory (/var/www/html/prod/).

{
  "database": {
    "host": "10.10.10.50",
    "port": 3306,
    "user": "gandalf",
    "password": "...",
    "name": "gandalf"
  },
  "prometheus": {
    "url": "http://10.10.10.48:9090"
  },
  "unifi": {
    "controller": "https://10.10.10.1",
    "api_key": "...",
    "site_id": "default"
  },
  "ticket_api": {
    "url": "https://t.lotusguild.org/api/tickets",
    "api_key": "..."
  },
  "pulse": {
    "url": "http://<pulse-host>:<port>",
    "api_key": "...",
    "worker_id": "...",
    "timeout": 45
  },
  "auth": {
    "allowed_groups": ["admin"]
  },
  "hosts": [
    { "name": "large1",               "prometheus_instance": "10.10.10.2:9100" },
    { "name": "compute-storage-01",   "prometheus_instance": "10.10.10.4:9100" },
    { "name": "micro1",               "prometheus_instance": "10.10.10.8:9100" },
    { "name": "monitor-02",           "prometheus_instance": "10.10.10.9:9100" },
    { "name": "compute-storage-gpu-01", "prometheus_instance": "10.10.10.10:9100" },
    { "name": "storage-01",           "prometheus_instance": "10.10.10.11:9100" }
  ],
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [
      { "name": "pbs", "ip": "10.10.10.3" }
    ]
  }
}

Key Config Fields

Key	Description
`database.*`	MariaDB credentials (LXC 149 at 10.10.10.50)
`prometheus.url`	Prometheus base URL
`unifi.controller`	UniFi controller base URL (HTTPS, self-signed cert ignored)
`unifi.api_key`	UniFi API key from controller Settings → API
`unifi.site_id`	UniFi site ID (default: `default`)
`ticket_api.api_key`	Tinker Tickets bearer token
`pulse.url`	Pulse worker API base URL (for SSH relay)
`pulse.worker_id`	Which Pulse worker runs ethtool collection
`pulse.timeout`	Max seconds to wait for SSH collection per host
`auth.allowed_groups`	Authelia groups that may access Gandalf
`hosts`	Maps Prometheus instance labels → display hostnames
`monitor.poll_interval`	Seconds between full check cycles (default: 120)
`monitor.failure_threshold`	Consecutive failures before creating ticket (default: 2)
`monitor.cluster_threshold`	Hosts with failures to trigger cluster-wide P1 (default: 3)
`monitor.ping_hosts`	Hosts checked only by ping (no node_exporter)
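
To show how the documented defaults and the `hosts` mapping might be applied when reading config.json, here is a small sketch. The helper names are hypothetical and the sample document is trimmed to the relevant keys; how app.py and monitor.py actually load the file is not described in this README.

```python
import json

# Defaults as documented in the table above.
MONITOR_DEFAULTS = {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [],
}


def monitor_settings(config: dict) -> dict:
    """Overlay the 'monitor' section on the documented defaults."""
    return {**MONITOR_DEFAULTS, **config.get("monitor", {})}


def instance_to_name(config: dict) -> dict:
    """Map Prometheus instance labels to display hostnames."""
    return {h["prometheus_instance"]: h["name"] for h in config.get("hosts", [])}


SAMPLE = """
{
  "hosts": [
    { "name": "large1", "prometheus_instance": "10.10.10.2:9100" },
    { "name": "micro1", "prometheus_instance": "10.10.10.8:9100" }
  ],
  "monitor": { "poll_interval": 60 }
}
"""
config = json.loads(SAMPLE)
print(monitor_settings(config))
print(instance_to_name(config))
```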

Deployment (LXC 157)

1. Database — MariaDB LXC 149 (`10.10.10.50`)

CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;

Import schema:

mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql

2. LXC 157 — Install dependencies

pip3 install -r requirements.txt
# Ensure sshpass is available (used by deploy scripts)
apt install sshpass

3. Deploy files

# From the dev machine, in /root/code/gandalf:
for f in app.py db.py monitor.py config.json schema.sql \
          static/style.css static/app.js \
          templates/*.html; do
  sshpass -p 'yourpass' scp -o StrictHostKeyChecking=no \
    "$f" "root@10.10.10.61:/var/www/html/prod/$f"
done
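# Restart both services on the target host (not on the dev machine):
sshpass -p 'yourpass' ssh -o StrictHostKeyChecking=no root@10.10.10.61 \
  systemctl restart gandalf gandalf-monitor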

4. systemd services

gandalf.service (Flask/gunicorn web app):

[Unit]
Description=Gandalf Web Dashboard
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
Restart=always

[Install]
WantedBy=multi-user.target

gandalf-monitor.service (background polling daemon):

[Unit]
Description=Gandalf Network Monitor Daemon
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 monitor.py
Restart=always

[Install]
WantedBy=multi-user.target

5. Authelia rule (LXC 167)

access_control:
  rules:
    - domain: gandalf.lotusguild.org
      policy: one_factor
      subject:
        - group:admin

systemctl restart authelia

6. NPM reverse proxy

Domain: gandalf.lotusguild.org
Forward to: http://10.10.10.61:8000 (gunicorn direct, no nginx needed on LXC)
Forward Auth: Authelia at http://10.10.10.167:9091
WebSockets: Not required

Service Management

# Status
systemctl status gandalf gandalf-monitor

# Logs (live)
journalctl -u gandalf -f
journalctl -u gandalf-monitor -f

# Restart after code or config changes
systemctl restart gandalf gandalf-monitor

Troubleshooting

Monitor not creating tickets

Verify config.json → ticket_api.api_key is set and valid
Check journalctl -u gandalf-monitor for Ticket creation failed lines
Confirm the Tinker Tickets API is reachable from LXC 157

Link Debug shows no data / "Loading…" forever

Check gandalf-monitor.service is running and has completed at least one cycle
Check journalctl -u gandalf-monitor for Prometheus or UniFi errors
Verify Prometheus is reachable: curl 'http://10.10.10.48:9090/api/v1/query?query=up'

Link Debug: SFP DOM panel missing

SFP data requires Pulse worker + SSH access to hosts
Verify config.json → pulse.* is configured and the Pulse worker is running
Confirm sshpass + SSH access from the Pulse worker to each Proxmox host
Only interfaces with physical SFP modules return DOM data (ethtool -m)

Inspector: path debug section not appearing

Requires LLDP: run apt install lldpd && systemctl enable --now lldpd on each server
The LLDP system_name broadcast by lldpd must match the hostname in config.json → hosts[].name
- Override: echo 'configure system hostname large1' > /etc/lldpd.d/hostname.conf && systemctl restart lldpd
Allow up to 2 poll cycles (240s) after installing lldpd for LLDP table to populate

Inspector: switch chassis shows as flat list (no layout)

The switch's model field from UniFi doesn't match any key in SWITCH_LAYOUTS in inspector.html
Check the UniFi API: the model appears in the link_stats API response under unifi_switches.<name>.model
Add the model key to SWITCH_LAYOUTS in inspector.html with the correct row/SFP layout

Baseline re-initializing on every restart

interface_baseline is stored in the monitor_state DB table; survives restarts
If it appears to reset: check DB connectivity from the monitor daemon

Interface stuck at "initial_down" forever

This means the interface was down when the monitor first saw it
It will begin tracking once it comes up. Alternatively, clear the baseline manually (this resets it for every interface, not just the stuck one):
```
-- In MariaDB on 10.10.10.50:
UPDATE monitor_state SET value='{}' WHERE key_name='interface_baseline';
```
Then restart the monitor: systemctl restart gandalf-monitor

Prometheus data missing for a host

# On the affected host:
systemctl status prometheus-node-exporter
# Verify it's scraped:
curl -s 'http://10.10.10.48:9090/api/v1/query?query=up' | jq '.data.result[] | select(.metric.job=="node")'

Development Notes

File Layout

gandalf/
├── app.py              # Flask web app (routes, auth, API endpoints)
├── monitor.py          # Background daemon (Prometheus, UniFi, Pulse, alert logic)
├── db.py               # Database operations (MariaDB via pymysql, thread-local conn reuse)
├── schema.sql          # Database schema (network_events, suppression_rules, monitor_state)
├── config.json         # Runtime configuration (not committed with secrets)
├── requirements.txt    # Python dependencies
├── static/
│   ├── style.css       # Terminal aesthetic CSS (CRT scanlines, green-on-black)
│   └── app.js          # Dashboard JS (auto-refresh, host grid, events, suppress modal)
└── templates/
    ├── base.html       # Shared layout (header, nav, footer)
    ├── index.html      # Dashboard page
    ├── links.html      # Link Debug page (server NICs + UniFi switch ports)
    ├── inspector.html  # Visual switch inspector + LLDP path debug
    └── suppressions.html # Suppression management page

Adding a New Monitored Host

Install prometheus-node-exporter on the host
Add a scrape target to Prometheus config

Add an entry to config.json → hosts:

{ "name": "newhost", "prometheus_instance": "10.10.10.X:9100" }

Restart monitor: systemctl restart gandalf-monitor
For SFP DOM / ethtool: ensure the host is SSH-accessible from the Pulse worker

Adding a New Switch Layout (Inspector)

Find the UniFi model code for the switch (it appears in the /api/links JSON response under unifi_switches.<switch_name>.model), then add to SWITCH_LAYOUTS in templates/inspector.html:

'MYNEWMODEL': {
  rows: [[1,2,3,4,5,6,7,8], [9,10,11,12,13,14,15,16]],  // port_idx by row
  sfp_section: [17, 18],  // separate SFP cage ports (rendered below rows)
  sfp_ports: [],          // port_idx values that are SFP-type within rows
},

Database Schema Notes

network_events: one row per active event; resolved_at is set when recovered
suppression_rules: active=FALSE when removed; expires_at checked at query time
monitor_state: key/value store; interface_baseline and link_stats are JSON blobs

Security Notes

XSS prevention: all user-controlled data in dynamically generated HTML uses escHtml() (JS) or Jinja2 auto-escaping (Python). Suppress buttons use data-* attributes + a single delegated click listener rather than inline onclick with interpolated strings.
Interface name validation: monitor.py validates SSH interface names against ^[a-zA-Z0-9_.@-]+$ before use, and additionally wraps them with shlex.quote() for defense-in-depth.
DB parameters: all SQL uses parameterised queries via pymysql — no string concatenation into SQL.
Auth: Authelia enforces admin-only access at the nginx/LXC 167 layer; the Flask app additionally checks the Remote-User header via @require_auth.
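
The interface-name check described above can be sketched as follows. The regular expression is the one quoted in the note; `safe_ethtool_command` is a hypothetical helper, and `re.fullmatch` is used here (an assumption about strictness, not necessarily what monitor.py does) so that a trailing newline is rejected as well.

```python
import re
import shlex

# Pattern quoted in the security note above.
_IFACE_RE = re.compile(r"^[a-zA-Z0-9_.@-]+$")


def safe_ethtool_command(iface: str) -> str:
    """Validate an interface name, then quote it for use in a shell command."""
    if not _IFACE_RE.fullmatch(iface):
        raise ValueError(f"invalid interface name: {iface!r}")
    # shlex.quote is redundant for names that pass the regex; it is kept as
    # defense-in-depth in case the pattern is ever loosened.
    return f"ethtool {shlex.quote(iface)}"


print(safe_ethtool_command("enp3s0f0.100"))
try:
    safe_ethtool_command("eth0; rm -rf /")
except ValueError as exc:
    print("rejected:", exc)
```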

Known Limitations

Single gunicorn worker (--workers 1): db.py uses thread-local connection reuse (one connection per thread). Running multiple workers would still work, since each worker process would simply hold its own connections, but the thread-local optimisation only helps within one worker.
No CSRF tokens on API endpoints — mitigated by Authelia session cookies being SameSite=Strict and the site being admin-only.
SSH collection via Pulse is synchronous — if Pulse is slow, the entire monitor cycle is delayed. The pulse.timeout config controls the max wait.
UniFi LLDP data is only as fresh as the last monitor poll (120s default).
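
For readers unfamiliar with the thread-local connection reuse mentioned in the first limitation, the pattern looks roughly like this. It is a generic sketch, not the contents of db.py: `get_conn` and the `connect` factory argument are hypothetical, and a real implementation would also need to detect and replace dead connections.

```python
import threading

_local = threading.local()


def get_conn(connect):
    """Return this thread's cached connection, creating it on first use.

    `connect` is any zero-argument factory, e.g. a function wrapping
    pymysql.connect(...) with the database settings from config.json.
    """
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = connect()
        _local.conn = conn  # visible only to the current thread
    return conn


# Demonstration with a stand-in factory instead of a real database.
first = get_conn(object)
second = get_conn(object)
print(first is second)  # same thread reuses the same object

other = []
t = threading.Thread(target=lambda: other.append(get_conn(object)))
t.start()
t.join()
print(other[0] is first)  # a different thread gets its own connection
```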

CI / CD

Workflow	Purpose	Triggers
`lint.yml` (python-lint)	flake8 on all `.py` files	Every push and PR
`lint.yml` (js-lint)	ESLint on `static/`	Every push and PR
`test.yml`	pytest — 33 tests for `diagnose.py` static methods	Every push and PR
`security.yml`	bandit `-ll` (medium+ severity)	Every push, PR, and weekly Monday 6am
`deploy` job in `lint.yml`	Calls the `gandalf-deploy` webhook on CT157 (10.10.10.61)	Push to `main` only, after both lint jobs pass

Branch protection is enabled on main — both lint jobs must pass before any PR can merge.

Tests live in tests/test_diagnose.py and cover DiagnosticsRunner static methods: build_ssh_command, parse_output, parse_sysfs_stats, parse_ethtool, and variants.
