Jared Vititoe 67072099ca docs: update README for storage-01 Prometheus migration
- storage-01 now monitored via Prometheus node_exporter (10.10.10.11:9100),
  removed from ping_hosts
- Updated data sources table (6 hosts via Prometheus, pbs only via ping)
- Added storage-01 to monitored hosts table
- Fixed Authelia reload command (restart, not reload)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 23:05:27 -05:00
2025-03-01 13:34:25 -05:00

GANDALF (Global Advanced Network Detection And Link Facilitator)

Because it shall not let problems pass!

Network monitoring dashboard for the LotusGuild Proxmox cluster. Deployed on LXC 157 (monitor-02 / 10.10.10.9), reachable at gandalf.lotusguild.org.


Architecture

Gandalf is two processes that share a MariaDB database:

Process      Service                   Role
app.py       gandalf.service           Flask web dashboard (gunicorn, port 8000)
monitor.py   gandalf-monitor.service   Background polling daemon

[Prometheus :9090] ──▶
                        monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[UniFi Controller] ──▶

Data Sources

Source                           What it monitors
Prometheus (10.10.10.48:9090)    Physical NIC link state (node_network_up) for 6 Proxmox hosts
UniFi API (https://10.10.10.1)   Switch, AP, and gateway device status
Ping                             Reachability of pbs (10.10.10.3), which has no node_exporter
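
To illustrate the Prometheus source, here is a minimal sketch that turns the JSON returned by a Prometheus instant query for node_network_up (the /api/v1/query endpoint) into per-interface link states. The function name, the sample payload, and the interface names are hypothetical. How monitor.py actually queries Prometheus and filters out virtual interfaces is not shown in this README.

```python
from typing import Dict, Tuple


def parse_link_states(payload: dict) -> Dict[Tuple[str, str], bool]:
    """Map (instance, device) to True when node_network_up reports 1."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    states: Dict[Tuple[str, str], bool] = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        # Each instant-vector sample carries [timestamp, "value as string"].
        _timestamp, value = sample["value"]
        states[(labels["instance"], labels["device"])] = float(value) == 1.0
    return states


# Made-up payload shaped like a /api/v1/query response (assumption).
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno1"},
             "value": [1700000000.0, "1"]},
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno2"},
             "value": [1700000000.0, "0"]},
        ],
    },
}

link_states = parse_link_states(sample_payload)
```

Keying on (instance, device) lets the result be joined against the hosts mapping in config.json to get a hostname per interface.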

Monitored Hosts (Prometheus / node_exporter)

Host                     Instance
large1                   10.10.10.2:9100
compute-storage-01       10.10.10.4:9100
micro1                   10.10.10.8:9100
monitor-02               10.10.10.9:9100
compute-storage-gpu-01   10.10.10.10:9100
storage-01               10.10.10.11:9100

Features

  • Interface monitoring: tracks link state for all physical NICs via Prometheus
  • UniFi device monitoring: detects offline switches, APs, and gateways
  • Ping reachability: covers hosts without node_exporter
  • Cluster-wide detection: creates a separate P1 ticket when 3+ hosts have simultaneous interface failures (likely a switch failure)
  • Smart baseline tracking: interfaces that are down on first observation (unused ports) are never alerted on; only UP→DOWN regressions trigger tickets
  • Ticket creation: integrates with Tinker Tickets (t.lotusguild.org) with 24-hour deduplication
  • Alert suppression: manual toggle or timed windows (30min / 1hr / 4hr / 8hr / manual)
  • Authelia SSO: restricted to the admin group via forward-auth headers
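
The smart baseline rule can be sketched roughly as follows. This is an illustration under assumptions, not the code in monitor.py: the function name, the "host:nic" key format, and the "up" state label are hypothetical. Only the "initial_down" label appears elsewhere in this README.

```python
from typing import Dict, Optional


def classify(baseline: Dict[str, str], key: str, is_up: bool) -> Optional[str]:
    """Update `baseline` in place; return "regression" on an UP -> DOWN change."""
    previous = baseline.get(key)
    if previous is None:
        # First observation: a port that starts down is assumed to be unused.
        baseline[key] = "up" if is_up else "initial_down"
        return None
    if is_up:
        # Covers an initial_down port coming up: it is tracked from now on.
        baseline[key] = "up"
        return None
    if previous == "up":
        # The baseline stays "up", so repeated checks keep reporting the regression.
        return "regression"
    return None  # still initial_down, never alerted on


baseline: Dict[str, str] = {}
classify(baseline, "large1:eno1", True)    # first seen up
classify(baseline, "large1:eno2", False)   # first seen down, recorded as initial_down
result = classify(baseline, "large1:eno1", False)  # was up, now down
```

Leaving the baseline at "up" after a regression means each later poll reports the same regression, which is what a consecutive-failure counter needs.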

Alert Logic

Ticket Triggers

Condition                                                       Priority
UniFi device offline (2+ consecutive checks)                    P2 High
Proxmox host NIC link-down regression (2+ consecutive checks)   P2 High
Host unreachable via ping (2+ consecutive checks)               P2 High
3+ hosts simultaneously reporting interface failures            P1 Critical
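
The trigger table combines two thresholds: consecutive failures per target, and the number of hosts failing at once. The sketch below shows one way to express both. The class and function names are hypothetical and the constants mirror the documented defaults; the real logic lives in monitor.py and may differ.

```python
from collections import defaultdict
from typing import Set

FAILURE_THRESHOLD = 2  # mirrors the monitor.failure_threshold default
CLUSTER_THRESHOLD = 3  # mirrors the monitor.cluster_threshold default


class FailureTracker:
    """Counts consecutive failed checks per target (hypothetical helper)."""

    def __init__(self, threshold: int = FAILURE_THRESHOLD) -> None:
        self.threshold = threshold
        self.counts = defaultdict(int)

    def record(self, target: str, failed: bool) -> bool:
        """Return True once `target` has failed `threshold` checks in a row."""
        if not failed:
            self.counts[target] = 0  # any success resets the streak
            return False
        self.counts[target] += 1
        return self.counts[target] >= self.threshold


def is_cluster_event(failing_hosts: Set[str],
                     threshold: int = CLUSTER_THRESHOLD) -> bool:
    """True when enough hosts fail together to suggest a shared cause."""
    return len(failing_hosts) >= threshold


tracker = FailureTracker()
first = tracker.record("large1/eno1", failed=True)   # one failure, below threshold
second = tracker.record("large1/eno1", failed=True)  # second in a row, ticket-worthy
```

Resetting the counter on any successful check keeps a single flapping poll from accumulating toward a ticket.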

Suppression Targets

Type           Suppresses
host           All interface alerts for a named host
interface      A specific NIC on a specific host
unifi_device   A specific UniFi device
all            Everything (global maintenance mode)

Suppressions can be manual (persist until removed) or timed (auto-expire).
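
A rough sketch of how the four suppression types and the manual/timed distinction could be evaluated. The dataclass, its field names, and the alert "kind" strings are assumptions made for illustration; the actual schema and matching code are not shown in this README.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Suppression:
    target_type: str                       # "host", "interface", "unifi_device", "all"
    host: Optional[str] = None             # used by "host" and "interface"
    name: Optional[str] = None             # NIC name or UniFi device name
    expires_at: Optional[datetime] = None  # None means manual (never auto-expires)

    def active(self, now: datetime) -> bool:
        return self.expires_at is None or now < self.expires_at

    def covers(self, kind: str, host: Optional[str], name: Optional[str]) -> bool:
        if self.target_type == "all":
            return True
        if self.target_type == "host":
            return kind == "interface" and host == self.host
        if self.target_type == "interface":
            return kind == "interface" and host == self.host and name == self.name
        if self.target_type == "unifi_device":
            return kind == "unifi_device" and name == self.name
        return False


def is_suppressed(rules: List[Suppression], kind: str, host: Optional[str],
                  name: Optional[str], now: datetime) -> bool:
    """True if any active rule covers the alert."""
    return any(r.active(now) and r.covers(kind, host, name) for r in rules)


now = datetime(2026, 3, 1, 12, 0)
rules = [
    Suppression("host", host="large1"),  # manual
    Suppression("interface", host="micro1", name="eno1",
                expires_at=now - timedelta(minutes=1)),  # timed, already expired
]
```

With these rules, any interface alert on large1 is suppressed, while the expired window on micro1 no longer hides anything.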


Configuration

config.json is shared by both processes:

Key                         Description
unifi.api_key               UniFi API key from controller
prometheus.url              Prometheus base URL
database.*                  MariaDB credentials
ticket_api.api_key          Tinker Tickets Bearer token
monitor.poll_interval       Seconds between checks (default: 120)
monitor.failure_threshold   Consecutive failures before ticketing (default: 2)
monitor.cluster_threshold   Hosts with failures to trigger cluster alert (default: 3)
monitor.ping_hosts          Hosts checked via ping (no node_exporter)
hosts                       Maps Prometheus instance labels to hostnames
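
As an illustration of how the monitor.* defaults could be applied, the sketch below merges a config.json document over the documented default values. It assumes the dotted keys in the table correspond to nested JSON objects. The function name and the example document are hypothetical, and the real loader may work differently.

```python
import json

# Defaults as documented in the table above.
MONITOR_DEFAULTS = {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
}


def load_monitor_settings(raw_json: str) -> dict:
    """Return the monitor section with defaults filled in (hypothetical helper)."""
    config = json.loads(raw_json)
    return {**MONITOR_DEFAULTS, **config.get("monitor", {})}


# Example document; the nesting is assumed, values come from this README.
example = """
{
  "prometheus": {"url": "http://10.10.10.48:9090"},
  "monitor": {"poll_interval": 60},
  "hosts": {"10.10.10.2:9100": "large1"}
}
"""

settings = load_monitor_settings(example)
```

Here poll_interval is overridden to 60 while the two thresholds fall back to their defaults.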

Deployment (LXC 157)

1. Database (MariaDB LXC 149 at 10.10.10.50)

CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;

Then import the schema:

mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql

2. LXC 157: Install dependencies

pip3 install -r requirements.txt

3. Deploy files

cp -r app.py db.py monitor.py config.json templates/ static/ /var/www/html/prod/

4. Configure secrets in config.json

  • database.password: set the gandalf DB password
  • ticket_api.api_key: copy from the Tinker Tickets admin panel

5. Install the monitor service

cp gandalf-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable gandalf-monitor
systemctl start gandalf-monitor

Update the existing gandalf.service to use a single worker:

ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app

6. Authelia rule

Add to /etc/authelia/configuration.yml access_control rules:

- domain: gandalf.lotusguild.org
  policy: one_factor
  subject:
    - group:admin

Restart Authelia: systemctl restart authelia

7. NPM proxy host

  • Domain: gandalf.lotusguild.org
  • Forward to: http://10.10.10.61:80 (nginx on LXC 157)
  • Enable Authelia forward auth
  • WebSockets: not required

Service Management

# Monitor daemon
systemctl status gandalf-monitor
journalctl -u gandalf-monitor -f

# Web server
systemctl status gandalf
journalctl -u gandalf -f

# Restart both after config/code changes
systemctl restart gandalf-monitor gandalf

Troubleshooting

Monitor not creating tickets

  • Check that ticket_api.api_key is set in config.json
  • Check journalctl -u gandalf-monitor for errors

Baseline re-initializing on every restart

  • interface_baseline is stored in the monitor_state DB table; it persists across restarts

Interface always showing as "initial_down"

  • That interface was down on the first poll after the monitor started
  • Tracking begins once the interface comes up; alternatively, update the baseline in the DB manually

Prometheus data missing for a host

  • Verify node_exporter is running: systemctl status prometheus-node-exporter
  • Check Prometheus targets: http://10.10.10.48:9090/targets
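
When checking for missing data, it can help to compare the instances Prometheus actually returned against the hosts mapping from config.json. The helper below is a hypothetical sketch that operates on an already-decoded instant-query response; it is not part of Gandalf.

```python
from typing import Dict, List


def missing_hosts(expected: Dict[str, str], payload: dict) -> List[str]:
    """Return hostnames whose instance label is absent from the query result.

    `expected` maps instance labels to hostnames, like the hosts config key.
    """
    seen = {sample["metric"].get("instance") for sample in payload["data"]["result"]}
    return sorted(name for instance, name in expected.items() if instance not in seen)


expected = {"10.10.10.2:9100": "large1", "10.10.10.11:9100": "storage-01"}

# Made-up response in which only large1 reported data.
payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno1"},
             "value": [1700000000.0, "1"]},
        ],
    },
}

absent = missing_hosts(expected, payload)
```

Any hostname in the returned list is a candidate for the node_exporter and targets checks above.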