GANDALF (Global Advanced Network Detection And Link Facilitator)

Because it shall not let problems pass!

Network monitoring dashboard for the LotusGuild Proxmox cluster. Deployed on LXC 157 (monitor-02 / 10.10.10.9), reachable at gandalf.lotusguild.org.


Architecture

Gandalf is two processes that share a MariaDB database:

| Process | Service | Role |
|---|---|---|
| app.py | gandalf.service | Flask web dashboard (gunicorn, port 8000) |
| monitor.py | gandalf-monitor.service | Background polling daemon |

[Prometheus :9090] ──▶
                        monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[UniFi Controller] ──▶

Data Sources

| Source | What it monitors |
|---|---|
| Prometheus (10.10.10.48:9090) | Physical NIC link state (node_network_up) for 6 Proxmox hosts |
| UniFi API (https://10.10.10.1) | Switch, AP, and gateway device status |
| Ping | pbs (10.10.10.3), which has no node_exporter |

Monitored Hosts (Prometheus / node_exporter)

| Host | Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |
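
As an illustration of how the instance labels above relate to hostnames, here is a minimal sketch that turns a Prometheus instant-query response for node_network_up into a per-host link-state map. The function name `parse_link_state`, the `HOSTS` dictionary, and the sample payload are hypothetical and are not taken from monitor.py; only the response shape follows the Prometheus HTTP API.

```python
# Hypothetical subset of the `hosts` mapping (instance label -> hostname).
HOSTS = {
    "10.10.10.2:9100": "large1",
    "10.10.10.8:9100": "micro1",
}


def parse_link_state(response: dict, hosts: dict) -> dict:
    """Convert an instant-query response into {hostname: {device: is_up}}."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    state = {}
    for sample in response["data"]["result"]:
        metric = sample["metric"]
        hostname = hosts.get(metric["instance"])
        if hostname is None:
            continue  # instance is not one of the monitored hosts
        # Instant-vector samples carry [timestamp, "value-as-string"].
        is_up = float(sample["value"][1]) == 1.0
        state.setdefault(hostname, {})[metric["device"]] = is_up
    return state


# Hand-written sample in the shape returned by /api/v1/query.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno1"},
             "value": [1700000000, "1"]},
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno2"},
             "value": [1700000000, "0"]},
            {"metric": {"instance": "10.10.10.99:9100", "device": "eth0"},
             "value": [1700000000, "1"]},
        ],
    },
}

link_state = parse_link_state(SAMPLE, HOSTS)
```

Samples from instances that are missing from the mapping are skipped rather than reported, which keeps unknown exporters from generating alerts in this sketch.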

Features

  • Interface monitoring: tracks link state for all physical NICs via Prometheus
  • UniFi device monitoring: detects offline switches, APs, and gateways
  • Ping reachability: covers hosts without node_exporter
  • Cluster-wide detection: creates a separate P1 ticket when 3+ hosts have simultaneous interface failures (likely a switch failure)
  • Smart baseline tracking: interfaces that are down on first observation (unused ports) are never alerted on; only regressions from UP→DOWN trigger tickets
  • Ticket creation: integrates with Tinker Tickets (t.lotusguild.org) with 24-hour deduplication
  • Alert suppression: manual toggle or timed windows (30min / 1hr / 4hr / 8hr / manual)
  • Authelia SSO: restricted to admin group via forward-auth headers
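
The smart baseline rule can be sketched roughly as follows. This is an assumption-laden illustration, not the code in monitor.py: the `classify` function, the in-memory `baseline` dictionary, and the "host/interface" key format are all hypothetical, and the real baseline lives in the database.

```python
def classify(baseline: dict, key: str, is_up: bool) -> str:
    """Classify one poll of one interface as 'ok', 'initial_down', or 'regression'.

    `baseline` maps an interface key to 'up' or 'initial_down' and is
    updated in place.
    """
    known = baseline.get(key)
    if known is None:
        # First observation: record what we saw and never alert on it.
        baseline[key] = "up" if is_up else "initial_down"
        return "ok" if is_up else "initial_down"
    if known == "initial_down":
        if is_up:
            baseline[key] = "up"  # start tracking once the port comes up
            return "ok"
        return "initial_down"
    # Interface has been seen up before, so a down state is a regression.
    return "ok" if is_up else "regression"


baseline = {}
first = classify(baseline, "large1/eno2", False)   # unused port on first poll
later = classify(baseline, "large1/eno2", True)    # port comes up later
dropped = classify(baseline, "large1/eno2", False) # now counts as UP->DOWN
```

Only the 'regression' result would feed into ticketing in this sketch; 'initial_down' is informational.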

Alert Logic

Ticket Triggers

| Condition | Priority |
|---|---|
| UniFi device offline (2+ consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (2+ consecutive checks) | P2 High |
| Host unreachable via ping (2+ consecutive checks) | P2 High |
| 3+ hosts simultaneously reporting interface failures | P1 Critical |
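
The two thresholds in the trigger table can be sketched as a consecutive-failure counter plus a distinct-host count. The class and function names below are hypothetical; the default values 2 and 3 mirror the documented monitor.failure_threshold and monitor.cluster_threshold defaults.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 2  # documented default for monitor.failure_threshold
CLUSTER_THRESHOLD = 3  # documented default for monitor.cluster_threshold


class FailureTracker:
    """Counts consecutive failed checks per target (hypothetical helper)."""

    def __init__(self, threshold: int = FAILURE_THRESHOLD):
        self.threshold = threshold
        self.counts = defaultdict(int)

    def record(self, target: str, failed: bool) -> bool:
        """Record one check; return True once the threshold is reached."""
        if failed:
            self.counts[target] += 1
        else:
            self.counts[target] = 0  # any success resets the streak
        return self.counts[target] >= self.threshold


def is_cluster_failure(hosts_with_failures, threshold: int = CLUSTER_THRESHOLD) -> bool:
    """True when enough distinct hosts report interface failures at once."""
    return len(set(hosts_with_failures)) >= threshold


tracker = FailureTracker()
first_miss = tracker.record("large1/eno1", failed=True)   # 1 of 2
second_miss = tracker.record("large1/eno1", failed=True)  # 2 of 2
```

In this sketch `record` keeps returning True on every later failed check, so something like the 24-hour ticket deduplication is still needed to avoid repeat tickets.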

Suppression Targets

| Type | Suppresses |
|---|---|
| host | All interface alerts for a named host |
| interface | A specific NIC on a specific host |
| unifi_device | A specific UniFi device |
| all | Everything (global maintenance mode) |

Suppressions can be manual (persist until removed) or timed (auto-expire).
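
A minimal sketch of how the four target types and the manual/timed distinction might be evaluated. The `Suppression` dataclass and its field names are assumptions made for illustration and do not reflect the actual database schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Suppression:
    target_type: str                       # 'host', 'interface', 'unifi_device', 'all'
    target_name: str = ""                  # host or UniFi device name
    target_detail: str = ""                # interface name for type 'interface'
    expires_at: Optional[datetime] = None  # None means manual (never expires)


def is_suppressed(rules, now, kind, name, detail="") -> bool:
    """Return True if an alert of `kind` for `name`/`detail` is suppressed."""
    for rule in rules:
        if rule.expires_at is not None and now >= rule.expires_at:
            continue  # timed suppression has expired
        if rule.target_type == "all":
            return True
        if kind == "interface" and rule.target_name == name:
            if rule.target_type == "host":
                return True
            if rule.target_type == "interface" and rule.target_detail == detail:
                return True
        if kind == "unifi_device" and rule.target_type == "unifi_device" \
                and rule.target_name == name:
            return True
    return False


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
rules = [
    Suppression("host", "large1"),  # manual, whole host
    Suppression("interface", "micro1", "eth0",
                expires_at=now - timedelta(minutes=1)),  # timed, already expired
]
```

With these rules, any interface alert on large1 is suppressed, while the expired rule for micro1/eth0 no longer has an effect.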


Configuration

config.json is shared by both processes:

| Key | Description |
|---|---|
| unifi.api_key | UniFi API key from controller |
| prometheus.url | Prometheus base URL |
| database.* | MariaDB credentials |
| ticket_api.api_key | Tinker Tickets Bearer token |
| monitor.poll_interval | Seconds between checks (default: 120) |
| monitor.failure_threshold | Consecutive failures before ticketing (default: 2) |
| monitor.cluster_threshold | Hosts with failures to trigger cluster alert (default: 3) |
| monitor.ping_hosts | Hosts checked via ping (no node_exporter) |
| hosts | Maps Prometheus instance labels to hostnames |
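
For orientation, here is a sketch of loading such a file and applying the documented monitor defaults. It assumes the dotted keys correspond to nested JSON objects; the exact file layout, the `load_config` helper, and the example values are illustrative guesses rather than the project's actual loader.

```python
import json

# Defaults as documented in the key table.
MONITOR_DEFAULTS = {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
}


def load_config(text: str) -> dict:
    """Parse config JSON and fill in missing monitor.* keys with defaults."""
    cfg = json.loads(text)
    monitor = dict(MONITOR_DEFAULTS)
    monitor.update(cfg.get("monitor", {}))
    cfg["monitor"] = monitor
    return cfg


# Assumed layout: dotted keys such as prometheus.url map to nested objects.
EXAMPLE = """
{
  "prometheus": {"url": "http://10.10.10.48:9090"},
  "monitor": {"poll_interval": 60},
  "hosts": {"10.10.10.2:9100": "large1"}
}
"""

config = load_config(EXAMPLE)
```

Here poll_interval is overridden to 60 while failure_threshold and cluster_threshold fall back to their defaults.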

Deployment (LXC 157)

1. Database (MariaDB LXC 149 at 10.10.10.50)

CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;

Then import the schema:

mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql

2. LXC 157: Install dependencies

pip3 install -r requirements.txt

3. Deploy files

cp -r app.py db.py monitor.py config.json templates/ static/ /var/www/html/prod/

4. Configure secrets in config.json

  • database.password: set the gandalf DB password
  • ticket_api.api_key: copy from the Tinker Tickets admin panel

5. Install the monitor service

cp gandalf-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable gandalf-monitor
systemctl start gandalf-monitor

Update the existing gandalf.service to use a single worker:

ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app

6. Authelia rule

Add to /etc/authelia/configuration.yml access_control rules:

- domain: gandalf.lotusguild.org
  policy: one_factor
  subject:
    - group:admin

Restart Authelia to apply the rule: systemctl restart authelia
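
On the application side, a forward-auth setup typically passes the authenticated user's groups in a request header. The sketch below assumes Authelia's Remote-Groups header with comma-separated group names reaches the app unchanged; whether app.py checks it this way is not documented here, and the `is_admin` helper is hypothetical.

```python
def is_admin(headers, required_group: str = "admin") -> bool:
    """Check a forwarded Remote-Groups header for the required group.

    `headers` is any mapping of header name to value, for example the
    request headers of a web framework.
    """
    raw = headers.get("Remote-Groups", "")
    groups = [g.strip() for g in raw.split(",") if g.strip()]
    return required_group in groups


allowed = is_admin({"Remote-User": "alice", "Remote-Groups": "dev, admin"})
denied = is_admin({"Remote-User": "bob", "Remote-Groups": "dev"})
```

A check like this is only meaningful if the app cannot be reached without passing through the proxy, since a direct client could otherwise set the header itself.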

7. NPM proxy host

  • Domain: gandalf.lotusguild.org
  • Forward to: http://10.10.10.61:80 (nginx on LXC 157)
  • Enable Authelia forward auth
  • WebSockets: not required

Service Management

# Monitor daemon
systemctl status gandalf-monitor
journalctl -u gandalf-monitor -f

# Web server
systemctl status gandalf
journalctl -u gandalf -f

# Restart both after config/code changes
systemctl restart gandalf-monitor gandalf

Troubleshooting

Monitor not creating tickets

  • Check that ticket_api.api_key is set in config.json
  • Check journalctl -u gandalf-monitor for errors
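
It may also help to remember the 24-hour deduplication listed under Features: a repeat of a recently ticketed alert would not be expected to produce a new ticket. How Gandalf implements this is not described here, so the following is only a sketch of the general idea, with a hypothetical in-memory `last_ticket_at` map and `should_create_ticket` helper.

```python
from datetime import datetime, timedelta, timezone

DEDUP_WINDOW = timedelta(hours=24)  # window stated in the feature list


def should_create_ticket(last_ticket_at: dict, alert_key: str, now: datetime) -> bool:
    """Return True and record the time unless a ticket exists within the window."""
    last = last_ticket_at.get(alert_key)
    if last is not None and now - last < DEDUP_WINDOW:
        return False  # duplicate within 24 hours, skip
    last_ticket_at[alert_key] = now
    return True


seen = {}
t0 = datetime(2026, 1, 1, 8, 0, tzinfo=timezone.utc)
created = should_create_ticket(seen, "large1/eno1", t0)
repeat = should_create_ticket(seen, "large1/eno1", t0 + timedelta(hours=6))
next_day = should_create_ticket(seen, "large1/eno1", t0 + timedelta(hours=25))
```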

Baseline re-initializing on every restart

  • interface_baseline is stored in the monitor_state DB table; it persists across restarts

Interface always showing as "initial_down"

  • That interface was down on the first poll after the monitor started
  • Tracking begins once the interface comes up; if needed, update the baseline manually in the DB

Prometheus data missing for a host

  • Verify node_exporter is running: systemctl status prometheus-node-exporter
  • Check Prometheus targets: http://10.10.10.48:9090/targets
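
The same information shown on the targets page is available as JSON from the Prometheus /api/v1/targets endpoint. The sketch below, with a hypothetical `unhealthy_targets` helper and a hand-written sample payload, picks out scrape targets that are not healthy; fetching the JSON is left out so the example runs offline.

```python
def unhealthy_targets(response: dict) -> dict:
    """Map instance label -> last scrape error for targets that are not 'up'."""
    result = {}
    for target in response["data"]["activeTargets"]:
        if target["health"] != "up":
            instance = target["labels"]["instance"]
            result[instance] = target.get("lastError", "")
    return result


# Hand-written sample in the shape returned by /api/v1/targets.
SAMPLE_TARGETS = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"instance": "10.10.10.2:9100", "job": "node"},
             "health": "up", "lastError": ""},
            {"labels": {"instance": "10.10.10.8:9100", "job": "node"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

problems = unhealthy_targets(SAMPLE_TARGETS)
```

An instance that appears here with a connection error usually points back to the first check, namely whether node_exporter is running on that host.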
