Jared Vititoe ff1edb5e0f chore: trigger deploy test

2026-03-02 11:58:42 -05:00

GANDALF (Global Advanced Network Detection And Link Facilitator)

Because it shall not let problems pass!

Network monitoring dashboard for the LotusGuild Proxmox cluster. Deployed on LXC 157 (monitor-02 / 10.10.10.9), reachable at gandalf.lotusguild.org.

Architecture

Gandalf is two processes that share a MariaDB database:

Process	Service	Role
`app.py`	`gandalf.service`	Flask web dashboard (gunicorn, port 8000)
`monitor.py`	`gandalf-monitor.service`	Background polling daemon

[Prometheus :9090] ──▶
                        monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[UniFi Controller] ──▶

Data Sources

Source	What it monitors
Prometheus (`10.10.10.48:9090`)	Physical NIC link state (`node_network_up`) for 6 Proxmox hosts
UniFi API (`https://10.10.10.1`)	Switch, AP, and gateway device status
Ping	Reachability of pbs (10.10.10.3), which has no node_exporter
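As an illustration of the Prometheus source, the sketch below turns an instant-query response for `node_network_up` into a per-host map of link states. It assumes the standard Prometheus HTTP API response shape for `/api/v1/query`; the function name `parse_link_states` and the sample interface names are hypothetical, and this is not necessarily how `monitor.py` does it.

```python
from collections import defaultdict


def parse_link_states(response: dict) -> dict:
    """Map instance -> {device: is_up} from a Prometheus instant-query response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    states = defaultdict(dict)
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        # Each sample value is [timestamp, "<value as string>"]; 1 means link up.
        _, raw_value = sample["value"]
        states[labels["instance"]][labels["device"]] = float(raw_value) == 1.0
    return dict(states)


# Hand-written sample in the shape returned by /api/v1/query?query=node_network_up
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno1"},
             "value": [1772470722.0, "1"]},
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno2"},
             "value": [1772470722.0, "0"]},
        ],
    },
}

link_states = parse_link_states(sample_response)
```

A real poller would fetch the response over HTTP and probably filter out virtual interfaces before comparing states; both steps are left out here.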

Monitored Hosts (Prometheus / node_exporter)

Host	Instance
large1	10.10.10.2:9100
compute-storage-01	10.10.10.4:9100
micro1	10.10.10.8:9100
monitor-02	10.10.10.9:9100
compute-storage-gpu-01	10.10.10.10:9100
storage-01	10.10.10.11:9100

Features

Interface monitoring – tracks link state for all physical NICs via Prometheus
UniFi device monitoring – detects offline switches, APs, and gateways
Ping reachability – covers hosts without node_exporter
Cluster-wide detection – creates a separate P1 ticket when 3+ hosts have simultaneous interface failures (likely a switch failure)
Smart baseline tracking – interfaces that are down on first observation (unused ports) are never alerted on; only regressions from UP→DOWN trigger tickets
Ticket creation – integrates with Tinker Tickets (t.lotusguild.org) with 24-hour deduplication
Alert suppression – manual toggle or timed windows (30 min / 1 hr / 4 hr / 8 hr)
Authelia SSO – restricted to admin group via forward-auth headers
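The smart baseline rule can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the actual `monitor.py` logic: the function name `classify_interface`, the `host:interface` key format, and the in-memory dict are hypothetical. Only the `initial_down` state name comes from this README.

```python
def classify_interface(baseline: dict, key: str, is_up: bool) -> str:
    """Return 'ok', 'initial_down', or 'regression' and update the baseline in place."""
    if key not in baseline:
        # First observation: a port that starts out down is treated as unused.
        baseline[key] = "up" if is_up else "initial_down"
        return "ok" if is_up else "initial_down"
    if baseline[key] == "initial_down":
        if is_up:
            # The port came up, so start tracking it from now on.
            baseline[key] = "up"
            return "ok"
        return "initial_down"
    # Baseline is 'up': only an UP -> DOWN change counts as a regression.
    return "ok" if is_up else "regression"


baseline = {}
history = [
    classify_interface(baseline, "large1:eno2", False),  # first seen down
    classify_interface(baseline, "large1:eno2", True),   # comes up, now tracked
    classify_interface(baseline, "large1:eno2", False),  # goes down again
]
```

Only the `regression` result would feed into ticketing; `initial_down` ports stay silent until they have been seen up at least once.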

Alert Logic

Ticket Triggers

Condition	Priority
UniFi device offline (2+ consecutive checks)	P2 High
Proxmox host NIC link-down regression (2+ consecutive checks)	P2 High
Host unreachable via ping (2+ consecutive checks)	P2 High
3+ hosts simultaneously reporting interface failures	P1 Critical
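The trigger table can be read as two checks: a per-target consecutive-failure counter and a cluster-wide host count. The sketch below assumes the documented defaults (2 and 3). The helper names are hypothetical, and it assumes the P1 ticket is created in addition to the per-host P2 tickets, which the README implies ("separate") but does not spell out.

```python
from collections import Counter

FAILURE_THRESHOLD = 2  # monitor.failure_threshold default
CLUSTER_THRESHOLD = 3  # monitor.cluster_threshold default


def update_failures(counts: Counter, key: str, failing: bool) -> bool:
    """Track consecutive failures; return True once the threshold is reached."""
    if failing:
        counts[key] += 1
    else:
        # Any successful check resets the streak.
        counts.pop(key, None)
    return counts[key] >= FAILURE_THRESHOLD


def plan_tickets(failing_hosts: set) -> list:
    """Return (priority, subject) pairs for the hosts currently over threshold."""
    tickets = [("P2", host) for host in sorted(failing_hosts)]
    if len(failing_hosts) >= CLUSTER_THRESHOLD:
        # Many hosts failing at once suggests a shared cause such as a switch.
        tickets.append(("P1", "cluster"))
    return tickets


counts = Counter()
first_check = update_failures(counts, "large1:eno1", True)
second_check = update_failures(counts, "large1:eno1", True)
cluster_plan = plan_tickets({"large1", "micro1", "storage-01"})
```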

Suppression Targets

Type	Suppresses
`host`	All interface alerts for a named host
`interface`	A specific NIC on a specific host
`unifi_device`	A specific UniFi device
`all`	Everything (global maintenance mode)

Suppressions can be manual (persist until removed) or timed (auto-expire).
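One way to evaluate these targets is to walk the active suppressions and skip any timed entry that has expired. The sketch below is an assumption about how matching could work, not the actual schema: the `Suppression` fields and the `is_suppressed` helper are hypothetical, and only the four target types come from the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Suppression:
    target_type: str                       # 'host', 'interface', 'unifi_device', 'all'
    target_name: str = ""                  # host or device name
    target_detail: str = ""                # interface name for type 'interface'
    expires_at: Optional[datetime] = None  # None means manual (never expires)


def is_suppressed(suppressions, kind, name, detail="", now=None) -> bool:
    """Return True if an alert of the given kind and target is currently suppressed."""
    now = now or datetime.now()
    for s in suppressions:
        if s.expires_at is not None and s.expires_at <= now:
            continue  # timed suppression has auto-expired
        if s.target_type == "all":
            return True
        if s.target_type == "host" and kind == "interface" and s.target_name == name:
            return True
        if (s.target_type == "interface" and kind == "interface"
                and (s.target_name, s.target_detail) == (name, detail)):
            return True
        if s.target_type == "unifi_device" and kind == "unifi_device" and s.target_name == name:
            return True
    return False


now = datetime(2026, 3, 2, 12, 0)
rules = [
    Suppression("host", "large1", expires_at=now + timedelta(hours=1)),  # timed, 1 hr
    Suppression("interface", "micro1", "eno2"),                          # manual
]
suppressed_now = is_suppressed(rules, "interface", "large1", "eno1", now=now)
suppressed_later = is_suppressed(rules, "interface", "large1", "eno1",
                                 now=now + timedelta(hours=2))
```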

Configuration

config.json – shared by both processes:

Key	Description
`unifi.api_key`	UniFi API key from controller
`prometheus.url`	Prometheus base URL
`database.*`	MariaDB credentials
`ticket_api.api_key`	Tinker Tickets Bearer token
`monitor.poll_interval`	Seconds between checks (default: 120)
`monitor.failure_threshold`	Consecutive failures before ticketing (default: 2)
`monitor.cluster_threshold`	Hosts with failures to trigger cluster alert (default: 3)
`monitor.ping_hosts`	Hosts checked via ping (no node_exporter)
`hosts`	Maps Prometheus instance labels to hostnames
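A minimal loader sketch shows how the documented defaults could be applied. It assumes the dotted keys in the table correspond to nested JSON objects; `load_config`, the sample values, and the placeholder secrets are illustrative only.

```python
import json

# Defaults taken from the table above.
MONITOR_DEFAULTS = {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
}

REQUIRED_SECTIONS = ("unifi", "prometheus", "database", "ticket_api")


def load_config(text: str) -> dict:
    """Parse config JSON, fill in monitor defaults, and check required sections."""
    config = json.loads(text)
    config["monitor"] = {**MONITOR_DEFAULTS, **config.get("monitor", {})}
    missing = [name for name in REQUIRED_SECTIONS if name not in config]
    if missing:
        raise KeyError(f"config.json is missing sections: {', '.join(missing)}")
    return config


# Sample content; the nesting and the placeholder values are assumptions.
sample_text = """
{
  "unifi": {"api_key": "your_unifi_key"},
  "prometheus": {"url": "http://10.10.10.48:9090"},
  "database": {"password": "your_password"},
  "ticket_api": {"api_key": "your_ticket_token"},
  "monitor": {"poll_interval": 60},
  "hosts": {"10.10.10.2:9100": "large1"}
}
"""

config = load_config(sample_text)
```

In a deployment the text would come from reading `config.json` on disk rather than from an inline string.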

Deployment (LXC 157)

1. Database (MariaDB LXC 149 at 10.10.10.50)

CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;

Then import the schema:

mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql

2. LXC 157 – Install dependencies

pip3 install -r requirements.txt

3. Deploy files

cp -r app.py db.py monitor.py config.json templates/ static/ /var/www/html/prod/

4. Configure secrets in `config.json`

database.password – set the gandalf DB password
ticket_api.api_key – copy from the Tinker Tickets admin panel

5. Install the monitor service

cp gandalf-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable gandalf-monitor
systemctl start gandalf-monitor

Update the existing gandalf.service to use a single worker:

ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app

6. Authelia rule

Add this rule to the access_control rules in /etc/authelia/configuration.yml:

- domain: gandalf.lotusguild.org
  policy: one_factor
  subject:
    - group:admin

Restart Authelia to apply the rule: systemctl restart authelia
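On the application side, a forward-auth setup usually means the app trusts identity headers injected by the proxy. The sketch below assumes Authelia's `Remote-Groups` header carries a comma-separated group list and that this deployment forwards it. The helper names are hypothetical, and how `app.py` actually reads the headers is not shown in this README.

```python
def user_groups(headers) -> set:
    """Extract group names from a forwarded Remote-Groups header."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("remote-groups", "")
    return {group.strip() for group in raw.split(",") if group.strip()}


def is_admin(headers, admin_group: str = "admin") -> bool:
    """Return True if the forwarded groups include the admin group."""
    return admin_group in user_groups(headers)


allowed = is_admin({"Remote-User": "jared", "Remote-Groups": "users, admin"})
denied = is_admin({"Remote-User": "guest", "Remote-Groups": "users"})
```

These headers are only trustworthy when the app is reachable solely through the proxy, which is consistent with gunicorn binding to 127.0.0.1 in step 5.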

7. NPM proxy host

Domain: gandalf.lotusguild.org
Forward to: http://10.10.10.61:80 (nginx on LXC 157)
Enable Authelia forward auth
WebSockets: not required

Service Management

# Monitor daemon
systemctl status gandalf-monitor
journalctl -u gandalf-monitor -f

# Web server
systemctl status gandalf
journalctl -u gandalf -f

# Restart both after config/code changes
systemctl restart gandalf-monitor gandalf

Troubleshooting

Monitor not creating tickets

Check that ticket_api.api_key is set in config.json
Check journalctl -u gandalf-monitor for errors

Baseline re-initializing on every restart

interface_baseline is stored in the monitor_state DB table and should persist across restarts

Interface always showing as "initial_down"

That interface was down on the first poll after the monitor started
Tracking begins once the interface comes up; alternatively, update the baseline in the DB manually

Prometheus data missing for a host

Verify node_exporter is running: systemctl status prometheus-node-exporter
Check Prometheus targets: http://10.10.10.48:9090/targets
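The targets page has a JSON counterpart at `/api/v1/targets`, which can be easier to script against. The sketch below assumes the standard Prometheus response shape (`data.activeTargets` entries with `health` and `labels`); the function name and the sample data are hypothetical.

```python
def unhealthy_targets(response: dict) -> list:
    """Return sorted instance labels of active targets whose health is not 'up'."""
    targets = response["data"]["activeTargets"]
    return sorted(
        target["labels"].get("instance", "unknown")
        for target in targets
        if target.get("health") != "up"
    )


# Hand-written sample in the shape returned by /api/v1/targets
sample_targets = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"instance": "10.10.10.2:9100", "job": "node"}, "health": "up"},
            {"labels": {"instance": "10.10.10.8:9100", "job": "node"}, "health": "down"},
        ]
    },
}

down_instances = unhealthy_targets(sample_targets)
```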
