# GANDALF (Global Advanced Network Detection And Link Facilitator)
> Because it shall not let problems pass!
Network monitoring dashboard for the LotusGuild Proxmox cluster.
Deployed on **LXC 157** (monitor-02 / 10.10.10.9), reachable at `gandalf.lotusguild.org`.
---
## Architecture
Gandalf is two processes that share a MariaDB database:
| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |
```
[Prometheus :9090] ──▶
                       monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[UniFi Controller] ──▶
```
### Data Sources
| Source | What it monitors |
|---|---|
| **Prometheus** (`10.10.10.48:9090`) | Physical NIC link state (`node_network_up`) for 6 Proxmox hosts |
| **UniFi API** (`https://10.10.10.1`) | Switch, AP, and gateway device status |
| **Ping** | pbs (10.10.10.3) — no node_exporter |
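
As an illustration of the Prometheus data source, the sketch below builds an instant-query URL for `node_network_up` and turns a response into a per-interface link-state map. It is not taken from `monitor.py`: the function names and the sample payload (including the `eno1`/`eno2` interface names) are assumptions, and only the standard Prometheus `/api/v1/query` response shape is relied on.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://10.10.10.48:9090"  # address from the table above


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_link_states(payload):
    """Map (instance, device) -> True if the link is up, from an instant-query response."""
    states = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        # Each sample value is [timestamp, "<value as string>"]; "1" means link up.
        states[(labels["instance"], labels["device"])] = sample["value"][1] == "1"
    return states


# Hand-written sample response for illustration; not real cluster data.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno1"},
             "value": [1700000000, "1"]},
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno2"},
             "value": [1700000000, "0"]},
        ],
    },
}

link_states = parse_link_states(sample_payload)
```

Fetching the URL (for example with `urllib.request.urlopen`) and filtering out virtual interfaces are left out; how the real daemon selects physical NICs is not described here.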
### Monitored Hosts (Prometheus / node_exporter)
| Host | Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |
---
## Features
- **Interface monitoring**: tracks link state for all physical NICs via Prometheus
- **UniFi device monitoring**: detects offline switches, APs, and gateways
- **Ping reachability**: covers hosts without node_exporter
- **Cluster-wide detection**: creates a separate P1 ticket when 3+ hosts have simultaneous interface failures (likely a switch failure)
- **Smart baseline tracking**: interfaces that are down on first observation (unused ports) are never alerted on; only regressions from UP→DOWN trigger tickets
- **Ticket creation**: integrates with Tinker Tickets (`t.lotusguild.org`) with 24-hour deduplication
- **Alert suppression**: manual toggle or timed windows (30min / 1hr / 4hr / 8hr / manual)
- **Authelia SSO**: restricted to `admin` group via forward-auth headers
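
The baseline rule in the list above can be sketched as a small state function. This is a minimal illustration, not the code in `monitor.py`: the `classify` helper, the in-memory `baseline` dict, and the `"up"` state name are assumptions (only `initial_down` appears in this README).

```python
def classify(baseline, key, is_up):
    """Classify one observation of an interface and update the baseline in place.

    baseline maps an interface key to "up" or "initial_down".
    Returns "ok", "initial_down", or "regression".
    """
    if key not in baseline:
        # First observation: a port that starts down is treated as unused.
        baseline[key] = "up" if is_up else "initial_down"
        return "ok" if is_up else "initial_down"
    if baseline[key] == "initial_down":
        if is_up:
            baseline[key] = "up"  # the port came up, so it is tracked from now on
            return "ok"
        return "initial_down"  # still down since first observation: never alerted
    # Baseline is "up": a down reading is an UP -> DOWN regression.
    return "ok" if is_up else "regression"


baseline = {}
key = ("large1", "eno2")  # hypothetical interface name
history = [classify(baseline, key, up) for up in (False, False, True, False)]
```

Here the interface is ignored while it has never been seen up, and the final DOWN reading is reported as a regression because the interface had come up in between.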
---
## Alert Logic
### Ticket Triggers
| Condition | Priority |
|---|---|
| UniFi device offline (2+ consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (2+ consecutive checks) | P2 High |
| Host unreachable via ping (2+ consecutive checks) | P2 High |
| 3+ hosts simultaneously reporting interface failures | P1 Critical |
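
The triggers in the table can be sketched as a consecutive-failure counter plus a cluster check. This is a hedged illustration only: the function names and the ticket tuples are hypothetical, and it assumes the cluster alert counts hosts whose failures have already reached the failure threshold, which the README does not specify. Ticket deduplication is omitted.

```python
FAILURE_THRESHOLD = 2  # consecutive failed checks before a P2 ticket
CLUSTER_THRESHOLD = 3  # hosts with failures before a P1 cluster ticket


def update_counts(counts, failing):
    """Increment the counter of every failing (host, interface); reset the others."""
    for key in list(counts):
        if key not in failing:
            del counts[key]  # a successful check breaks the consecutive run
    for key in failing:
        counts[key] = counts.get(key, 0) + 1


def decide_tickets(counts):
    """Return the tickets implied by the current consecutive-failure counters."""
    confirmed = sorted(k for k, n in counts.items() if n >= FAILURE_THRESHOLD)
    tickets = [("P2", host, iface) for host, iface in confirmed]
    if len({host for host, _ in confirmed}) >= CLUSTER_THRESHOLD:
        # Many hosts failing at once most likely points at a shared switch.
        tickets.append(("P1", "cluster", None))
    return tickets


counts = {}
polls = [
    {("large1", "eno1")},
    {("large1", "eno1")},
    {("large1", "eno1"), ("micro1", "eno1"), ("storage-01", "eno1")},
    {("large1", "eno1"), ("micro1", "eno1"), ("storage-01", "eno1")},
]
results = []
for failing in polls:
    update_counts(counts, failing)
    results.append(decide_tickets(counts))
```

In this run the first poll produces no ticket, the second confirms a P2 for `large1`, and the P1 cluster ticket only appears on the fourth poll, once three hosts have each failed twice in a row.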
### Suppression Targets
| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |
Suppressions can be manual (persist until removed) or timed (auto-expire).
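
One way to evaluate these targets is sketched below. The dictionary shapes for alerts and suppressions (`kind`, `host`, `interface`, `device`, `expires_at`) are assumptions made for the example, not the actual database schema; a manual suppression is modelled as `expires_at = None`.

```python
from datetime import datetime, timedelta, timezone


def matches(suppression, alert):
    """Return True if the suppression's target covers the alert."""
    kind = suppression["type"]
    if kind == "all":
        return True
    if kind == "host":
        return alert["kind"] == "interface" and alert["host"] == suppression["host"]
    if kind == "interface":
        return (alert["kind"] == "interface"
                and alert["host"] == suppression["host"]
                and alert["interface"] == suppression["interface"])
    if kind == "unifi_device":
        return alert["kind"] == "unifi_device" and alert["device"] == suppression["device"]
    return False


def is_suppressed(suppressions, alert, now):
    """Return True if any unexpired suppression covers the alert."""
    return any(
        (s["expires_at"] is None or s["expires_at"] > now) and matches(s, alert)
        for s in suppressions
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
suppressions = [
    # Timed 30-minute window on one host, still active at `now`.
    {"type": "host", "host": "large1", "expires_at": now + timedelta(minutes=30)},
    # Expired window on a UniFi device (hypothetical device name).
    {"type": "unifi_device", "device": "switch-a", "expires_at": now - timedelta(minutes=1)},
]
nic_alert = {"kind": "interface", "host": "large1", "interface": "eno1"}
device_alert = {"kind": "unifi_device", "device": "switch-a"}
```

With this data the NIC alert on `large1` is suppressed, while the UniFi alert is not, because its window has already expired.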
---
## Configuration
**`config.json`** is shared by both processes:
| Key | Description |
|---|---|
| `unifi.api_key` | UniFi API key from controller |
| `prometheus.url` | Prometheus base URL |
| `database.*` | MariaDB credentials |
| `ticket_api.api_key` | Tinker Tickets Bearer token |
| `monitor.poll_interval` | Seconds between checks (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before ticketing (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures to trigger cluster alert (default: 3) |
| `monitor.ping_hosts` | Hosts checked via ping (no node_exporter) |
| `hosts` | Maps Prometheus instance labels to hostnames |
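
The sketch below shows how the `monitor.*` defaults from the table might be applied when loading the file. It assumes the dotted keys map to nested JSON objects (so `monitor.poll_interval` lives under a `monitor` object); the helper name and the sample document are illustrative, not the project's actual loader.

```python
import json

# Defaults taken from the table above.
MONITOR_DEFAULTS = {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
}


def load_monitor_settings(raw_json):
    """Parse config JSON text and fill in any missing monitor.* keys with defaults."""
    config = json.loads(raw_json)
    return {**MONITOR_DEFAULTS, **config.get("monitor", {})}


# Example document overriding only the poll interval.
example = '{"prometheus": {"url": "http://10.10.10.48:9090"}, "monitor": {"poll_interval": 60}}'
settings = load_monitor_settings(example)
```

Keys set in the file win over the defaults, so this example polls every 60 seconds while keeping the default thresholds.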
---
## Deployment (LXC 157)
### 1. Database (MariaDB LXC 149 at 10.10.10.50)
```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```
Then import the schema:
```bash
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```
### 2. LXC 157: Install dependencies
```bash
pip3 install -r requirements.txt
```
### 3. Deploy files
```bash
# -r is required because templates/ and static/ are directories
cp -r app.py db.py monitor.py config.json templates/ static/ /var/www/html/prod/
```
### 4. Configure secrets in `config.json`
- `database.password`: set the gandalf DB password
- `ticket_api.api_key`: copy from the Tinker Tickets admin panel
### 5. Install the monitor service
```bash
cp gandalf-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable gandalf-monitor
systemctl start gandalf-monitor
```
Update the existing `gandalf.service` to use a single worker:
```
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
```
### 6. Authelia rule
Add to `/etc/authelia/configuration.yml` access_control rules:
```yaml
- domain: gandalf.lotusguild.org
  policy: one_factor
  subject:
    - group:admin
```
Restart Authelia to apply the rule: `systemctl restart authelia`
### 7. NPM proxy host
- Domain: `gandalf.lotusguild.org`
- Forward to: `http://10.10.10.61:80` (nginx on LXC 157)
- Enable Authelia forward auth
- WebSockets: **not required**
---
## Service Management
```bash
# Monitor daemon
systemctl status gandalf-monitor
journalctl -u gandalf-monitor -f
# Web server
systemctl status gandalf
journalctl -u gandalf -f
# Restart both after config/code changes
systemctl restart gandalf-monitor gandalf
```
---
## Troubleshooting
**Monitor not creating tickets**
- Check that `ticket_api.api_key` is set in `config.json`
- Check `journalctl -u gandalf-monitor` for errors
**Baseline re-initializing on every restart**
- `interface_baseline` is stored in the `monitor_state` DB table and should persist across restarts; if it does not, check that the monitor can read and write that table
**Interface always showing as "initial_down"**
- That interface was down on the first poll after the monitor started
- Tracking begins once the interface comes up; alternatively, update the baseline in the DB manually
**Prometheus data missing for a host**
- Verify node_exporter is running: `systemctl status prometheus-node-exporter`
- Check Prometheus targets: `http://10.10.10.48:9090/targets`
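
The targets check can also be scripted against the Prometheus HTTP API (`/api/v1/targets`). The helper below is a sketch that only parses an already-fetched response; its name and the sample payload are assumptions for illustration.

```python
def unhealthy_targets(payload):
    """Return the scrape URLs of active targets whose health is not "up"."""
    return sorted(
        target["scrapeUrl"]
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    )


# Hand-written sample of a /api/v1/targets response; not real cluster data.
sample_targets = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://10.10.10.2:9100/metrics", "health": "up"},
            {"scrapeUrl": "http://10.10.10.8:9100/metrics", "health": "down"},
        ],
        "droppedTargets": [],
    },
}

down = unhealthy_targets(sample_targets)
```

A host that is missing from `activeTargets` entirely would not show up here, so an empty result does not prove that every host is being scraped.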
# auto-deploy: webhook + deploy script