# GANDALF (Global Advanced Network Detection And Link Facilitator)

> Because it shall not let problems pass!

Network monitoring dashboard for the LotusGuild Proxmox cluster.
Deployed on **LXC 157** (monitor-02 / 10.10.10.9), reachable at `gandalf.lotusguild.org`.

---

## Architecture

Gandalf is two processes that share a MariaDB database:

| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |

```
[Prometheus :9090] ──▶
                        monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[UniFi Controller] ──▶
```

### Data Sources

| Source | What it monitors |
|---|---|
| **Prometheus** (`10.10.10.48:9090`) | Physical NIC link state (`node_network_up`) for 6 Proxmox hosts |
| **UniFi API** (`https://10.10.10.1`) | Switch, AP, and gateway device status |
| **Ping** | pbs (10.10.10.3) — no node_exporter |

### Monitored Hosts (Prometheus / node_exporter)

| Host | Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |
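
As a rough illustration of how link state could be read from Prometheus, the sketch below parses the JSON returned by an instant query (`GET /api/v1/query?query=node_network_up`) and maps each `instance` label to a hostname from the table above. The function name `parse_link_states` is hypothetical and not taken from `monitor.py`; the sketch also does not filter out virtual interfaces, which the real monitor presumably does.

```python
# Instance-to-hostname map, copied from the table above.
HOSTS = {
    "10.10.10.2:9100": "large1",
    "10.10.10.4:9100": "compute-storage-01",
    "10.10.10.8:9100": "micro1",
    "10.10.10.9:9100": "monitor-02",
    "10.10.10.10:9100": "compute-storage-gpu-01",
    "10.10.10.11:9100": "storage-01",
}


def parse_link_states(payload: dict, hosts: dict) -> dict:
    """Turn a Prometheus instant-query response into {hostname: {device: is_up}}.

    Hypothetical helper: the real monitor may structure this differently.
    """
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    states = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        host = hosts.get(labels.get("instance", ""))
        if host is None:
            continue  # instance is not one of the monitored hosts
        # sample["value"] is [unix_timestamp, "<sample value as a string>"]
        states.setdefault(host, {})[labels["device"]] = sample["value"][1] == "1"
    return states
```

The instance-to-hostname map plays the role of the `hosts` key in `config.json`.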

---

## Features

- **Interface monitoring** – tracks link state for all physical NICs via Prometheus
- **UniFi device monitoring** – detects offline switches, APs, and gateways
- **Ping reachability** – covers hosts without node_exporter
- **Cluster-wide detection** – creates a separate P1 ticket when 3+ hosts have simultaneous interface failures (likely a switch failure)
- **Smart baseline tracking** – interfaces that are down on first observation (unused ports) are never alerted on; only regressions from UP→DOWN trigger tickets
- **Ticket creation** – integrates with Tinker Tickets (`t.lotusguild.org`) with 24-hour deduplication
- **Alert suppression** – manual toggle or timed windows (30 min / 1 hr / 4 hr / 8 hr)
- **Authelia SSO** – restricted to `admin` group via forward-auth headers

---

## Alert Logic

### Ticket Triggers

| Condition | Priority |
|---|---|
| UniFi device offline (2+ consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (2+ consecutive checks) | P2 High |
| Host unreachable via ping (2+ consecutive checks) | P2 High |
| 3+ hosts simultaneously reporting interface failures | P1 Critical |
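
The trigger rules above, combined with the baseline behaviour described under Features, could be sketched roughly as follows. This is an illustration under assumptions, not the code in `monitor.py`: `LinkTracker` and `is_cluster_event` are hypothetical names, and the thresholds are the documented defaults.

```python
from dataclasses import dataclass, field

FAILURE_THRESHOLD = 2  # consecutive failed checks before ticketing
CLUSTER_THRESHOLD = 3  # hosts with failures that indicate a cluster-wide event


@dataclass
class LinkTracker:
    baseline: dict = field(default_factory=dict)  # (host, iface) -> "up" | "initial_down"
    failures: dict = field(default_factory=dict)  # (host, iface) -> consecutive down count

    def observe(self, host: str, iface: str, is_up: bool) -> bool:
        """Record one poll; return True once the interface should raise a ticket."""
        key = (host, iface)
        if key not in self.baseline:
            # First observation decides whether the port is considered in use.
            self.baseline[key] = "up" if is_up else "initial_down"
        if is_up:
            self.baseline[key] = "up"  # a port that comes up is tracked from now on
            self.failures[key] = 0
            return False
        if self.baseline[key] == "initial_down":
            return False  # never seen up, so not a regression
        self.failures[key] = self.failures.get(key, 0) + 1
        return self.failures[key] >= FAILURE_THRESHOLD


def is_cluster_event(alerting_hosts: set) -> bool:
    """True when enough hosts alert at once to warrant the separate P1 ticket."""
    return len(alerting_hosts) >= CLUSTER_THRESHOLD
```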

### Suppression Targets

| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |

Suppressions can be manual (persist until removed) or timed (auto-expire).
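
A minimal sketch of how these target types might be matched against an alert is shown below. The `Suppression` fields and the `is_suppressed` function are assumptions made for illustration; the actual schema in `schema.sql` may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suppression:
    target_type: str                    # "host", "interface", "unifi_device", or "all"
    target_name: str = ""               # host name or UniFi device name
    target_detail: str = ""             # interface name when target_type is "interface"
    expires_at: Optional[float] = None  # None means manual (persists until removed)

    def is_active(self, now: float) -> bool:
        return self.expires_at is None or now < self.expires_at


def is_suppressed(rules, now: float, kind: str, name: str, detail: str = "") -> bool:
    """kind is "interface" or "unifi_device"; name is the host or device name."""
    for rule in rules:
        if not rule.is_active(now):
            continue  # timed suppression has expired
        if rule.target_type == "all":
            return True
        if kind == "interface" and rule.target_name == name:
            if rule.target_type == "host":
                return True
            if rule.target_type == "interface" and rule.target_detail == detail:
                return True
        if kind == "unifi_device" and rule.target_type == "unifi_device" and rule.target_name == name:
            return True
    return False
```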

---

## Configuration

**`config.json`** – shared by both processes:

| Key | Description |
|---|---|
| `unifi.api_key` | UniFi API key from controller |
| `prometheus.url` | Prometheus base URL |
| `database.*` | MariaDB credentials |
| `ticket_api.api_key` | Tinker Tickets Bearer token |
| `monitor.poll_interval` | Seconds between checks (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before ticketing (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures to trigger cluster alert (default: 3) |
| `monitor.ping_hosts` | Hosts checked via ping (no node_exporter) |
| `hosts` | Maps Prometheus instance labels to hostnames |
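
For illustration, the documented defaults could be applied when loading the file along these lines. The sketch assumes the dotted keys in the table correspond to nested JSON objects (for example, `monitor.poll_interval` living under a top-level `"monitor"` object); `load_monitor_settings` is a hypothetical helper, not a function from the codebase.

```python
import json

# Defaults as documented in the table above.
MONITOR_DEFAULTS = {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
}


def load_monitor_settings(raw_json: str) -> dict:
    """Merge the "monitor" section of config.json over the documented defaults."""
    config = json.loads(raw_json)
    settings = dict(MONITOR_DEFAULTS)
    settings.update(config.get("monitor", {}))
    return settings
```

In practice the argument would be the contents of `config.json`, for example `load_monitor_settings(open("config.json").read())`.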

---

## Deployment (LXC 157)

### 1. Database (MariaDB LXC 149 at 10.10.10.50)

```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```

Then import the schema:
```bash
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```

### 2. LXC 157 – Install dependencies

```bash
pip3 install -r requirements.txt
```

### 3. Deploy files

```bash
# -r is required so the templates/ and static/ directories are copied too
cp -r app.py db.py monitor.py config.json templates/ static/ /var/www/html/prod/
```

### 4. Configure secrets in `config.json`

- `database.password` – set the gandalf DB password
- `ticket_api.api_key` – copy from tinker tickets admin panel

### 5. Install the monitor service

```bash
cp gandalf-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable gandalf-monitor
systemctl start gandalf-monitor
```

Update the existing `gandalf.service` to use a single worker:
```
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
```

### 6. Authelia rule

Add to `/etc/authelia/configuration.yml` access_control rules:
```yaml
- domain: gandalf.lotusguild.org
  policy: one_factor
  subject:
    - group:admin
```

Restart Authelia: `systemctl restart authelia`

### 7. NPM proxy host

- Domain: `gandalf.lotusguild.org`
- Forward to: `http://10.10.10.61:80` (nginx on LXC 157)
- Enable Authelia forward auth
- WebSockets: **not required**

---

## Service Management

```bash
# Monitor daemon
systemctl status gandalf-monitor
journalctl -u gandalf-monitor -f

# Web server
systemctl status gandalf
journalctl -u gandalf -f

# Restart both after config/code changes
systemctl restart gandalf-monitor gandalf
```

---

## Troubleshooting

**Monitor not creating tickets**
- Check `config.json` → `ticket_api.api_key` is set
- Check `journalctl -u gandalf-monitor` for errors

**Baseline re-initializing on every restart**
- `interface_baseline` is stored in the `monitor_state` DB table and should persist across restarts
- If it does not, check the `database.*` settings in `config.json` and look for database errors in `journalctl -u gandalf-monitor`

**Interface always showing as "initial_down"**
- That interface was down on the first poll after the monitor started
- Tracking begins once the interface comes up; if needed, update the baseline manually in the database

**Prometheus data missing for a host**
- Verify node_exporter is running: `systemctl status prometheus-node-exporter`
- Check Prometheus targets: `http://10.10.10.48:9090/targets`
# 2026-03-02T16:58:42Z deploy test
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
+								# GANDALF (Global Advanced Network Detection And Link Facilitator)
 								> Because it shall not let problems pass!
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								Network monitoring dashboard for the LotusGuild Proxmox cluster.
 								Deployed on **LXC 157** (monitor-02 / 10.10.10.9), reachable at `gandalf.lotusguild.org`.
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								---
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								## Architecture
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								Gandalf is two processes that share a MariaDB database:
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								| Process | Service | Role |
 								|---|---|---|
 								| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
 								| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								```
 								[Prometheus :9090] ──▶
 								                        monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
 								[UniFi Controller] ──▶
 								```
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								### Data Sources
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								| Source | What it monitors |
 								|---|---|
-												docs: update README for storage-01 Prometheus migration

- storage-01 now monitored via Prometheus node_exporter (10.10.10.11:9100),
  removed from ping_hosts
- Updated data sources table (6 hosts via Prometheus, pbs only via ping)
- Added storage-01 to monitored hosts table
- Fixed Authelia reload command (restart, not reload)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:05:27 -05:00
+								| **Prometheus** (`10.10.10.48:9090`) | Physical NIC link state (`node_network_up`) for 6 Proxmox hosts |
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								| **UniFi API** (`https://10.10.10.1`) | Switch, AP, and gateway device status |
-												docs: update README for storage-01 Prometheus migration

- storage-01 now monitored via Prometheus node_exporter (10.10.10.11:9100),
  removed from ping_hosts
- Updated data sources table (6 hosts via Prometheus, pbs only via ping)
- Added storage-01 to monitored hosts table
- Fixed Authelia reload command (restart, not reload)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:05:27 -05:00
+								| **Ping** | pbs (10.10.10.3) — no node_exporter |
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								### Monitored Hosts (Prometheus / node_exporter)
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								| Host | Instance |
 								|---|---|
 								| large1 | 10.10.10.2:9100 |
 								| compute-storage-01 | 10.10.10.4:9100 |
 								| micro1 | 10.10.10.8:9100 |
 								| monitor-02 | 10.10.10.9:9100 |
 								| compute-storage-gpu-01 | 10.10.10.10:9100 |
-												docs: update README for storage-01 Prometheus migration

- storage-01 now monitored via Prometheus node_exporter (10.10.10.11:9100),
  removed from ping_hosts
- Updated data sources table (6 hosts via Prometheus, pbs only via ping)
- Added storage-01 to monitored hosts table
- Fixed Authelia reload command (restart, not reload)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:05:27 -05:00
+								| storage-01 | 10.10.10.11:9100 |
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								---
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								## Features
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								- **Interface monitoring** – tracks link state for all physical NICs via Prometheus
 								- **UniFi device monitoring** – detects offline switches, APs, and gateways
 								- **Ping reachability** – covers hosts without node_exporter
 								- **Cluster-wide detection** – creates a separate P1 ticket when 3+ hosts have simultaneous interface failures (likely a switch failure)
 								- **Smart baseline tracking** – interfaces that are down on first observation (unused ports) are never alerted on; only regressions from UP→DOWN trigger tickets
 								- **Ticket creation** – integrates with Tinker Tickets (`t.lotusguild.org`) with 24-hour deduplication
 								- **Alert suppression** – manual toggle or timed windows (30min / 1hr / 4hr / 8hr / manual)
 								- **Authelia SSO** – restricted to `admin` group via forward-auth headers
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								---
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								## Alert Logic
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								### Ticket Triggers

| Condition | Priority |
|---|---|
| UniFi device offline (2+ consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (2+ consecutive checks) | P2 High |
| Host unreachable via ping (2+ consecutive checks) | P2 High |
| 3+ hosts simultaneously reporting interface failures | P1 Critical |
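
The trigger rules in the table can be sketched roughly as follows. This is a minimal illustration, not the code in `monitor.py`: the function name `evaluate`, the `(host, target)` keys, and the ticket tuples are all hypothetical, and the interface name `eno1` is only an example.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 2   # consecutive failed checks before a P2 ticket
CLUSTER_THRESHOLD = 3   # hosts failing at once before a P1 ticket


def evaluate(checks, counters):
    """Update per-target failure counters and return the tickets to open.

    checks maps (host, target) -> True if healthy, False if failing.
    counters maps the same keys to the current run of consecutive failures.
    """
    tickets = []
    failing_hosts = set()
    for key, healthy in checks.items():
        if healthy:
            counters[key] = 0          # any success resets the run
            continue
        counters[key] += 1
        if counters[key] >= FAILURE_THRESHOLD:
            host, target = key
            failing_hosts.add(host)
            tickets.append(("P2", f"{host}: {target} down"))
    # Many hosts failing in the same cycle points at shared infrastructure.
    if len(failing_hosts) >= CLUSTER_THRESHOLD:
        tickets.append(("P1", f"{len(failing_hosts)} hosts failing simultaneously"))
    return tickets


counters = defaultdict(int)
cycle = {("large1", "eno1"): False, ("micro1", "eno1"): True}
print(evaluate(cycle, counters))  # first failure: below threshold, no ticket
print(evaluate(cycle, counters))  # second consecutive failure: one P2 ticket
```

The sketch returns a ticket on every cycle in which the failure persists; in Gandalf, repeats are absorbed by the 24-hour ticket deduplication rather than by this logic.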

### Suppression Targets

| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |

Suppressions can be manual (persist until removed) or timed (auto-expire).
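
One way to picture how the four target types and the manual/timed distinction interact is the sketch below. The `Suppression` class, its field names, and `is_suppressed` are assumptions made for illustration; the actual schema and matching code may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Suppression:
    target_type: str                        # "host", "interface", "unifi_device" or "all"
    target_name: str = ""                   # host name or UniFi device name
    target_detail: str = ""                 # interface name, for type "interface"
    expires_at: Optional[datetime] = None   # None = manual, persists until removed

    def is_active(self, now):
        return self.expires_at is None or now < self.expires_at

    def covers(self, kind, name, detail=""):
        if self.target_type == "all":
            return True
        if self.target_type == "host":
            # A host suppression covers every interface alert on that host.
            return kind == "interface" and name == self.target_name
        if self.target_type == "interface":
            return (kind == "interface" and name == self.target_name
                    and detail == self.target_detail)
        if self.target_type == "unifi_device":
            return kind == "unifi_device" and name == self.target_name
        return False


def is_suppressed(suppressions, kind, name, detail="", now=None):
    now = now or datetime.now()
    return any(s.is_active(now) and s.covers(kind, name, detail) for s in suppressions)


now = datetime(2026, 3, 1, 12, 0)
rules = [
    Suppression("host", "large1"),                                    # manual
    Suppression("interface", "micro1", "eno1",
                expires_at=now + timedelta(minutes=30)),              # timed, 30 min
]
print(is_suppressed(rules, "interface", "large1", "eno2", now=now))   # True
```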

---

## Configuration

**`config.json`** – shared by both processes:

| Key | Description |
|---|---|
| `unifi.api_key` | UniFi API key from controller |
| `prometheus.url` | Prometheus base URL |
| `database.*` | MariaDB credentials |
| `ticket_api.api_key` | Tinker Tickets Bearer token |
| `monitor.poll_interval` | Seconds between checks (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before ticketing (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures to trigger cluster alert (default: 3) |
| `monitor.ping_hosts` | Hosts checked via ping (no node_exporter) |
| `hosts` | Maps Prometheus instance labels to hostnames |
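
As a rough illustration of how the documented defaults could be applied, the sketch below merges the `monitor` section over a defaults dictionary. It assumes the dotted keys in the table correspond to nested JSON objects (for example `{"monitor": {"poll_interval": 60}}`); the helper name `load_monitor_settings` is hypothetical.

```python
import json

# Defaults as documented in the table above.
MONITOR_DEFAULTS = {
    "poll_interval": 120,      # seconds between checks
    "failure_threshold": 2,    # consecutive failures before ticketing
    "cluster_threshold": 3,    # hosts with failures to trigger the cluster alert
}


def load_monitor_settings(raw_json):
    """Return the monitor settings with defaults filled in for missing keys."""
    config = json.loads(raw_json)
    return {**MONITOR_DEFAULTS, **config.get("monitor", {})}


settings = load_monitor_settings('{"monitor": {"poll_interval": 60}}')
print(settings["poll_interval"], settings["failure_threshold"])
```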

---

## Deployment (LXC 157)

### 1. Database (MariaDB LXC 149 at 10.10.10.50)

```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```

Then import the schema:

```bash
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```

### 2. LXC 157 – Install dependencies

```bash
pip3 install -r requirements.txt
```

### 3. Deploy files

```bash
# -r is required so the templates/ and static/ directories are copied too
cp -r app.py db.py monitor.py config.json templates/ static/ /var/www/html/prod/
```

### 4. Configure secrets in `config.json`

- `database.password` – set the gandalf DB password
- `ticket_api.api_key` – copy from the Tinker Tickets admin panel

### 5. Install the monitor service

```bash
cp gandalf-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable gandalf-monitor
systemctl start gandalf-monitor
```

Update the existing `gandalf.service` to use a single worker:

```
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
```

### 6. Authelia rule

Add to the `access_control` rules in `/etc/authelia/configuration.yml`:

```yaml
- domain: gandalf.lotusguild.org
  policy: one_factor
  subject:
    - group:admin
```

Restart Authelia to apply the rule: `systemctl restart authelia`

### 7. NPM proxy host

- Domain: `gandalf.lotusguild.org`
- Forward to: `http://10.10.10.61:80` (nginx on LXC 157)
- Enable Authelia forward auth
- WebSockets: **not required**

---

## Service Management

```bash
# Monitor daemon
systemctl status gandalf-monitor
journalctl -u gandalf-monitor -f

# Web server
systemctl status gandalf
journalctl -u gandalf -f

# Restart both after config/code changes
systemctl restart gandalf-monitor gandalf
```

---

## Troubleshooting

**Monitor not creating tickets**

- Check `config.json` → `ticket_api.api_key` is set
- Check `journalctl -u gandalf-monitor` for errors

**Baseline re-initializing on every restart**

- `interface_baseline` is stored in the `monitor_state` DB table; it persists across restarts

**Interface always showing as "initial_down"**

- That interface was down on the first poll after the monitor started
- It will begin tracking once it comes up; or manually update the baseline in DB if needed
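
The `initial_down` behaviour described above can be pictured with the sketch below. It is an assumption-laden illustration: the function name `classify` and the in-memory dictionary are hypothetical stand-ins for whatever `monitor.py` persists in the `monitor_state` table, and the interface names are examples.

```python
def classify(baseline, key, is_up):
    """Return the state for one interface and update the baseline in place.

    baseline maps (host, interface) -> "up" or "initial_down".
    """
    if key not in baseline:
        # First observation: a port that is down now is treated as unused.
        baseline[key] = "up" if is_up else "initial_down"
        return baseline[key]
    if is_up:
        baseline[key] = "up"       # an initial_down port that comes up is tracked from here on
        return "up"
    if baseline[key] == "initial_down":
        return "initial_down"      # never seen up, so this is not a regression
    return "regression"            # was up before, is down now: UP -> DOWN


baseline = {}
for state in (False, False, True, False):
    print(classify(baseline, ("large1", "eno2"), state))
```

Only the `regression` result would count towards the failure threshold; `initial_down` interfaces stay silent until they have been seen up at least once.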

**Prometheus data missing for a host**

- Verify node_exporter is running: `systemctl status prometheus-node-exporter`
- Check Prometheus targets: `http://10.10.10.48:9090/targets`
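
To check from a shell what Prometheus currently reports for one host, a small script along these lines may help. It uses the standard Prometheus instant-query endpoint (`/api/v1/query`) and the `device` label that node_exporter attaches to `node_network_up`; the helper names are hypothetical and this is not taken from `monitor.py`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://10.10.10.48:9090"


def build_query_url(instance, base_url=PROMETHEUS_URL):
    """Build the instant-query URL for node_network_up on one instance."""
    query = f'node_network_up{{instance="{instance}"}}'
    return f"{base_url}/api/v1/query?{urlencode({'query': query})}"


def parse_link_states(payload):
    """Map device name -> True (up) / False (down) from an instant-query response."""
    return {
        sample["metric"].get("device", "?"): sample["value"][1] == "1"
        for sample in payload["data"]["result"]
    }


def fetch_link_states(instance):
    """Query Prometheus over HTTP; requires network access to the server."""
    with urlopen(build_query_url(instance), timeout=10) as resp:
        return parse_link_states(json.load(resp))


print(build_query_url("10.10.10.2:9100"))
```

An empty dictionary from `fetch_link_states("10.10.10.2:9100")` would suggest that Prometheus has no current samples for that instance, which points back at node_exporter or the scrape target.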