# GANDALF (Global Advanced Network Detection And Link Facilitator)

> Because it shall not let problems pass.

Network monitoring dashboard for the LotusGuild Proxmox cluster.
Deployed on **LXC 157** (monitor-02 / 10.10.10.9), reachable at `gandalf.lotusguild.org`.

**Design System**: [web_template](https://code.lotusguild.org/LotusGuild/web_template) — shared CSS, JS, and layout patterns for all LotusGuild apps

## Styling & Layout

GANDALF uses the **LotusGuild Terminal Design System**. For all styling, component, and layout documentation see:

- [`web_template/README.md`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/README.md) — full component reference, CSS variables, JS API
- [`web_template/base.css`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/base.css) — unified CSS (`.lt-*` classes)
- [`web_template/base.js`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/base.js) — `window.lt` utilities (toast, modal, auto-refresh, fetch helpers)
- [`web_template/python/base.html`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/python/base.html) — Jinja2 base template
- [`web_template/python/auth.py`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/python/auth.py) — `@require_auth` decorator pattern

---

## Architecture

Two processes share a MariaDB database:

| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |

```
[Prometheus :9090]  ──▶
[UniFi Controller]  ──▶  monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[Pulse Worker]      ──▶
[SSH / ethtool]     ──▶
```

### Data Sources

| Source | What it provides |
|---|---|
| **Prometheus** (`10.10.10.48:9090`) | Physical NIC link state + traffic/error rates via `node_exporter` |
| **UniFi API** (`https://10.10.10.1`) | Switch port stats, device status, LLDP neighbor table, PoE data |
| **Pulse Worker** | SSH relay — runs `ethtool` + SFP DOM queries on each Proxmox host |
| **Ping** | Reachability for hosts without `node_exporter` (e.g. PBS) |

### Monitored Hosts (Prometheus / node_exporter)

| Host | Prometheus Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |

Ping-only (no node_exporter): **pbs** (10.10.10.3)
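
As a rough illustration of how link state could be derived from Prometheus, the sketch below parses an instant-query response (the standard `/api/v1/query` JSON shape) into per-host NIC states. It assumes the `node_network_up` metric from `node_exporter`; the function name, the sample payload, and the two-entry host map are hypothetical and not taken from `monitor.py`.

```python
from typing import Dict

# Hypothetical subset of the instance -> hostname mapping from config.json.
INSTANCE_TO_HOST = {
    "10.10.10.2:9100": "large1",
    "10.10.10.11:9100": "storage-01",
}


def parse_link_states(payload: dict, instance_to_host: Dict[str, str]) -> Dict[str, Dict[str, bool]]:
    """Turn a Prometheus instant-query response into {host: {device: is_up}}."""
    states: Dict[str, Dict[str, bool]] = {}
    if payload.get("status") != "success":
        return states
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        host = instance_to_host.get(labels.get("instance", ""))
        if host is None:
            continue  # instance not listed in the host map: ignore it
        # sample["value"] is [unix_timestamp, "<value as string>"]
        is_up = float(sample["value"][1]) == 1.0
        states.setdefault(host, {})[labels.get("device", "?")] = is_up
    return states


# Invented sample data in the shape returned for the query `node_network_up`.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno1"}, "value": [1700000000, "1"]},
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno2"}, "value": [1700000000, "0"]},
            {"metric": {"instance": "10.10.10.99:9100", "device": "eth0"}, "value": [1700000000, "1"]},
        ],
    },
}

link_states = parse_link_states(sample_payload, INSTANCE_TO_HOST)
print(link_states)
```

Samples from instances that are not in the host map are dropped, which mirrors the idea that only hosts listed in `config.json` are monitored.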

---

## Pages

### Dashboard (`/`)
- Real-time host status grid with per-NIC link state (UP / DOWN / degraded)
- Network topology diagram (Internet → Gateway → Switches → Hosts)
- UniFi device table (switches, APs, gateway)
- Active alerts table with severity, target, consecutive failures, ticket link
- Quick-suppress modal: apply timed or manual suppression from any alert row
- Auto-refreshes every 30 seconds via `/api/status` + `/api/network`

### Link Debug (`/links`)
Per-interface statistics collected every poll cycle. All panels are collapsible
(click header or use Collapse All / Expand All). Collapse state persists across
page refreshes via `sessionStorage`.

**Server NICs** (via Prometheus + SSH/ethtool):
- Speed, duplex, auto-negotiation, link detected
- TX/RX rate bars (bandwidth utilisation % of link capacity)
- TX/RX error and drop rates per second
- Carrier changes (cumulative since boot — watch for flapping)
- **SFP / Optical panel** (when SFP module present): vendor/PN, temp, voltage,
  bias current, TX power (dBm), RX power (dBm), RX−TX delta, per-stat bars
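
The utilisation bars and the RX−TX delta listed above come down to simple arithmetic. The following sketch shows one plausible way to compute them; the function names and the choice of units (bytes per second for rates, Mbit/s for link speed) are assumptions rather than the dashboard's actual code.

```python
def utilisation_pct(rate_bytes_per_s: float, link_speed_mbps: float) -> float:
    """Bandwidth utilisation as a percentage of link capacity, capped at 100."""
    if link_speed_mbps <= 0:
        return 0.0  # unknown or down link: nothing sensible to show
    capacity_bits_per_s = link_speed_mbps * 1_000_000
    return min(100.0, rate_bytes_per_s * 8 * 100 / capacity_bits_per_s)


def rx_tx_delta_db(rx_power_dbm: float, tx_power_dbm: float) -> float:
    """Difference between received and transmitted optical power, in dB."""
    return rx_power_dbm - tx_power_dbm


# 125 MB/s on a 10 Gbit/s link
print(utilisation_pct(125_000_000, 10_000))
# RX at -3.5 dBm against TX at -1.5 dBm
print(rx_tx_delta_db(-3.5, -1.5))
```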

**UniFi Switch Ports** (via UniFi API):
- Port number badge (`#N`), UPLINK badge, PoE draw badge
- LLDP neighbor line: `→ system_name (port_id)` when neighbor is detected
- PoE class and max wattage line
- Speed, duplex, auto-neg, TX/RX rates, errors, drops

### Inspector (`/inspector`)
Visual switch chassis diagrams. Each switch is rendered model-accurately using
layout config in the template (`SWITCH_LAYOUTS`).

**Port block colours:**
| Colour | State |
|---|---|
| Green | Up, no active PoE |
| Amber | Up with active PoE draw |
| Cyan | Uplink port (up) |
| Grey | Down |
| White outline | Currently selected |

**Clicking a port** opens the right-side detail panel showing:
- Link stats (status, speed, duplex, auto-neg, media type)
- PoE (class, max wattage, current draw, mode)
- Traffic (TX/RX rates)
- Errors/drops per second
- **LLDP Neighbor** section (system name, port ID, chassis ID, management IPs)
- **Path Debug** (auto-appears when LLDP `system_name` matches a known server):
  two-column comparison of the switch port stats vs. the server NIC stats,
  including SFP DOM data if the server side has an SFP module

**LLDP path debug requirements:**
1. Server must run `lldpd`: `apt install lldpd && systemctl enable --now lldpd`
2. `lldpd` hostname must match the key in `data.hosts` (set via `config.json → hosts`)
3. Switch has LLDP enabled (UniFi default: on)

**Supported switch models** (set `SWITCH_LAYOUTS` keys to your UniFi model codes):

| Key | Model | Layout |
|---|---|---|
| `USF5P` | UniFi Switch Flex 5 PoE | 4×RJ45 + 1×SFP uplink |
| `USL8A` | UniFi Switch Aggregation | 8×SFP (2 rows of 4) |
| `US24PRO` | UniFi Switch Pro 24 | 24×RJ45 staggered + 2×SFP |
| `USPPDUP` | Custom/other | Single-port fallback |
| `USMINI` | UniFi Switch Mini | 5-port row |

Add new layouts by adding a key to `SWITCH_LAYOUTS` matching the `model` field
returned by the UniFi API for that device.

### Suppressions (`/suppressions`)
- Create timed (30 min / 1 hr / 4 hr / 8 hr) or manual suppressions
- Target types: host, interface, UniFi device, or global
- Active suppressions table with one-click removal
- Suppression history (last 50)
- Available targets reference grid (all known hosts + interfaces)

---

## Alert Logic

### Ticket Triggers

| Condition | Priority |
|---|---|
| UniFi device offline (≥2 consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (≥2 consecutive checks) | P2 High |
| Host unreachable via ping (≥2 consecutive checks) | P2 High |
| ≥3 hosts simultaneously reporting interface failures | P1 Critical |
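
The trigger table can be read as two counters: a per-target streak of consecutive failures, and a count of distinct hosts failing at the same time. The sketch below is one way to express that, assuming the defaults from `config.json`; `FailureTracker` and `ticket_priority` are hypothetical names, not functions from `monitor.py`.

```python
from collections import defaultdict
from typing import Iterable

FAILURE_THRESHOLD = 2   # monitor.failure_threshold
CLUSTER_THRESHOLD = 3   # monitor.cluster_threshold


class FailureTracker:
    """Counts consecutive failed checks per target (hypothetical helper)."""

    def __init__(self, failure_threshold: int = FAILURE_THRESHOLD) -> None:
        self.failure_threshold = failure_threshold
        self.counts = defaultdict(int)

    def record(self, target: str, ok: bool) -> bool:
        """Record one check; return True when the target first reaches the threshold."""
        if ok:
            self.counts.pop(target, None)  # recovery resets the streak
            return False
        self.counts[target] += 1
        return self.counts[target] == self.failure_threshold


def ticket_priority(hosts_with_failures: Iterable[str],
                    cluster_threshold: int = CLUSTER_THRESHOLD) -> str:
    """P1 when enough distinct hosts fail at once, otherwise P2."""
    if len(set(hosts_with_failures)) >= cluster_threshold:
        return "P1 Critical"
    return "P2 High"


tracker = FailureTracker()
first = tracker.record("large1/eno1", ok=False)   # first failure: below threshold
second = tracker.record("large1/eno1", ok=False)  # second failure: threshold reached
print(first, second)
print(ticket_priority(["large1", "micro1", "storage-01"]))
```

Returning `True` only on the check that reaches the threshold keeps a long outage from opening a new ticket on every poll cycle.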

### Baseline Tracking

Interfaces that are **down on first observation** (unused ports, unplugged cables)
are recorded as `initial_down` and never alerted on. Only **UP→DOWN regressions**
generate tickets. The baseline is stored in MariaDB and survives daemon restarts.
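
A minimal sketch of the baseline rule, assuming the baseline is a plain dictionary keyed by `host/interface` (the real JSON layout is not documented here). Only the `initial_down` label comes from this README; the function name and the other return values are illustrative.

```python
def evaluate_interface(baseline: dict, key: str, is_up: bool) -> str:
    """Update the baseline for one interface and classify the observation."""
    previous = baseline.get(key)
    if previous is None:
        # First observation: remember what we saw, never alert.
        baseline[key] = "up" if is_up else "initial_down"
        return "baseline_recorded"
    if previous == "initial_down":
        if is_up:
            baseline[key] = "up"  # from now on a drop counts as a regression
            return "now_tracked"
        return "ignored"
    # Interface was previously seen up.
    return "ok" if is_up else "regression"


baseline = {}
print(evaluate_interface(baseline, "large1/eno4", is_up=False))  # unused port
print(evaluate_interface(baseline, "large1/eno4", is_up=False))
print(evaluate_interface(baseline, "large1/eno1", is_up=True))
print(evaluate_interface(baseline, "large1/eno1", is_up=False))
```

Because the dictionary holds only strings, it can be serialised to JSON and stored between daemon restarts.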

### Suppression Targets

| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |

Suppressions can be manual (persist until removed) or timed (auto-expire).
Expiry is checked at evaluation time, so no background cleanup is needed.
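
The matching rules and the evaluate-time expiry check could look roughly like this. The `Suppression` dataclass and its field names are assumptions made for the example and do not necessarily match the `suppression_rules` columns.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass
class Suppression:
    target_type: str                        # "host", "interface", "unifi_device" or "all"
    target_name: str = ""                   # host or device name
    target_detail: str = ""                 # interface name, for "interface" rules
    expires_at: Optional[datetime] = None   # None means manual (no expiry)


def rule_matches(rule: Suppression, alert_type: str, name: str, detail: str = "") -> bool:
    """Does this rule cover an alert of the given type and target?"""
    if rule.target_type == "all":
        return True
    if rule.target_type == "host":
        return alert_type == "interface" and rule.target_name == name
    if rule.target_type == "interface":
        return (alert_type == "interface" and rule.target_name == name
                and rule.target_detail == detail)
    if rule.target_type == "unifi_device":
        return alert_type == "unifi_device" and rule.target_name == name
    return False


def is_suppressed(rules: Iterable[Suppression], alert_type: str, name: str,
                  detail: str = "", now: Optional[datetime] = None) -> bool:
    now = now or datetime.now(timezone.utc)
    for rule in rules:
        if rule.expires_at is not None and rule.expires_at <= now:
            continue  # expired: skipped here, so no cleanup job is required
        if rule_matches(rule, alert_type, name, detail):
            return True
    return False


now = datetime(2026, 3, 3, 12, 0, tzinfo=timezone.utc)
rules = [
    Suppression("host", "large1", expires_at=now + timedelta(hours=1)),
    Suppression("interface", "micro1", "eno1", expires_at=now - timedelta(minutes=5)),
]
print(is_suppressed(rules, "interface", "large1", "eno2", now=now))  # covered by host rule
print(is_suppressed(rules, "interface", "micro1", "eno1", now=now))  # rule has expired
```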

---

## Configuration (`config.json`)

Shared by both processes. Located in the working directory (`/var/www/html/prod/`).

```json
{
  "database": {
    "host": "10.10.10.50",
    "port": 3306,
    "user": "gandalf",
    "password": "...",
    "name": "gandalf"
  },
  "prometheus": {
    "url": "http://10.10.10.48:9090"
  },
  "unifi": {
    "controller": "https://10.10.10.1",
    "api_key": "...",
    "site_id": "default"
  },
  "ticket_api": {
    "url": "https://t.lotusguild.org/api/tickets",
    "api_key": "..."
  },
  "pulse": {
    "url": "http://<pulse-host>:<port>",
    "api_key": "...",
    "worker_id": "...",
    "timeout": 45
  },
  "auth": {
    "allowed_groups": ["admin"]
  },
  "hosts": [
    { "name": "large1",               "prometheus_instance": "10.10.10.2:9100" },
    { "name": "compute-storage-01",   "prometheus_instance": "10.10.10.4:9100" },
    { "name": "micro1",               "prometheus_instance": "10.10.10.8:9100" },
    { "name": "monitor-02",           "prometheus_instance": "10.10.10.9:9100" },
    { "name": "compute-storage-gpu-01", "prometheus_instance": "10.10.10.10:9100" },
    { "name": "storage-01",           "prometheus_instance": "10.10.10.11:9100" }
  ],
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [
      { "name": "pbs", "ip": "10.10.10.3" }
    ]
  }
}
```

### Key Config Fields

| Key | Description |
|---|---|
| `database.*` | MariaDB credentials (LXC 149 at 10.10.10.50) |
| `prometheus.url` | Prometheus base URL |
| `unifi.controller` | UniFi controller base URL (HTTPS, self-signed cert ignored) |
| `unifi.api_key` | UniFi API key from controller Settings → API |
| `unifi.site_id` | UniFi site ID (default: `default`) |
| `ticket_api.api_key` | Tinker Tickets bearer token |
| `pulse.url` | Pulse worker API base URL (for SSH relay) |
| `pulse.worker_id` | Which Pulse worker runs ethtool collection |
| `pulse.timeout` | Max seconds to wait for SSH collection per host |
| `auth.allowed_groups` | Authelia groups that may access Gandalf |
| `hosts` | Maps Prometheus instance labels → display hostnames |
| `monitor.poll_interval` | Seconds between full check cycles (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before creating ticket (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures to trigger cluster-wide P1 (default: 3) |
| `monitor.ping_hosts` | Hosts checked only by ping (no node_exporter) |
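
For illustration, here is how a process might read this file, apply the documented `monitor.*` defaults, and build the Prometheus instance to hostname map. The `load_settings` helper and the trimmed inline config are hypothetical; the real loaders in `app.py` and `monitor.py` may differ.

```python
import json

# Defaults as documented in the table above.
MONITOR_DEFAULTS = {"poll_interval": 120, "failure_threshold": 2, "cluster_threshold": 3}


def load_settings(raw: str) -> dict:
    """Parse config JSON text and derive the values the monitor needs."""
    cfg = json.loads(raw)
    monitor = {**MONITOR_DEFAULTS, **cfg.get("monitor", {})}
    instance_to_host = {
        h["prometheus_instance"]: h["name"] for h in cfg.get("hosts", [])
    }
    return {"monitor": monitor, "instance_to_host": instance_to_host}


# Trimmed example config; in production this would be the contents of config.json.
raw_config = """
{
  "hosts": [
    { "name": "large1", "prometheus_instance": "10.10.10.2:9100" },
    { "name": "micro1", "prometheus_instance": "10.10.10.8:9100" }
  ],
  "monitor": { "poll_interval": 60 }
}
"""

settings = load_settings(raw_config)
print(settings["monitor"])
print(settings["instance_to_host"])
```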

---

## Deployment (LXC 157)

### 1. Database — MariaDB LXC 149 (`10.10.10.50`)

```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```

Import schema:
```bash
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```

### 2. LXC 157 — Install dependencies

```bash
pip3 install -r requirements.txt
# Ensure sshpass is available (used by deploy scripts)
apt install sshpass
```

### 3. Deploy files

```bash
# From the dev machine, in /root/code/gandalf:
for f in app.py db.py monitor.py config.json schema.sql \
          static/style.css static/app.js \
          templates/*.html; do
  sshpass -p 'yourpass' scp -o StrictHostKeyChecking=no \
    "$f" "root@10.10.10.61:/var/www/html/prod/$f"
done
# Then, on LXC 157, restart both services:
systemctl restart gandalf gandalf-monitor
```

### 4. systemd services

**`gandalf.service`** (Flask/gunicorn web app):
```ini
[Unit]
Description=Gandalf Web Dashboard
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
Restart=always

[Install]
WantedBy=multi-user.target
```

**`gandalf-monitor.service`** (background polling daemon):
```ini
[Unit]
Description=Gandalf Network Monitor Daemon
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 monitor.py
Restart=always

[Install]
WantedBy=multi-user.target
```

### 5. Authelia rule (LXC 167)

```yaml
access_control:
  rules:
    - domain: gandalf.lotusguild.org
      policy: one_factor
      subject:
        - group:admin
```

```bash
systemctl restart authelia
```

### 6. NPM reverse proxy

- **Domain:** `gandalf.lotusguild.org`
- **Forward to:** `http://10.10.10.61:8000` (gunicorn direct, no nginx needed on LXC)
- **Forward Auth:** Authelia at `http://10.10.10.167:9091`
- **WebSockets:** Not required

---

## Service Management

```bash
# Status
systemctl status gandalf gandalf-monitor

# Logs (live)
journalctl -u gandalf -f
journalctl -u gandalf-monitor -f

# Restart after code or config changes
systemctl restart gandalf gandalf-monitor
```

---

## Troubleshooting

### Monitor not creating tickets
- Verify `config.json → ticket_api.api_key` is set and valid
- Check `journalctl -u gandalf-monitor` for `Ticket creation failed` lines
- Confirm the Tinker Tickets API is reachable from LXC 157

### Link Debug shows no data / "Loading…" forever
- Check `gandalf-monitor.service` is running and has completed at least one cycle
- Check `journalctl -u gandalf-monitor` for Prometheus or UniFi errors
- Verify Prometheus is reachable: `curl 'http://10.10.10.48:9090/api/v1/query?query=up'`

### Link Debug: SFP DOM panel missing
- SFP data requires Pulse worker + SSH access to hosts
- Verify `config.json → pulse.*` is configured and the Pulse worker is running
- Confirm `sshpass` + SSH access from the Pulse worker to each Proxmox host
- Only interfaces with physical SFP modules return DOM data (`ethtool -m`)

### Inspector: path debug section not appearing
- Requires LLDP: run `apt install lldpd && systemctl enable --now lldpd` on each server
- The LLDP `system_name` broadcast by `lldpd` must match the hostname in `config.json → hosts[].name`
  - Override: `echo 'configure system hostname large1' > /etc/lldpd.d/hostname.conf && systemctl restart lldpd`
- Allow up to 2 poll cycles (240s) after installing lldpd for LLDP table to populate

### Inspector: switch chassis shows as flat list (no layout)
- The switch's `model` field from UniFi doesn't match any key in `SWITCH_LAYOUTS` in `inspector.html`
- To find the model code, look in the `/api/links` JSON response (the `link_stats` data) under `unifi_switches.<name>.model`
- Add the model key to `SWITCH_LAYOUTS` in `inspector.html` with the correct row/SFP layout

### Baseline re-initializing on every restart
- `interface_baseline` is stored in the `monitor_state` DB table; survives restarts
- If it appears to reset: check DB connectivity from the monitor daemon

### Interface stuck at "initial_down" forever
- This means the interface was down when the monitor first saw it
- It is tracked automatically once it comes up. To force a reset, clear the stored baseline (this resets the baseline for every interface, not just the stuck one):
  ```sql
  -- In MariaDB on 10.10.10.50:
  UPDATE monitor_state SET value='{}' WHERE key_name='interface_baseline';
  ```
  Then restart the monitor: `systemctl restart gandalf-monitor`

### Prometheus data missing for a host
```bash
# On the affected host:
systemctl status prometheus-node-exporter
# Verify it's scraped:
curl 'http://10.10.10.48:9090/api/v1/query?query=up' | jq '.data.result[] | select(.metric.job=="node")'
```

---

## Development Notes

### File Layout

```
gandalf/
├── app.py              # Flask web app (routes, auth, API endpoints)
├── monitor.py          # Background daemon (Prometheus, UniFi, Pulse, alert logic)
├── db.py               # Database operations (MariaDB via pymysql, thread-local conn reuse)
├── schema.sql          # Database schema (network_events, suppression_rules, monitor_state)
├── config.json         # Runtime configuration (not committed with secrets)
├── requirements.txt    # Python dependencies
├── static/
│   ├── style.css       # Terminal aesthetic CSS (CRT scanlines, green-on-black)
│   └── app.js          # Dashboard JS (auto-refresh, host grid, events, suppress modal)
└── templates/
    ├── base.html       # Shared layout (header, nav, footer)
    ├── index.html      # Dashboard page
    ├── links.html      # Link Debug page (server NICs + UniFi switch ports)
    ├── inspector.html  # Visual switch inspector + LLDP path debug
    └── suppressions.html # Suppression management page
```

### Adding a New Monitored Host

1. Install `prometheus-node-exporter` on the host
2. Add a scrape target to Prometheus config
3. Add an entry to `config.json → hosts`:
   ```json
   { "name": "newhost", "prometheus_instance": "10.10.10.X:9100" }
   ```
4. Restart monitor: `systemctl restart gandalf-monitor`
5. For SFP DOM / ethtool: ensure the host is SSH-accessible from the Pulse worker

### Adding a New Switch Layout (Inspector)

Find the UniFi model code for the switch (it appears in the `/api/links` JSON response
under `unifi_switches.<switch_name>.model`), then add to `SWITCH_LAYOUTS` in
`templates/inspector.html`:

```javascript
'MYNEWMODEL': {
  rows: [[1,2,3,4,5,6,7,8], [9,10,11,12,13,14,15,16]],  // port_idx by row
  sfp_section: [17, 18],  // separate SFP cage ports (rendered below rows)
  sfp_ports: [],          // port_idx values that are SFP-type within rows
},
```

### Database Schema Notes

- `network_events`: one row per active event; `resolved_at` is set when recovered
- `suppression_rules`: `active=FALSE` when removed; `expires_at` checked at query time
- `monitor_state`: key/value store; `interface_baseline` and `link_stats` are JSON blobs

### Security Notes

- **XSS prevention**: all user-controlled data in dynamically generated HTML uses
  `escHtml()` (JS) or Jinja2 auto-escaping (Python). Suppress buttons use `data-*`
  attributes + a single delegated click listener rather than inline `onclick` with
  interpolated strings.
- **Interface name validation**: `monitor.py` validates SSH interface names against
  `^[a-zA-Z0-9_.@-]+$` before use, and additionally wraps them with `shlex.quote()`
  for defense-in-depth.
- **DB parameters**: all SQL uses parameterised queries via pymysql — no string
  concatenation into SQL.
- **Auth**: Authelia enforces admin-only access at the nginx/LXC 167 layer; the Flask
  app additionally checks the `Remote-User` header via `@require_auth`.
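
The interface name check described above can be sketched as follows. The allowed character class is the one quoted in this README; the helper name and the exact `ethtool` invocation are assumptions. The sketch uses `re.fullmatch` because in Python a pattern ending in `$` also matches just before a trailing newline.

```python
import re
import shlex

# Same character class as the pattern quoted in the security notes.
IFACE_RE = re.compile(r"[a-zA-Z0-9_.@-]+")


def build_ethtool_command(iface: str) -> str:
    """Validate an interface name, then quote it for use in a remote shell command."""
    if not IFACE_RE.fullmatch(iface):
        raise ValueError(f"invalid interface name: {iface!r}")
    # shlex.quote is redundant after validation, but cheap defense-in-depth.
    return f"ethtool {shlex.quote(iface)}"


print(build_ethtool_command("eno1"))

try:
    build_ethtool_command("eth0; rm -rf /")
except ValueError as exc:
    print("rejected:", exc)
```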

### Known Limitations

- Single gunicorn worker (`--workers 1`): `db.py` reuses one connection per thread
  (thread-local). Running more workers would still work, since each worker process
  opens its own connections, but the reuse only pays off within a single worker.
- No CSRF tokens on API endpoints — mitigated by Authelia session cookies being
  `SameSite=Strict` and the site being admin-only.
- SSH collection via Pulse is synchronous — if Pulse is slow, the entire monitor cycle
  is delayed. The `pulse.timeout` config controls the max wait.
- UniFi LLDP data is only as fresh as the last monitor poll (120s default).
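
To make the thread-local connection reuse mentioned in the first limitation concrete, here is a small sketch of the pattern. It uses a stand-in `FakeConnection` class so it runs without a database; in `db.py` the connection would come from pymysql, whose connections provide `ping(reconnect=True)`. The `get_connection` name is hypothetical.

```python
import threading


class FakeConnection:
    """Stand-in for a database connection so the sketch runs without MariaDB."""

    def __init__(self) -> None:
        self.pings = 0

    def ping(self, reconnect: bool = True) -> None:
        self.pings += 1  # a real driver would check and re-open the socket here


_local = threading.local()


def get_connection(connect=FakeConnection):
    """Return this thread's connection, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = connect()
        _local.conn = conn
    else:
        conn.ping(reconnect=True)  # reuse, but make sure the link is still alive
    return conn


main_a = get_connection()
main_b = get_connection()

seen_in_thread = []
worker = threading.Thread(target=lambda: seen_in_thread.append(get_connection()))
worker.start()
worker.join()

print(main_a is main_b)             # same thread reuses one connection
print(seen_in_thread[0] is main_a)  # another thread gets its own
```

Each thread pays the connection cost once and then reuses it, which is the saving the limitation above refers to.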
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
+								# GANDALF (Global Advanced Network Detection And Link Facilitator)
-												feat: inspector page, link debug enhancements, security hardening

- Add /inspector page: visual model-accurate switch chassis diagrams
  (USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks
  with color coding (green=up, amber=PoE, cyan=uplink, grey=down),
  detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side

- Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max,
  collapsible host/switch panels with sessionStorage persistence

- monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch
  port; PulseClient uses requests.Session() for HTTP keep-alive; add
  shlex.quote() around interface names (defense-in-depth)

- Security: suppress buttons use data-* attrs + delegated click handler
  instead of inline onclick with Jinja2 variable interpolation; remove
  | safe filter from user-controlled fields in suppressions.html;
  setDuration() takes explicit el param instead of implicit event global

- db.py: thread-local connection reuse with ping(reconnect=True) to
  avoid a new TCP handshake per query

- .gitignore: add config.json (contains credentials), __pycache__

- README: full rewrite covering architecture, all 4 pages, alert logic,
  config reference, deployment, troubleshooting, security notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-03 15:39:48 -05:00
+								> Because it shall not let problems pass.
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								Network monitoring dashboard for the LotusGuild Proxmox cluster.
 								Deployed on **LXC 157** (monitor-02 / 10.10.10.9), reachable at `gandalf.lotusguild.org`.
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Apply LotusGuild design system convergence (aesthetic_diff.md)

CSS (style.css):
- §1: Add unified naming aliases (--terminal-green, --bg-primary, etc.)
- §2: Upgrade borders: modal 1px→3px double, btn/btn-sm/inputs 1px→2px
- §3: Add [ ] bracket decorations to .btn classes; primary keeps > prefix;
  hover lift -1px→-2px; padding 6px 14px→5px 12px
- §4: Fix glow definitions from 2-layer rgba to 3-layer solid stack
- §5: Section headers now symmetric ╠═══ TITLE ═══╣ (was one-sided)
- §6+§7: Modal border 3px double, corners ┌┐→╔╗, add glow shadow
- §11: Nav active state now amber tint (was green); hover remains green
- §15: Scanline opacity 0.13→0.15; flicker delay 45s→30s

JS (app.js):
- §18: Replace custom showToast() with lt.toast.* delegate wrapper

Templates (base.html):
- Load base.css and base.js (symlinked from web_template)
- Add lt-boot overlay for boot sequence animation (§13)

README: Remove completed pending convergence items

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-14 21:40:20 -04:00
+								**Design System**: [web_template](https://code.lotusguild.org/LotusGuild/web_template) — shared CSS, JS, and layout patterns for all LotusGuild apps
 								## Styling & Layout
 								GANDALF uses the **LotusGuild Terminal Design System**. For all styling, component, and layout documentation see:
 								- [`web_template/README.md`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/README.md) — full component reference, CSS variables, JS API
 								- [`web_template/base.css`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/base.css) — unified CSS (`.lt-*` classes)
 								- [`web_template/base.js`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/base.js) — `window.lt` utilities (toast, modal, auto-refresh, fetch helpers)
 								- [`web_template/python/base.html`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/python/base.html) — Jinja2 base template
 								- [`web_template/python/auth.py`](https://code.lotusguild.org/LotusGuild/web_template/src/branch/main/python/auth.py) — `@require_auth` decorator pattern
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								---
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								## Architecture
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												feat: inspector page, link debug enhancements, security hardening

- Add /inspector page: visual model-accurate switch chassis diagrams
  (USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks
  with color coding (green=up, amber=PoE, cyan=uplink, grey=down),
  detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side

- Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max,
  collapsible host/switch panels with sessionStorage persistence

- monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch
  port; PulseClient uses requests.Session() for HTTP keep-alive; add
  shlex.quote() around interface names (defense-in-depth)

- Security: suppress buttons use data-* attrs + delegated click handler
  instead of inline onclick with Jinja2 variable interpolation; remove
  | safe filter from user-controlled fields in suppressions.html;
  setDuration() takes explicit el param instead of implicit event global

- db.py: thread-local connection reuse with ping(reconnect=True) to
  avoid a new TCP handshake per query

- .gitignore: add config.json (contains credentials), __pycache__

- README: full rewrite covering architecture, all 4 pages, alert logic,
  config reference, deployment, troubleshooting, security notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-03 15:39:48 -05:00
+								Two processes share a MariaDB database:
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								| Process | Service | Role |
 								|---|---|---|
 								| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
 								| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |
-												first commit

											
										
										
											2025-01-04 00:07:15 -05:00
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								```
-												feat: inspector page, link debug enhancements, security hardening

- Add /inspector page: visual model-accurate switch chassis diagrams
  (USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks
  with color coding (green=up, amber=PoE, cyan=uplink, grey=down),
  detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side

- Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max,
  collapsible host/switch panels with sessionStorage persistence

- monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch
  port; PulseClient uses requests.Session() for HTTP keep-alive; add
  shlex.quote() around interface names (defense-in-depth)

- Security: suppress buttons use data-* attrs + delegated click handler
  instead of inline onclick with Jinja2 variable interpolation; remove
  | safe filter from user-controlled fields in suppressions.html;
  setDuration() takes explicit el param instead of implicit event global

- db.py: thread-local connection reuse with ping(reconnect=True) to
  avoid a new TCP handshake per query

- .gitignore: add config.json (contains credentials), __pycache__

- README: full rewrite covering architecture, all 4 pages, alert logic,
  config reference, deployment, troubleshooting, security notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-03 15:39:48 -05:00
+								[Prometheus :9090]  ──▶
 								[UniFi Controller]  ──▶  monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
 								[Pulse Worker]      ──▶
 								[SSH / ethtool]     ──▶
```

### Data Sources

| Source | What it provides |
|---|---|
| **Prometheus** (`10.10.10.48:9090`) | Physical NIC link state + traffic/error rates via `node_exporter` |
| **UniFi API** (`https://10.10.10.1`) | Switch port stats, device status, LLDP neighbor table, PoE data |
| **Pulse Worker** | SSH relay — runs `ethtool` + SFP DOM queries on each Proxmox host |
| **Ping** | Reachability for hosts without `node_exporter` (e.g. PBS) |
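
As a rough illustration of the Prometheus row, the sketch below builds an instant query for `node_network_up` (the node_exporter link-state metric) and turns the response into a per-NIC up/down map. The helper names and the sample payload are hypothetical and not taken from `monitor.py`; only the `/api/v1/query` endpoint and the response shape come from the Prometheus HTTP API.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://10.10.10.48:9090"  # address from the Data Sources table


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_link_states(payload: dict) -> dict:
    """Map (instance, device) -> True (link up) / False (link down)."""
    states = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        # Each sample value is [unix_timestamp, "<value as a string>"].
        _, raw_value = sample["value"]
        states[(labels["instance"], labels["device"])] = float(raw_value) == 1.0
    return states


# Hypothetical response body, trimmed to the fields used above.
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno1"},
             "value": [1772570388, "1"]},
            {"metric": {"instance": "10.10.10.2:9100", "device": "eno2"},
             "value": [1772570388, "0"]},
        ],
    },
}

link_states = parse_link_states(SAMPLE_PAYLOAD)
```

In a real poller the URL from `build_query_url` would be fetched over HTTP and the decoded JSON passed to `parse_link_states`; the fetch is left out here so the sketch runs without network access.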

### Monitored Hosts (Prometheus / node_exporter)

| Host | Prometheus Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |

Ping-only (no node_exporter): **pbs** (10.10.10.3)

---

## Pages

### Dashboard (`/`)

- Real-time host status grid with per-NIC link state (UP / DOWN / degraded)
- Network topology diagram (Internet → Gateway → Switches → Hosts)
- UniFi device table (switches, APs, gateway)
- Active alerts table with severity, target, consecutive failures, ticket link
- Quick-suppress modal: apply timed or manual suppression from any alert row
- Auto-refreshes every 30 seconds via `/api/status` + `/api/network`

### Link Debug (`/links`)

Per-interface statistics collected every poll cycle. All panels are collapsible
(click header or use Collapse All / Expand All). Collapse state persists across
page refreshes via `sessionStorage`.

**Server NICs** (via Prometheus + SSH/ethtool):

- Speed, duplex, auto-negotiation, link detected
- TX/RX rate bars (bandwidth utilisation % of link capacity)
- TX/RX error and drop rates per second
- Carrier changes (cumulative since boot — watch for flapping)
- **SFP / Optical panel** (when SFP module present): vendor/PN, temp, voltage,
  bias current, TX power (dBm), RX power (dBm), RX−TX delta, per-stat bars

**UniFi Switch Ports** (via UniFi API):

- Port number badge (`#N`), UPLINK badge, PoE draw badge
- LLDP neighbor line: `→ system_name (port_id)` when neighbor is detected
- PoE class and max wattage line
- Speed, duplex, auto-neg, TX/RX rates, errors, drops

### Inspector (`/inspector`)

Visual switch chassis diagrams. Each switch is rendered model-accurately using
layout config in the template (`SWITCH_LAYOUTS`).

**Port block colours:**

| Colour | State |
|---|---|
| Green | Up, no active PoE |
| Amber | Up with active PoE draw |
| Cyan | Uplink port (up) |
| Grey | Down |
| White outline | Currently selected |

**Clicking a port** opens the right-side detail panel showing:

- Link stats (status, speed, duplex, auto-neg, media type)
- PoE (class, max wattage, current draw, mode)
- Traffic (TX/RX rates)
- Errors/drops per second
- **LLDP Neighbor** section (system name, port ID, chassis ID, management IPs)
- **Path Debug** (auto-appears when LLDP `system_name` matches a known server):
  two-column comparison of the switch port stats vs. the server NIC stats,
  including SFP DOM data if the server side has an SFP module

**LLDP path debug requirements:**

1. Server must run `lldpd`: `apt install lldpd && systemctl enable --now lldpd`
2. `lldpd` hostname must match the key in `data.hosts` (set via `config.json → hosts`)
3. Switch has LLDP enabled (UniFi default: on)
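
To make the hostname requirement concrete, here is a small sketch of the name match it implies. It assumes an exact, case-sensitive comparison between the LLDP `system_name` and the keys of `data.hosts`; `find_path_debug_host` and the sample host map are hypothetical, and the dashboard's actual comparison may differ.

```python
from typing import Optional


def find_path_debug_host(system_name: Optional[str], hosts: dict) -> Optional[str]:
    """Return the hosts key equal to the LLDP system_name, or None if no match."""
    if system_name and system_name in hosts:
        return system_name
    return None


# Hypothetical host map keyed like `config.json → hosts`; values omitted.
hosts = {"large1": {}, "micro1": {}}

matched = find_path_debug_host("large1", hosts)
unmatched = find_path_debug_host("large1.lotusguild.org", hosts)
```

Under this assumption a server announcing an FQDN such as `large1.lotusguild.org` would not match a `large1` key, which is why the name reported by `lldpd` has to line up with the configured host key.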

**Supported switch models** (set `SWITCH_LAYOUTS` keys to your UniFi model codes):

| Key | Model | Layout |
|---|---|---|
| `USF5P` | UniFi Switch Flex 5 PoE | 4×RJ45 + 1×SFP uplink |
| `USL8A` | UniFi Switch Aggregation | 8×SFP (2 rows of 4) |
| `US24PRO` | UniFi Switch Pro 24 | 24×RJ45 staggered + 2×SFP |
| `USPPDUP` | Custom/other | Single-port fallback |
| `USMINI` | UniFi Switch Mini | 5-port row |

Add new layouts by adding a key to `SWITCH_LAYOUTS` matching the `model` field
returned by the UniFi API for that device.

### Suppressions (`/suppressions`)

- Create timed (30 min / 1 hr / 4 hr / 8 hr) or manual suppressions
- Target types: host, interface, UniFi device, or global
- Active suppressions table with one-click removal
- Suppression history (last 50)
- Available targets reference grid (all known hosts + interfaces)

---

## Alert Logic

### Ticket Triggers

| Condition | Priority |
|---|---|
| UniFi device offline (≥2 consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (≥2 consecutive checks) | P2 High |
| Host unreachable via ping (≥2 consecutive checks) | P2 High |
| ≥3 hosts simultaneously reporting interface failures | P1 Critical |
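
The thresholds in the table can be sketched as a per-target consecutive-failure counter plus a cluster-wide check. This is a hedged illustration: the function names, the in-memory `counts` dict, and the choice to raise the P1 ticket alongside (rather than instead of) the P2 tickets are assumptions, not the behaviour of `monitor.py`.

```python
CONSECUTIVE_THRESHOLD = 2   # "≥2 consecutive checks" from the table
CLUSTER_HOST_THRESHOLD = 3  # "≥3 hosts" from the table


def update_failure_count(counts: dict, target: str, is_failing: bool) -> int:
    """Increment the consecutive-failure count for a target, or reset it on recovery."""
    if is_failing:
        counts[target] = counts.get(target, 0) + 1
    else:
        counts[target] = 0
    return counts[target]


def tickets_for_cycle(counts: dict, hosts_with_failures: set) -> list:
    """Return (target, priority) pairs that one poll cycle would raise."""
    tickets = [
        (target, "P2 High")
        for target, failures in sorted(counts.items())
        if failures >= CONSECUTIVE_THRESHOLD
    ]
    # Assumption: the cluster-wide ticket is added on top of per-target ones.
    if len(hosts_with_failures) >= CLUSTER_HOST_THRESHOLD:
        tickets.append(("cluster", "P1 Critical"))
    return tickets


counts = {}
update_failure_count(counts, "large1/eno1", True)   # first failed check: no ticket yet
update_failure_count(counts, "large1/eno1", True)   # second failed check: threshold met
cycle_tickets = tickets_for_cycle(counts, {"large1"})
```

Requiring two consecutive failed checks means a single missed poll does not open a ticket, at the cost of one extra poll interval before a real outage is reported.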

### Baseline Tracking

Interfaces that are **down on first observation** (unused ports, unplugged cables)
are recorded as `initial_down` and never alerted. Only **UP→DOWN regressions**
generate tickets. Baseline is stored in MariaDB and survives daemon restarts.
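
A minimal sketch of the baseline rule, assuming a plain dict stands in for the MariaDB table. The function name, the `"up"` state label, the return values, and the decision to promote an `initial_down` interface once it is seen up are illustrative assumptions; only the `initial_down` marker comes from the description above.

```python
def classify_observation(baseline: dict, key: str, is_up: bool) -> str:
    """Classify one poll result as 'ok', 'initial_down', 'ignored' or 'regression'."""
    previous = baseline.get(key)

    if previous is None:
        # First time this interface is seen: record what we found.
        baseline[key] = "up" if is_up else "initial_down"
        return "ok" if is_up else "initial_down"

    if previous == "initial_down":
        if is_up:
            # Assumption: a port that later comes up starts being tracked.
            baseline[key] = "up"
            return "ok"
        return "ignored"  # was down from the start, never alerted

    # The interface has been seen up before, so down is a regression.
    return "ok" if is_up else "regression"


baseline = {}
first_unused = classify_observation(baseline, "large1/eno2", False)
later_unused = classify_observation(baseline, "large1/eno2", False)
first_active = classify_observation(baseline, "large1/eno1", True)
later_active = classify_observation(baseline, "large1/eno1", False)
```

Only the last call yields `regression`, which is the one case that would feed the ticket thresholds; the unused port stays silent on every cycle.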

### Suppression Targets

| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |

Suppressions can be manual (persist until removed) or timed (auto-expire).

Expired suppressions are checked at evaluation time — no background cleanup needed.
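
The target types and the evaluation-time expiry check could look roughly like the sketch below. The `Suppression` fields, the `is_suppressed` helper, and the alert type names passed to it are hypothetical; the real schema lives in MariaDB and may be shaped differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Suppression:
    target_type: str                       # 'host', 'interface', 'unifi_device' or 'all'
    name: str = ""                         # host or device name (unused for 'all')
    detail: str = ""                       # interface name for 'interface' targets
    expires_at: Optional[datetime] = None  # None means manual (no expiry)


def is_suppressed(suppressions, alert_type, name, detail="", now=None):
    """Return True if any unexpired suppression covers the given alert."""
    now = now or datetime.now(timezone.utc)
    for s in suppressions:
        if s.expires_at is not None and s.expires_at <= now:
            continue  # expired rows are skipped here; nothing deletes them
        if s.target_type == "all":
            return True
        if s.target_type == "host" and alert_type == "interface" and s.name == name:
            return True
        if (s.target_type == "interface" and alert_type == "interface"
                and (s.name, s.detail) == (name, detail)):
            return True
        if s.target_type == "unifi_device" and alert_type == "unifi_device" and s.name == name:
            return True
    return False


now = datetime(2026, 3, 3, 12, 0, tzinfo=timezone.utc)
rules = [
    Suppression("all", expires_at=now - timedelta(minutes=1)),                 # already expired
    Suppression("host", "large1"),                                             # manual
    Suppression("interface", "micro1", "eno1", now + timedelta(minutes=30)),   # timed
]
```

Because expiry is compared against `now` on every evaluation, an expired row simply stops matching, so no cleanup job is needed for correctness.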
 								---
 								## Configuration (`config.json`)
 								Shared by both processes. Located in the working directory (`/var/www/html/prod/`).
 								```json
 								{
 								  "database": {
 								    "host": "10.10.10.50",
 								    "port": 3306,
 								    "user": "gandalf",
 								    "password": "...",
 								    "name": "gandalf"
 								  },
 								  "prometheus": {
 								    "url": "http://10.10.10.48:9090"
 								  },
 								  "unifi": {
 								    "controller": "https://10.10.10.1",
 								    "api_key": "...",
 								    "site_id": "default"
 								  },
 								  "ticket_api": {
 								    "url": "https://t.lotusguild.org/api/tickets",
 								    "api_key": "..."
 								  },
 								  "pulse": {
 								    "url": "http://<pulse-host>:<port>",
 								    "api_key": "...",
 								    "worker_id": "...",
 								    "timeout": 45
 								  },
 								  "auth": {
 								    "allowed_groups": ["admin"]
 								  },
 								  "hosts": [
 								    { "name": "large1",               "prometheus_instance": "10.10.10.2:9100" },
 								    { "name": "compute-storage-01",   "prometheus_instance": "10.10.10.4:9100" },
 								    { "name": "micro1",               "prometheus_instance": "10.10.10.8:9100" },
 								    { "name": "monitor-02",           "prometheus_instance": "10.10.10.9:9100" },
 								    { "name": "compute-storage-gpu-01", "prometheus_instance": "10.10.10.10:9100" },
 								    { "name": "storage-01",           "prometheus_instance": "10.10.10.11:9100" }
 								  ],
 								  "monitor": {
 								    "poll_interval": 120,
 								    "failure_threshold": 2,
 								    "cluster_threshold": 3,
 								    "ping_hosts": [
 								      { "name": "pbs", "ip": "10.10.10.3" }
 								    ]
 								  }
 								}
 								```
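A minimal sketch of how either process might load and sanity-check this file is shown below. The function names and the list of required sections are assumptions made for illustration; the actual loading code in `app.py` and `monitor.py` may differ.

```python
import json
from pathlib import Path

# Assumed set of mandatory top-level sections; adjust to what the code actually reads.
REQUIRED_SECTIONS = ("database", "prometheus", "unifi", "ticket_api", "hosts", "monitor")


def missing_sections(cfg):
    """Return the required top-level keys that are absent from cfg, in declaration order."""
    return [key for key in REQUIRED_SECTIONS if key not in cfg]


def load_config(path="config.json"):
    """Read config.json from the working directory and fail early if a section is missing."""
    cfg = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = missing_sections(cfg)
    if missing:
        raise KeyError("config.json is missing: " + ", ".join(missing))
    return cfg


def instance_map(cfg):
    """Map Prometheus instance labels (ip:port) to display hostnames from the hosts list."""
    return {h["prometheus_instance"]: h["name"] for h in cfg["hosts"]}
```

With the example file above, `instance_map(load_config())["10.10.10.2:9100"]` would give `"large1"`.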
 								### Key Config Fields
 								| Key | Description |
 								|---|---|
 								| `database.*` | MariaDB credentials (LXC 149 at 10.10.10.50) |
 								| `prometheus.url` | Prometheus base URL |
 								| `unifi.controller` | UniFi controller base URL (HTTPS; the self-signed certificate is not verified) |
 								| `unifi.api_key` | UniFi API key from controller Settings → API |
 								| `unifi.site_id` | UniFi site ID (default: `default`) |
 								| `ticket_api.api_key` | Tinker Tickets bearer token |
 								| `pulse.url` | Pulse worker API base URL (for SSH relay) |
 								| `pulse.worker_id` | Which Pulse worker runs ethtool collection |
 								| `pulse.timeout` | Max seconds to wait for SSH collection per host |
 								| `auth.allowed_groups` | Authelia groups that may access Gandalf |
 								| `hosts` | Maps Prometheus instance labels → display hostnames |
 								| `monitor.poll_interval` | Seconds between full check cycles (default: 120) |
 								| `monitor.failure_threshold` | Consecutive failures before creating ticket (default: 2) |
 								| `monitor.cluster_threshold` | Hosts with failures to trigger cluster-wide P1 (default: 3) |
 								| `monitor.ping_hosts` | Hosts checked only by ping (no node_exporter) |
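The two threshold settings can be read as a per-interface streak counter plus a per-host tally. The sketch below is one possible interpretation, not the logic in `monitor.py`: the `FailureTracker` class is hypothetical, and it assumes a host only counts toward the cluster-wide P1 once one of its interfaces has reached `failure_threshold`.

```python
from collections import defaultdict


class FailureTracker:
    """Tracks consecutive failed polls per (host, interface)."""

    def __init__(self, failure_threshold=2, cluster_threshold=3):
        self.failure_threshold = failure_threshold
        self.cluster_threshold = cluster_threshold
        self._streak = defaultdict(int)  # (host, iface) -> consecutive failed polls

    def record(self, host, iface, is_up):
        """Record one poll result; return True exactly when a ticket should be opened."""
        key = (host, iface)
        if is_up:
            self._streak.pop(key, None)  # recovery resets the streak
            return False
        self._streak[key] += 1
        return self._streak[key] == self.failure_threshold

    def failing_hosts(self):
        """Hosts with at least one interface at or past the failure threshold."""
        return sorted({h for (h, _), n in self._streak.items()
                       if n >= self.failure_threshold})

    def cluster_alert(self):
        """True when enough hosts are failing to warrant a cluster-wide P1."""
        return len(self.failing_hosts()) >= self.cluster_threshold
```

With the defaults and a 120-second `poll_interval`, an interface has to be seen down on two consecutive cycles before a ticket is created.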
 								---
 								## Deployment (LXC 157)
 								### 1. Database: MariaDB LXC 149 (`10.10.10.50`)
 								```sql
 								CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
 								CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
 								GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
 								FLUSH PRIVILEGES;
 								```
 								Import schema:
 								```bash
 								mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
 								```
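After the import, connectivity from the application host can be checked with a few lines of Python. This is a sketch that assumes the PyMySQL driver is installed; `connect_kwargs()` and `list_tables()` are hypothetical helpers, not functions from `db.py`. Note that the config key `database.name` maps to the driver's `database` argument.

```python
def connect_kwargs(db_cfg):
    """Translate the config.json "database" section into pymysql.connect() keyword arguments."""
    return {
        "host": db_cfg["host"],
        "port": int(db_cfg.get("port", 3306)),
        "user": db_cfg["user"],
        "password": db_cfg["password"],
        "database": db_cfg["name"],   # config uses "name", the driver expects "database"
        "charset": "utf8mb4",         # matches the CREATE DATABASE statement above
    }


def list_tables(db_cfg):
    """Connect once and return the table names, as a quick post-import check."""
    import pymysql  # assumption: PyMySQL is the driver in use

    conn = pymysql.connect(**connect_kwargs(db_cfg))
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW TABLES")
            return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()
```

An empty list from `list_tables()` would suggest the schema import did not run against the `gandalf` database.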
 								### 2. LXC 157: Install dependencies
 								```bash
 								pip3 install -r requirements.txt
 								# Ensure sshpass is available (used by deploy scripts)
 								apt install sshpass
 								```
 								### 3. Deploy files
 								```bash
 								# From dev machine / root/code/gandalf:
 								for f in app.py db.py monitor.py config.json schema.sql \
 								          static/style.css static/app.js \
 								          templates/*.html; do
 								  sshpass -p 'yourpass' scp -o StrictHostKeyChecking=no \
 								    "$f" "root@10.10.10.61:/var/www/html/prod/$f"
 								done
 								# Then restart both services (run this on LXC 157, not on the dev machine):
 								systemctl restart gandalf gandalf-monitor
 								```
 								### 4. systemd services
**`gandalf.service`** (Flask/gunicorn web app):

```ini
[Unit]
Description=Gandalf Web Dashboard
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
Restart=always

[Install]
WantedBy=multi-user.target
```
**`gandalf-monitor.service`** (background polling daemon):

```ini
[Unit]
Description=Gandalf Network Monitor Daemon
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 monitor.py
Restart=always

[Install]
WantedBy=multi-user.target
```
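
Before copying the unit files into place, it can help to confirm that each one carries the keys shown above. The following is a minimal sketch and not part of GANDALF: `missing_unit_keys` and `REQUIRED` are hypothetical names, and it assumes simple unit files without repeated keys, because Python's `configparser` rejects duplicate keys in its default strict mode.

```python
import configparser

# Keys this README's unit files set in each section (assumed to be the
# minimum worth checking; adjust to taste).
REQUIRED = {
    "Unit": ["Description", "After"],
    "Service": ["Type", "WorkingDirectory", "ExecStart", "Restart"],
    "Install": ["WantedBy"],
}


def missing_unit_keys(unit_text: str) -> list[str]:
    """Return the sections or keys from REQUIRED that unit_text lacks."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep key case; systemd keys are case-sensitive
    parser.read_string(unit_text)

    missing = []
    for section, keys in REQUIRED.items():
        if not parser.has_section(section):
            missing.append(f"[{section}]")
            continue
        for key in keys:
            if not parser.has_option(section, key):
                missing.append(f"{section}.{key}")
    return missing


MONITOR_UNIT = """\
[Unit]
Description=Gandalf Network Monitor Daemon
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 monitor.py
Restart=always

[Install]
WantedBy=multi-user.target
"""

if __name__ == "__main__":
    print(missing_unit_keys(MONITOR_UNIT) or "unit looks complete")
```

This only checks that keys are present. It does not validate their values, and it is no substitute for letting systemd itself load the unit.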
### 5. Authelia rule (LXC 167)
```yaml
access_control:
  rules:
    - domain: gandalf.lotusguild.org
      policy: one_factor
      subject:
        - group:admin
```
```bash
systemctl restart authelia
```
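
To make the effect of the rule concrete, here is a simplified model of how such a rule might be evaluated. It is a sketch under stated assumptions, not Authelia code: `Rule` and `resolve_policy` are hypothetical names, rules are assumed to be checked in order with the first match winning, and the fallback policy is assumed to be `deny`. Real Authelia rules also support wildcards, networks, resources, and richer subject expressions, all of which are ignored here.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    domain: str
    policy: str
    # Groups allowed by the rule; an empty set means "any subject".
    groups: set[str] = field(default_factory=set)


def resolve_policy(rules, domain, user_groups, default_policy="deny"):
    """Return the policy of the first rule matching domain and groups."""
    for rule in rules:
        if rule.domain != domain:
            continue
        if rule.groups and not (rule.groups & set(user_groups)):
            continue  # subject does not match, try the next rule
        return rule.policy
    return default_policy  # assumed fallback when no rule matches


# The single rule from the YAML above.
RULES = [Rule("gandalf.lotusguild.org", "one_factor", {"admin"})]

if __name__ == "__main__":
    print(resolve_policy(RULES, "gandalf.lotusguild.org", {"admin"}))
    print(resolve_policy(RULES, "gandalf.lotusguild.org", {"users"}))
```

Under these assumptions, a user in the `admin` group gets `one_factor`, while anyone else falls through to the default policy.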
### 6. NPM reverse proxy
- **Domain:** `gandalf.lotusguild.org`
- **Forward to:** `http://10.10.10.61:8000` (gunicorn direct, no nginx needed on LXC)
- **Forward Auth:** Authelia at `http://10.10.10.167:9091`
- **WebSockets:** Not required
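
With forward auth enabled, the proxy can pass the authenticated identity to the app in request headers. The sketch below shows one way an app might enforce the admin-group requirement from those headers. It assumes the proxy forwards Authelia's `Remote-User` and `Remote-Groups` headers, with groups comma-separated. `groups_from_headers` and `is_admin` are hypothetical helpers, not GANDALF's actual `@require_auth` implementation.

```python
from typing import Mapping


def groups_from_headers(headers: Mapping[str, str]) -> set[str]:
    """Parse the comma-separated Remote-Groups header into a set."""
    raw = headers.get("Remote-Groups", "")
    return {group.strip() for group in raw.split(",") if group.strip()}


def is_admin(headers: Mapping[str, str], required_group: str = "admin") -> bool:
    """True only if a user is present and belongs to required_group."""
    if not headers.get("Remote-User"):
        return False  # no authenticated user was forwarded
    return required_group in groups_from_headers(headers)


if __name__ == "__main__":
    sample = {"Remote-User": "alice", "Remote-Groups": "users, admin"}
    print(is_admin(sample))
```

A plain `dict` is case-sensitive, whereas Flask's `request.headers` is not, so the lookup keys may need adjusting outside Flask. Header-based checks like this are only trustworthy when the app cannot be reached except through the authenticating proxy.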
 								---
 								## Service Management
 								```bash
-												feat: inspector page, link debug enhancements, security hardening

- Add /inspector page: visual model-accurate switch chassis diagrams
  (USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks
  with color coding (green=up, amber=PoE, cyan=uplink, grey=down),
  detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side

- Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max,
  collapsible host/switch panels with sessionStorage persistence

- monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch
  port; PulseClient uses requests.Session() for HTTP keep-alive; add
  shlex.quote() around interface names (defense-in-depth)

- Security: suppress buttons use data-* attrs + delegated click handler
  instead of inline onclick with Jinja2 variable interpolation; remove
  | safe filter from user-controlled fields in suppressions.html;
  setDuration() takes explicit el param instead of implicit event global

- db.py: thread-local connection reuse with ping(reconnect=True) to
  avoid a new TCP handshake per query

- .gitignore: add config.json (contains credentials), __pycache__

- README: full rewrite covering architecture, all 4 pages, alert logic,
  config reference, deployment, troubleshooting, security notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-03 15:39:48 -05:00
+								# Status
 								systemctl status gandalf gandalf-monitor
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
-												feat: inspector page, link debug enhancements, security hardening

- Add /inspector page: visual model-accurate switch chassis diagrams
  (USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks
  with color coding (green=up, amber=PoE, cyan=uplink, grey=down),
  detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side

- Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max,
  collapsible host/switch panels with sessionStorage persistence

- monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch
  port; PulseClient uses requests.Session() for HTTP keep-alive; add
  shlex.quote() around interface names (defense-in-depth)

- Security: suppress buttons use data-* attrs + delegated click handler
  instead of inline onclick with Jinja2 variable interpolation; remove
  | safe filter from user-controlled fields in suppressions.html;
  setDuration() takes explicit el param instead of implicit event global

- db.py: thread-local connection reuse with ping(reconnect=True) to
  avoid a new TCP handshake per query

- .gitignore: add config.json (contains credentials), __pycache__

- README: full rewrite covering architecture, all 4 pages, alert logic,
  config reference, deployment, troubleshooting, security notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-03 15:39:48 -05:00
+								# Logs (live)
-												Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-01 23:03:18 -05:00
+								journalctl -u gandalf -f
-												feat: inspector page, link debug enhancements, security hardening

- Add /inspector page: visual model-accurate switch chassis diagrams
  (USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks
  with color coding (green=up, amber=PoE, cyan=uplink, grey=down),
  detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side

- Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max,
  collapsible host/switch panels with sessionStorage persistence

- monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch
  port; PulseClient uses requests.Session() for HTTP keep-alive; add
  shlex.quote() around interface names (defense-in-depth)

- Security: suppress buttons use data-* attrs + delegated click handler
  instead of inline onclick with Jinja2 variable interpolation; remove
  | safe filter from user-controlled fields in suppressions.html;
  setDuration() takes explicit el param instead of implicit event global

- db.py: thread-local connection reuse with ping(reconnect=True) to
  avoid a new TCP handshake per query

- .gitignore: add config.json (contains credentials), __pycache__

- README: full rewrite covering architecture, all 4 pages, alert logic,
  config reference, deployment, troubleshooting, security notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
											2026-03-03 15:39:48 -05:00
+								journalctl -u gandalf-monitor -f

# Restart after code or config changes
systemctl restart gandalf gandalf-monitor
```

---

## Troubleshooting

### Monitor not creating tickets

- Verify `config.json → ticket_api.api_key` is set and valid
- Check `journalctl -u gandalf-monitor` for `Ticket creation failed` lines
- Confirm the Tinker Tickets API is reachable from LXC 157

### Link Debug shows no data / "Loading…" forever

- Check that `gandalf-monitor.service` is running and has completed at least one cycle
- Check `journalctl -u gandalf-monitor` for Prometheus or UniFi errors
- Verify Prometheus is reachable: `curl 'http://10.10.10.48:9090/api/v1/query?query=up'`

### Link Debug: SFP DOM panel missing

- SFP data requires the Pulse worker plus SSH access to hosts
- Verify `config.json → pulse.*` is configured and the Pulse worker is running
- Confirm `sshpass` is installed and SSH access works from the Pulse worker to each Proxmox host
- Only interfaces with physical SFP modules return DOM data (`ethtool -m`)

### Inspector: path debug section not appearing

- Requires LLDP: run `apt install lldpd && systemctl enable --now lldpd` on each server
- The LLDP `system_name` broadcast by `lldpd` must match the hostname in `config.json → hosts[].name`
  - Override: `echo 'configure system hostname large1' > /etc/lldpd.d/hostname.conf && systemctl restart lldpd`
- Allow up to 2 poll cycles (240s) after installing lldpd for the LLDP table to populate

### Inspector: switch chassis shows as flat list (no layout)

- The switch's `model` field from UniFi doesn't match any key in `SWITCH_LAYOUTS` in `inspector.html`
- Check the UniFi API: the model appears in the `link_stats` API response under `unifi_switches.<name>.model`
- Add the model key to `SWITCH_LAYOUTS` in `inspector.html` with the correct row/SFP layout

### Baseline re-initializing on every restart

- `interface_baseline` is stored in the `monitor_state` DB table and survives restarts
- If it appears to reset, check DB connectivity from the monitor daemon

### Interface stuck at "initial_down" forever

- This means the interface was down when the monitor first saw it
- It will begin tracking once it comes up. To clear it manually, reset the stored baseline
  (this statement resets the baseline for every interface, not just the stuck one):
  ```sql
  -- In MariaDB on 10.10.10.50:
  UPDATE monitor_state SET value='{}' WHERE key_name='interface_baseline';
  ```
  Then restart the monitor: `systemctl restart gandalf-monitor`
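
The `initial_down` behaviour can be pictured as a small per-interface state machine. The following is a minimal sketch under stated assumptions, not the code in `monitor.py`: the function name `observe`, the `"up"`/`"down"` state strings, and the `host:iface` key format are hypothetical; only the `initial_down` state and the "alert on UP→DOWN only" rule come from this README.

```python
def observe(baseline, key, is_up):
    """Record one observation and return (state, should_alert).

    baseline is a plain dict standing in for the interface_baseline blob.
    """
    previous = baseline.get(key)
    if previous is None:
        # First sighting: a down interface is recorded but never alerted on.
        state = "up" if is_up else "initial_down"
        alert = False
    elif is_up:
        # Coming up (from any state) starts or resumes normal tracking.
        state = "up"
        alert = False
    elif previous == "up":
        # The only alerting transition: a known-good interface went down.
        state = "down"
        alert = True
    else:
        # Still initial_down, or still down after an earlier alert.
        state = previous
        alert = False
    baseline[key] = state
    return state, alert


baseline = {}
print(observe(baseline, "newhost:eth1", False))  # first seen down
print(observe(baseline, "newhost:eth1", True))   # comes up, tracking starts
print(observe(baseline, "newhost:eth1", False))  # regression
```

In this model an interface that never comes up stays in `initial_down` indefinitely, which matches the symptom described in this section.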

### Prometheus data missing for a host

```bash
# On the affected host:
systemctl status prometheus-node-exporter
# Verify it's scraped (quote the URL so the shell does not interpret the "?"):
curl -s 'http://10.10.10.48:9090/api/v1/query?query=up' | jq '.data.result[] | select(.metric.job=="node")'
```
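
The same check can be scripted when several hosts need verifying at once. This is a rough sketch that classifies each expected `prometheus_instance` from the JSON returned by the `up` query; the function name `scrape_status` and the sample addresses are hypothetical, and the payload shape follows the standard Prometheus instant-query response.

```python
def scrape_status(payload, expected_instances, job="node"):
    """Map each expected instance to "up", "down", or "missing"."""
    seen = {}
    for sample in payload.get("data", {}).get("result", []):
        metric = sample.get("metric", {})
        if metric.get("job") != job:
            continue
        # Each sample value is [unix_timestamp, "<value as string>"].
        seen[metric.get("instance")] = sample["value"][1] == "1"

    status = {}
    for instance in expected_instances:
        if instance not in seen:
            status[instance] = "missing"  # Prometheus has no such target
        else:
            status[instance] = "up" if seen[instance] else "down"
    return status


# Hypothetical payload, shaped like the output of the curl command above.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "node", "instance": "10.10.10.2:9100"},
             "value": [1700000000, "1"]},
            {"metric": {"job": "node", "instance": "10.10.10.3:9100"},
             "value": [1700000000, "0"]},
        ],
    },
}
print(scrape_status(sample_payload, ["10.10.10.2:9100", "10.10.10.3:9100"]))
```

A `missing` result points at the Prometheus scrape config, while `down` points at the exporter on the host itself.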

---

## Development Notes

### File Layout

```
gandalf/
├── app.py              # Flask web app (routes, auth, API endpoints)
├── monitor.py          # Background daemon (Prometheus, UniFi, Pulse, alert logic)
├── db.py               # Database operations (MariaDB via pymysql, thread-local conn reuse)
├── schema.sql          # Database schema (network_events, suppression_rules, monitor_state)
├── config.json         # Runtime configuration (not committed with secrets)
├── requirements.txt    # Python dependencies
├── static/
│   ├── style.css       # Terminal aesthetic CSS (CRT scanlines, green-on-black)
│   └── app.js          # Dashboard JS (auto-refresh, host grid, events, suppress modal)
└── templates/
    ├── base.html       # Shared layout (header, nav, footer)
    ├── index.html      # Dashboard page
    ├── links.html      # Link Debug page (server NICs + UniFi switch ports)
    ├── inspector.html  # Visual switch inspector + LLDP path debug
    └── suppressions.html # Suppression management page
```

### Adding a New Monitored Host

1. Install `prometheus-node-exporter` on the host
2. Add a scrape target to Prometheus config
3. Add an entry to `config.json → hosts`:
   ```json
   { "name": "newhost", "prometheus_instance": "10.10.10.X:9100" }
   ```
4. Restart monitor: `systemctl restart gandalf-monitor`
5. For SFP DOM / ethtool: ensure the host is SSH-accessible from the Pulse worker

### Adding a New Switch Layout (Inspector)

Find the UniFi model code for the switch (it appears in the `/api/links` JSON response
under `unifi_switches.<switch_name>.model`), then add to `SWITCH_LAYOUTS` in
`templates/inspector.html`:

```javascript
'MYNEWMODEL': {
  rows: [[1,2,3,4,5,6,7,8], [9,10,11,12,13,14,15,16]],  // port_idx by row
  sfp_section: [17, 18],  // separate SFP cage ports (rendered below rows)
  sfp_ports: [],          // port_idx values that are SFP-type within rows
},
```

### Database Schema Notes

- `network_events`: one row per active event; `resolved_at` is set when recovered
- `suppression_rules`: `active=FALSE` when removed; `expires_at` checked at query time
- `monitor_state`: key/value store; `interface_baseline` and `link_stats` are JSON blobs
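
Checking `expires_at` at query time means no background job has to flip `active` when a timed suppression runs out. A minimal sketch of that check follows, assuming rules arrive as dicts with the `active` and `expires_at` columns and that a manual (untimed) suppression stores `expires_at` as `NULL`; the function name `rule_is_live` and the `NULL` convention are assumptions, not confirmed details of `db.py`.

```python
from datetime import datetime, timedelta


def rule_is_live(rule, now):
    """Return True if a suppression rule should still suppress alerts."""
    if not rule["active"]:
        # Removed rules are kept in the table with active=FALSE.
        return False
    expires_at = rule.get("expires_at")
    # Assumption: None (SQL NULL) means a manual rule with no expiry.
    return expires_at is None or expires_at > now


now = datetime(2026, 3, 3, 12, 0, 0)
timed = {"active": True, "expires_at": now + timedelta(minutes=30)}
expired = {"active": True, "expires_at": now - timedelta(minutes=1)}
print(rule_is_live(timed, now), rule_is_live(expired, now))
```

The expired rule above still has `active=TRUE` in storage; it simply stops matching once the clock passes `expires_at`.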

### Security Notes

- **XSS prevention**: all user-controlled data in dynamically generated HTML uses
  `escHtml()` (JS) or Jinja2 auto-escaping (Python). Suppress buttons use `data-*`
  attributes plus a single delegated click listener rather than inline `onclick` with
  interpolated strings.
- **Interface name validation**: `monitor.py` validates SSH interface names against
  `^[a-zA-Z0-9_.@-]+$` before use, and additionally wraps them with `shlex.quote()`
  for defense-in-depth.
- **DB parameters**: all SQL uses parameterised queries via pymysql, with no string
  concatenation into SQL.
- **Auth**: Authelia enforces admin-only access at the nginx/LXC 167 layer; the Flask
  app additionally checks the `Remote-User` header via `@require_auth`.
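
The interface-name check combines an allowlist with shell quoting. The sketch below illustrates that pattern; it is not the code from `monitor.py`, and the function name `build_ethtool_command` is hypothetical. It uses `fullmatch` because in Python `$` also matches just before a trailing newline, so `match` alone would accept `"eth0\n"`.

```python
import re
import shlex

# Same character allowlist as documented above.
IFACE_RE = re.compile(r"^[a-zA-Z0-9_.@-]+$")


def build_ethtool_command(iface):
    """Return a shell-safe `ethtool -m` command for one interface name."""
    if not IFACE_RE.fullmatch(iface):
        raise ValueError(f"rejected interface name: {iface!r}")
    # Quoting is redundant for names that pass the allowlist, but it keeps
    # the command safe if the pattern is ever loosened.
    return f"ethtool -m {shlex.quote(iface)}"


print(build_ethtool_command("enp1s0f0"))
try:
    build_ethtool_command("eth0; rm -rf /")
except ValueError as exc:
    print(exc)
```

Either layer alone would stop the injection attempt in the example; keeping both means a mistake in one does not expose the shell.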

### Known Limitations

- The service runs a single gunicorn worker (`--workers 1`). `db.py` uses thread-local
  connection reuse (one connection per thread); multiple workers would each hold their
  own connections, which is fine, but the thread-local optimisation only helps within
  one worker.
- No CSRF tokens on API endpoints; this is mitigated by Authelia session cookies being
  `SameSite=Strict` and the site being admin-only.
- SSH collection via Pulse is synchronous: if Pulse is slow, the entire monitor cycle
  is delayed. The `pulse.timeout` config controls the max wait.
- UniFi LLDP data is only as fresh as the last monitor poll (120s default).
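
For readers unfamiliar with the thread-local pattern mentioned in the first limitation, here is a minimal sketch. It is not the actual `db.py`: `get_connection` and the `connect` factory argument are hypothetical, with the factory standing in for something like `lambda: pymysql.connect(...)`. The only requirement on the returned object is a `ping(reconnect=True)` method, which pymysql connections provide.

```python
import threading

_local = threading.local()


def get_connection(connect):
    """Return this thread's cached connection, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        # First query on this thread: pay for the TCP handshake once.
        conn = connect()
        _local.conn = conn
    else:
        # Reuse the cached connection, re-establishing it if it dropped.
        conn.ping(reconnect=True)
    return conn
```

Because the cache lives in `threading.local()`, two threads never share a connection, and a separate worker process starts with an empty cache of its own.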