After fixing the is_new guard bug, is_new is no longer used inside
_ticket_interface, _ticket_unifi, or _ticket_unreachable. Drop it from
their signatures and call sites.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
monitor.py: _ticket_interface/_ticket_unifi/_ticket_unreachable all used
`if tid and is_new` to guard db.set_ticket_id(). Since is_new is True only
on the first upsert (consec=1) but tickets are created at consec>=fail_thresh
(default 2), is_new is always False when the ticket is created, so the
ticket link never appeared in the UI. Changed to `if tid:`.
links.html: JSON.parse(sessionStorage.getItem(...)) in togglePanel and
restoreCollapseState had no try-catch. Corrupt/stale session storage would
throw an uncaught SyntaxError. Also wrapped all sessionStorage.setItem
calls in try-catch to defend against storage-full / private-browsing errors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
monitor.py _ssh_batch(): the remote command was wrapped in double-quotes
(f'root@{ip} "{shell_cmd}"') but shell_cmd itself contains double-quoted
echo sentinels ("___IFACE:eth0___"). When Pulse's shell parses the full
ssh invocation, the nested double-quotes cause mis-parsing — the remote
command is split incorrectly, silently breaking all ethtool/SFP DOM
collection. Fix: use shlex.quote(shell_cmd) so the entire remote command
is single-quoted, leaving inner double-quotes untouched.
TicketClient.create(): data['ticket_id'] raises KeyError if the Tinker
Tickets API returns success=true without a ticket_id field (malformed
response). Use data.get('ticket_id') with an explicit warning log.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
db.py returned all datetime columns (first_seen, last_seen, resolved_at,
created_at, expires_at) as bare ISO strings like "2026-03-14T14:14:21"
with no timezone marker. Per the ECMAScript spec, new Date() on a
datetime string without timezone treats it as LOCAL time, not UTC.
This made lt.time.ago() and stale-detection wrong for any user whose
browser is not in UTC — event ages and stale warnings would be off by
the client's UTC offset.
monitor.py had the same issue on the network_snapshot 'updated' field.
Fix: append 'Z' to all isoformat() calls (UTC datetimes confirmed by
MySQL server timezone and _now_utc() pattern used throughout codebase).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- app.py: validate server_name from LLDP with fullmatch before use in logs/lookups (prevents log injection)
- app.py: validate each mgmt_ip candidate before assigning host_ip (avoids assigning non-IP string that then fails later check)
- app.py: log actual exception in link_stats JSON parse error
- inspector.html: clear _diagPollTimer in closePanel() so timer doesn't orphan when panel is closed mid-poll
- monitor.py: sleep 30s after a monitor loop exception before resuming normal poll interval
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- app.py: replace raw str(e) in diagnostic _run() with generic client message; log internally only
- app.py: /health endpoint no longer leaks exception strings to unauthenticated callers; errors logged server-side
- monitor.py: UniFi SSL verification now defaults True, configurable via config.json unifi.verify_ssl; urllib3 warning suppression scoped to verify=False only (removed global disable)
- monitor.py: Pulse execution_id extracted with .get() + explicit None check to avoid KeyError on malformed response
- monitor.py: interface name regex drops '@' (not a valid kernel interface char) to match app.py and fix inconsistency
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Architecture:
- Remove direct subprocess ping from Gandalf; add PulseClient.ping()
which runs the ping via the Pulse worker instead
- Remove standalone ping() function and subprocess import from monitor.py
- Add self.pulse alias to NetworkMonitor for convenience
- Both _process_ping_hosts() and snapshot builder now use self.pulse.ping()
Security:
- Change StrictHostKeyChecking=no → accept-new in both SSH command
builders (monitor.py _ssh_batch, diagnose.py build_ssh_command).
The Pulse worker's known_hosts is now authoritative; host keys are
recorded on first connection and verified on all subsequent ones.
MITM attacks after initial key exchange are now detectable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add /api/avatar endpoint querying lldap for user jpegPhoto; disk cache
with sentinel pattern avoids repeat LDAP hits for users without photos
- Add ldap3 dependency and ldap config block to config.json
- Wire lt-avatar img overlay in base.html with capture-phase error
fallback (lt-avatar-img-err) to reveal initials when image is absent
- Fix lt-avatar CSS shim: position:relative + absolute inset on img
(local base.css was missing these; added to style.css)
- Replace all empty-state paragraphs with proper lt-empty-state markup
(icon + title + body) across index, suppressions, inspector, app.js
- Add lt-spinner--cyan next to refresh button; shows during refreshAll()
- Replace inspector panel-section-title with lt-divider throughout
- Add data-tooltip attributes to SFP DOM metrics, TX/RX/Carrier/Duplex/
Auto-neg/Error labels in links.html and inspector panel
- Add tooltips to events table column headers (Sev, First Seen, Failures)
- Fix links.html host panel timestamp (was reading sample.updated which
is always undefined; now uses data.updated)
- Fix UniFi status text casing (Online→ONLINE to match server render)
- Remove dead topo-status-* class manipulation from updateTopology()
- Always render alert-count-badge; toggle display:none when count is 0
- Fix double UniFi get_devices() call in monitor.py run loop
- Fix chip-critical animation (was using green pulse-glow; now red)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add links_exclude_ips to monitor config; collect() skips any Prometheus
instance whose IP is in that list, preventing LXC containers from
appearing on the links/inspector pages as phantom hosts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _collect_snapshot() and _process_interfaces() now skip any Prometheus
instance not explicitly listed in config.json hosts[]. LXC app servers
(postgresql, matrix, etc.) report node_exporter metrics but are not
infrastructure hosts Gandalf should display or alert on.
- Add PBS (10.10.10.3) to config hosts[] with prometheus_instance;
remove from ping_hosts (node_exporter already running on PBS, now
added to Prometheus scrape config as job pbs-node).
- The _instance_map membership check is now consistent across snapshot,
alerting, and ethtool SSH collection.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
LinkStatsCollector.collect() was SSHing into every host reporting
node_network_* metrics to Prometheus, including unrelated app servers
like postgresql and matrix. Add instance_map membership check so ethtool
collection via Pulse only runs on hosts defined in config.json.
Prometheus metrics (traffic rates, errors) are still collected for all
instances — only the SSH/ethtool step is gated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
monitor.py / diagnose.py PulseClient.run_command:
- Add automatic single retry on submit failure, explicit Pulse failure
(status=failed/timed_out), and poll timeout — handles transient SSH
or Pulse hiccups without dropping the whole collection cycle
- Log execution_id and full Pulse URL on every failure so failed runs
can be found in the Pulse UI immediately
- Handle 'timed_out' and 'cancelled' Pulse statuses explicitly (previously
only 'failed' was caught; others would spin until local deadline)
- Poll every 2s instead of 1s to reduce Pulse API chatter
SSH command options (_ssh_batch + diagnose.py):
- Add BatchMode=yes: aborts immediately instead of hanging on a
password prompt if key auth fails
- Add ServerAliveInterval=10 ServerAliveCountMax=2: SSH detects a
hung remote command within ~20s instead of sitting silent until the
45s Pulse timeout expires
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
static/app.js:
- Browser tab title updates to show alert count: '(3 CRIT) GANDALF' or '(2 WARN) GANDALF'
- Stale monitoring banner: injected into .main if last_check > 15 min old,
warns operator that the monitor daemon may be down
static/style.css:
- .stale-banner: amber top-border warning strip
app.py:
- /health now checks DB connectivity and monitor freshness (last_check age)
Returns 503 + degraded status if DB unreachable or monitor stale >20min
db.py:
- cleanup_expired_suppressions(): marks time-limited suppressions inactive when
expires_at <= NOW() (was only filtered in SELECTs, never marked inactive)
- purge_old_resolved_events(days=90): deletes old resolved events to prevent
unbounded table growth
monitor.py:
- Calls cleanup_expired_suppressions() and purge_old_resolved_events() each cycle
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
app.py:
- Context processor injects config.ticket_api.web_url into all templates
(falls back to 'http://t.lotusguild.org/ticket/' if not set in config)
templates/base.html:
- Inject GANDALF_CONFIG JS global with ticket_web_url before app.js loads
static/app.js:
- Use GANDALF_CONFIG.ticket_web_url instead of hardcoded domain
templates/index.html:
- Use {{ config.ticket_api.web_url }} Jinja var instead of hardcoded domain
monitor.py:
- CLUSTER_NAME constant kept as default; NetworkMonitor now reads cluster_name
from config monitor.cluster_name, falling back to the constant
- All CLUSTER_NAME references inside class methods replaced with self.cluster_name
templates/inspector.html:
- pollDiagnostic() .catch() now clears interval and shows error message instead
of silently ignoring network failures during active polling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
db.py:
- Add check_suppressed(suppressions, ...) for in-memory suppression lookups
against pre-loaded list (eliminates N*M DB queries per monitoring cycle)
- get_baseline(): log error instead of silently swallowing JSON parse failure
monitor.py:
- Load active suppressions once per cycle at the top of the alert loop
- Pass suppressions list to _process_interfaces, _process_unifi, _process_ping_hosts
- Replace all db.is_suppressed() calls with db.check_suppressed(suppressions, ...)
- Reduces DB queries from 100-600+ per cycle down to 1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- C5: Validate host_ip (IPv4 check) and iface (allowlist regex) before SSH command builder
- H6: Upgrade Pulse failure logging from debug to error so operators see outages
- M6: Replace per-request O(n) purge with background daemon thread (runs every 2 min)
- M7: Background thread marks jobs stuck in 'running' > 5 min as errored
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds comprehensive per-port link troubleshooting triggered from the
Inspector panel when a port has an LLDP-identified server counterpart.
- diagnose.py: DiagnosticsRunner with 15-section SSH command (carrier,
operstate, sysfs counters, ethtool, ethtool -i/-a/-g/-S/-m, ip link,
ip addr, ip route, dmesg, lldpctl); parsers for all sections; health
analyzer with 14 check codes (NO_CARRIER, HALF_DUPLEX, SPEED_MISMATCH,
SFP_RX_CRITICAL, CARRIER_FLAPPING, CRC_ERRORS_HIGH, LLDP_MISMATCH, etc.)
- monitor.py: PulseClient now tracks last_execution_id so callers can
link back to the raw Pulse execution URL
- app.py: POST /api/diagnose + GET /api/diagnose/<job_id> with daemon
thread background execution and 10-minute in-memory job store
- inspector.html: "Run Link Diagnostics" button (shown only when LLDP
host is resolvable); full results panel: health banner, physical layer,
SFP/DOM with power bars, NIC error counters, collapsible ethtool -S,
flow control/ring buffers, driver info, LLDP 2-col validation,
collapsible dmesg, switch port summary, "View in Pulse" link
- style.css: all .diag-* CSS classes with terminal aesthetic
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Two-service architecture: Flask web app (gandalf.service) + background
polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>