Troubleshooting
Common issues, diagnostic commands, log locations, and escalation paths.
Last updated 17 May 2026
This page covers the most common issues administrators encounter and how to resolve them.
Appliance offline
Symptoms
- The appliance shows as "offline" in the operator portal.
- Users cannot reach the gateway API.
- No heartbeats recorded for > 5 minutes.
Diagnosis
- Physical check: verify the appliance is powered on and the network link LED is active.
- Network check: from a machine on the same VLAN, ping the appliance IP.
- DNS check: verify that appliance.<your-domain> resolves correctly.
- Outbound check: from the appliance console (if accessible), test connectivity to the central plane:
  curl -sS -o /dev/null -w "%{http_code}" \
    https://ops.<region>.operayde.com/healthz/live
  # Expected: 200
- Service check: on the appliance console:
  systemctl status operayde-gateway
  systemctl status operayde-audit-emitter
  systemctl status operayde-tunnel-agent
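The service check can be scripted so that only the units needing attention are printed. The following is a minimal sketch, assuming it runs on the appliance console with systemd available; the function name failed_services is illustrative and is not part of any Operayde tooling.

```shell
# Print the name of each appliance service that is not currently active.
# The unit names are the three listed in the service check above.
failed_services() {
  for svc in operayde-gateway operayde-audit-emitter operayde-tunnel-agent; do
    # "is-active --quiet" prints nothing and exits non-zero when the unit is not running
    systemctl is-active --quiet "$svc" || echo "$svc"
  done
}
```

Calling failed_services with no arguments prints nothing when all three units are running. Any name it prints is a reasonable starting point for systemctl status and journalctl -u.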
Common causes
| Cause | Fix |
|---|---|
| Network cable disconnected | Reseat the cable; check switch port status |
| Firewall blocking outbound 443 | Add rule per deployment guide |
| DNS resolution failure | Verify DNS config; check /etc/resolv.conf on appliance |
| NTP drift > 30 seconds | Fix NTP; the appliance rejects JWTs with clock skew > 30s |
| Disk full | Check with df -h; clear old rotated logs in /var/log/operayde/ |
| Service crashed | Check journalctl -u operayde-gateway --since "10 min ago" |
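The 30-second clock-skew limit in the table can be checked with simple arithmetic once a reference timestamp is available. This is a sketch only: the function name clock_skew_exceeded is hypothetical, and how the reference time is obtained is left to the reader because the source does not say which time service the appliance uses.

```shell
# Succeed (exit 0) when two epoch timestamps differ by more than 30 seconds,
# the skew beyond which the appliance is described as rejecting JWTs.
# $1 = local epoch seconds, $2 = reference epoch seconds (both supplied by the caller).
clock_skew_exceeded() {
  skew=$(( $1 - $2 ))
  if [ "$skew" -lt 0 ]; then
    skew=$(( 0 - $skew ))   # take the absolute value
  fi
  [ "$skew" -gt 30 ]
}
```

A typical call would be clock_skew_exceeded "$(date +%s)" "$REFERENCE_EPOCH", where REFERENCE_EPOCH is a placeholder for a timestamp taken from a time source you trust.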
High latency
Symptoms
- Chat completions take > 5 seconds for short prompts.
- Users report slow responses.
- Gateway metrics show elevated p99 latency.
Diagnosis
# Check GPU utilisation (Starter/Pro only)
nvidia-smi
# Check CPU and memory
top -bn1 | head -20
# Check inference queue depth
curl -s http://127.0.0.1:9090/metrics | grep operayde_inference_queue_depth
# Check active sessions
curl -s http://127.0.0.1:9090/metrics | grep operayde_active_sessions
Common causes
| Cause | Fix |
|---|---|
| GPU memory exhausted | Reduce concurrent sessions or upgrade tier |
| CPU thermal throttling | Check airflow; verify ambient temperature < 35 °C |
| Too many concurrent requests | Reduce RPM limits on virtual keys |
| Large context windows | Reduce max_tokens or use a smaller model |
| OPA policy evaluation slow | Check bundle size; large data documents slow eval |
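The grep commands in the diagnosis steps also match comment lines and any metric that shares a prefix. Where a single number is needed, for example in a monitoring script, a small awk filter is one option. This sketch assumes the endpoint emits Prometheus-style text (name, optional labels in braces, then the value, with no trailing timestamp), which the source does not state; metric_value is a hypothetical helper name.

```shell
# Read metrics text on stdin and print the value of the metric named in $1.
# Comment lines (starting with '#') are skipped; if the metric appears with
# several label sets, their values are summed. Exits non-zero if not found.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }
    {
      key = $1
      sub(/[{].*/, "", key)          # drop any {label="..."} suffix
      if (key == name) { sum += $NF; found = 1 }
    }
    END { if (found) print sum; else exit 1 }
  '
}
```

For example, curl -s http://127.0.0.1:9090/metrics | metric_value operayde_inference_queue_depth prints only the queue depth.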
Authentication failures
Symptoms
- Users see "401 Unauthorized" or "403 Forbidden" errors.
- Virtual key operations return authentication errors.
- SSO redirect fails or loops.
Diagnosis
# Verify virtual key is active
curl -s -H "Authorization: Bearer $KEY" \
https://appliance.example.com/v1/models
# Check key status in portal: Keys > search for the key label
# Check OPA decision for a specific key
curl -s http://127.0.0.1:8181/v1/data/operayde/virtual_keys/allow \
-d '{"input":{"action":"gateway.chat","params":{"virtual_key":{"revoked":false}}}}'
Common causes
| Cause | Fix |
|---|---|
| Key revoked | Create a new key in the portal |
| Key expired | Create a new key with a later expiry |
| RPM/TPD budget exhausted | Wait for the limit to reset or increase limits |
| Model not in allow list | Edit the key to add the requested model |
| Clock skew > 30 seconds | Fix NTP on the appliance |
| OIDC token expired | Sign out and sign in again |
| IdP group not mapped | Configure group mapping in portal settings |
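When triaging, it can help to turn the HTTP status from the /v1/models probe into a first guess at which rows of the table to check. The mapping below follows general HTTP semantics (401 for rejected credentials, 403 for an authenticated but disallowed request); it is an assumption, not documented gateway behaviour, and explain_auth_status is a hypothetical name.

```shell
# Map an HTTP status code to a suggested area to investigate.
# NOTE: the cause groupings are assumptions based on generic HTTP semantics.
explain_auth_status() {
  case "$1" in
    200) echo "key accepted" ;;
    401) echo "credentials rejected: check key revocation, key expiry, token expiry, clock skew" ;;
    403) echo "request not permitted: check model allow list, budgets, IdP group mapping" ;;
    *)   echo "unexpected status $1: check the gateway logs" ;;
  esac
}
```

The status code can be captured with curl -s -o /dev/null -w "%{http_code}" -H "Authorization: Bearer $KEY" https://appliance.example.com/v1/models and passed to the function as its only argument.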
Billing discrepancies
Symptoms
- Invoice total does not match expected usage.
- Usage dashboard shows different numbers than the invoice.
- Missing or duplicate line items.
Diagnosis
- Compare time ranges — the invoice covers a fixed billing period; the usage dashboard shows real-time data.
- Check aggregation delay — usage data from appliances is batched and may have up to a 15-minute delay.
- Verify key attribution — usage is attributed to the key that made the request. If a key was reassigned mid-period, usage splits across both owners.
Resolution
- For discrepancies > 5%, open a support ticket with the invoice ID and the date range you expect.
- Operayde support can run a reconciliation report that compares appliance-side metering with central aggregation.
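Before opening a ticket, the 5% threshold can be checked with a quick calculation. This sketch assumes the percentage is taken relative to the expected figure, which the source does not specify; the function names are illustrative.

```shell
# Print the absolute difference between invoiced and expected usage
# as a percentage of the expected figure, to two decimal places.
discrepancy_pct() {
  awk -v invoiced="$1" -v expected="$2" 'BEGIN {
    if (expected == 0) exit 1        # percentage is undefined without a baseline
    diff = invoiced - expected
    if (diff < 0) diff = -diff
    printf "%.2f\n", diff * 100 / expected
  }'
}

# Succeed (exit 0) when the discrepancy is above the 5% ticket threshold.
needs_ticket() {
  pct=$(discrepancy_pct "$1" "$2") || return 2
  awk -v p="$pct" 'BEGIN { exit !(p > 5) }'
}
```

For an invoice of 1060 against an expected 1000, discrepancy_pct prints 6.00 and needs_ticket succeeds; for 1040 against 1000 it prints 4.00 and needs_ticket fails.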
Diagnostic commands
Appliance health
# Overall health
curl -s https://appliance.example.com/v1/health | jq .
# Detailed metrics (localhost only)
curl -s http://127.0.0.1:9090/metrics | grep operayde_
# Gateway logs (last 100 lines)
journalctl -u operayde-gateway -n 100 --no-pager
# Audit emitter logs
journalctl -u operayde-audit-emitter -n 100 --no-pager
# Tunnel agent logs (central plane connectivity)
journalctl -u operayde-tunnel-agent -n 100 --no-pager
OPA policy debugging
# Check which bundle is loaded
curl -s http://127.0.0.1:8181/v1/policies | jq '.result[].id'
# Evaluate a test decision
curl -s -X POST http://127.0.0.1:8181/v1/data/operayde/rbac/tenant/allow \
-H "Content-Type: application/json" \
-d '{
"input": {
"principal": {
"realm": "tenant:YOUR_TENANT_ID",
"groups": ["tenant-admin"]
},
"action": "config.list-virtual-keys",
"params": {
"tenant_id": "YOUR_TENANT_ID"
}
}
}' | jq .
Network diagnostics
# Test central plane connectivity
curl -sS -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
https://ops.<region>.operayde.com/healthz/live
# Test DNS resolution
dig appliance.<your-domain>
# Check open connections
ss -tnp | grep operayde
Log locations
| Log | Location | Rotation |
|---|---|---|
| Gateway | journalctl -u operayde-gateway | systemd journal, 500 MB max |
| Audit emitter | journalctl -u operayde-audit-emitter | systemd journal, 500 MB max |
| Tunnel agent | journalctl -u operayde-tunnel-agent | systemd journal, 500 MB max |
| OPA | journalctl -u operayde-opa | systemd journal, 200 MB max |
| System | /var/log/syslog | logrotate, 7 days |
| Inference engine | /var/log/operayde/inference.log | logrotate, 1 GB max, 3 rotations |
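To see quickly which journal-backed component is logging errors, the units in the table can be scanned in one pass. This is a sketch assuming standard journalctl options; it counts output lines rather than journal entries, so multi-line messages are counted more than once, and error_counts is a hypothetical name.

```shell
# For each journal-backed unit in the table, print the unit name followed by
# the number of error-priority (or worse) log lines since the time given in $1.
error_counts() {
  for unit in operayde-gateway operayde-audit-emitter operayde-tunnel-agent operayde-opa; do
    # -p err keeps priority "err" and above; -q suppresses informational messages
    count=$(journalctl -u "$unit" --since "$1" -p err --no-pager -q | wc -l | tr -d ' ')
    echo "$unit $count"
  done
}
```

For example, error_counts "1 hour ago" prints four lines, one per unit.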
Escalation path
| Severity | Description | Response time | Channel |
|---|---|---|---|
| P1 — Critical | Appliance down, data loss risk, security incident | 1 hour | Emergency hotline + portal ticket |
| P2 — High | Degraded performance, partial outage, auth failure | 4 hours | Portal ticket |
| P3 — Medium | Non-critical bug, billing question, config assistance | 1 business day | Portal ticket or email |
| P4 — Low | Feature request, documentation question | 3 business days | Portal ticket or email |
How to open a ticket
- Go to Settings > Support in the portal.
- Select the severity level.
- Describe the issue with as much detail as possible.
- Attach relevant logs (use the Collect diagnostics button to generate a support bundle automatically).
Collecting a support bundle
From the portal: Appliances > [your appliance] > Collect diagnostics.
This generates a tarball containing:
- Gateway, audit, and tunnel agent logs (last 24 hours)
- OPA bundle metadata (not the policies themselves)
- System resource usage snapshots
- Network connectivity test results
- Anonymised configuration (no secrets or keys)
The bundle is uploaded to Operayde support automatically. You can also download it for your own review.
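A downloaded bundle can be reviewed without unpacking it to disk. The sketch below assumes a tar implementation that detects compression automatically when reading (GNU tar and bsdtar both do); the bundle's file name and internal layout are not given in the source, so the path is whatever you saved the download as, and both function names are illustrative.

```shell
# List the files inside a support bundle.
list_bundle() {
  tar -tf "$1"
}

# Count the lines across all files in the bundle that match pattern $2,
# e.g. to confirm that a string you consider sensitive does not appear.
bundle_grep_count() {
  tar -xOf "$1" | grep -c -- "$2"
}
```

bundle_grep_count prints 0 and exits non-zero when nothing matches, which is the outcome you would hope for when searching for key material.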