Troubleshooting

Common issues, diagnostic commands, log locations, and escalation paths.

Last updated 17 May 2026

This page covers the most common issues administrators encounter and how to resolve them.

Appliance offline

Symptoms

  • The appliance shows as "offline" in the operator portal.
  • Users cannot reach the gateway API.
  • No heartbeats recorded for > 5 minutes.

Diagnosis

  1. Physical check — verify the appliance is powered on and the network link LED is active.

  2. Network check — from a machine on the same VLAN, ping the appliance IP.

  3. DNS check — verify appliance.<your-domain> resolves correctly.

  4. Outbound check — from the appliance console (if accessible), test connectivity to the central plane:

    curl -sS -o /dev/null -w "%{http_code}" \
      https://ops.<region>.operayde.com/healthz/live
    # Expected: 200

  5. Service check — on the appliance console:

    systemctl status operayde-gateway
    systemctl status operayde-audit-emitter
    systemctl status operayde-tunnel-agent
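The three status commands in the service check can be folded into one pass. The following is a minimal sketch, not part of the product: the function name `check_operayde_services` is hypothetical, and it assumes only the unit names listed above and the standard `systemctl is-active` subcommand.

```shell
# Hypothetical helper: print the state of each Operayde unit named above
# and return non-zero if any of them is not "active".
check_operayde_services() {
  failed=0
  for unit in operayde-gateway operayde-audit-emitter operayde-tunnel-agent; do
    # `systemctl is-active` prints the unit state (active, inactive, failed, ...)
    # and exits non-zero unless the unit is active.
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    [ -n "$state" ] || state=unknown
    printf '%-24s %s\n' "$unit" "$state"
    [ "$state" = "active" ] || failed=1
  done
  return "$failed"
}

# Usage on the appliance console:
#   check_operayde_services || echo "at least one service needs attention"
```

For any unit that is not reported as active, `systemctl status <unit>` and `journalctl -u <unit>` give the detail.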

Common causes

| Cause | Fix |
| --- | --- |
| Network cable disconnected | Reseat the cable; check switch port status |
| Firewall blocking outbound 443 | Add rule per deployment guide |
| DNS resolution failure | Verify DNS config; check /etc/resolv.conf on appliance |
| NTP drift > 30 seconds | Fix NTP; the appliance rejects JWTs with clock skew > 30s |
| Disk full | Check with df -h; clear old rotated logs in /var/log/operayde/ |
| Service crashed | Check journalctl -u operayde-gateway --since "10 min ago" |
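Clock skew is easy to overlook because the appliance may otherwise look healthy. One rough way to estimate it is to compare the local clock with the `Date` header of an HTTPS response. This is a sketch under stated assumptions: the function names are hypothetical, GNU `date` is assumed for `date -d`, the remote server's clock is assumed to be accurate, and the 30-second default mirrors the limit in the table above.

```shell
# Absolute difference between two epoch timestamps, in seconds.
clock_skew_seconds() {
  skew=$(( $1 - $2 ))
  [ "$skew" -ge 0 ] || skew=$(( 0 - skew ))
  echo "$skew"
}

# Hypothetical helper: compare the local clock with the Date header of a URL.
# Succeeds when the skew is within the limit (default 30 seconds).
check_clock_skew() {
  url=$1
  max=${2:-30}
  # -D - writes the response headers to stdout; the body is discarded.
  remote_date=$(curl -sS -o /dev/null -D - "$url" | tr -d '\r' |
    sed -n 's/^[Dd]ate: //p' | head -n 1)
  [ -n "$remote_date" ] || { echo "no Date header from $url" >&2; return 2; }
  remote_epoch=$(date -u -d "$remote_date" +%s) || return 2   # GNU date only
  skew=$(clock_skew_seconds "$(date -u +%s)" "$remote_epoch")
  echo "clock skew: ${skew}s (limit ${max}s)"
  [ "$skew" -le "$max" ]
}

# Usage (substitute your region):
#   check_clock_skew "https://ops.<region>.operayde.com/healthz/live"
```

The `Date` header has one-second resolution and includes network delay, so treat the result as an estimate; the appliance's NTP client status remains the authoritative check.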

High latency

Symptoms

  • Chat completions take > 5 seconds for short prompts.
  • Users report slow responses.
  • Gateway metrics show elevated p99 latency.

Diagnosis

# Check GPU utilisation (Starter/Pro only)
nvidia-smi
 
# Check CPU and memory
top -bn1 | head -20
 
# Check inference queue depth
curl -s http://127.0.0.1:9090/metrics | grep operayde_inference_queue_depth
 
# Check active sessions
curl -s http://127.0.0.1:9090/metrics | grep operayde_active_sessions
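When scripting these checks, it can help to extract only the numeric value of a metric instead of the whole matching line. The sketch below assumes the endpoint serves the Prometheus text format without trailing timestamps; `metric_value` is a hypothetical helper, not a shipped tool.

```shell
# Hypothetical helper: print the sample value(s) of one metric from
# Prometheus text-format input on stdin.
# Matches both `name value` and `name{labels} value`; skips # HELP / # TYPE lines.
# Assumes samples carry no trailing timestamp, so the value is the last field.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }
    $1 == name || index($1, name "{") == 1 { print $NF }
  '
}

# Usage on the appliance console:
#   curl -s http://127.0.0.1:9090/metrics | metric_value operayde_inference_queue_depth
#   curl -s http://127.0.0.1:9090/metrics | metric_value operayde_active_sessions
```

Unlike a plain `grep`, the exact-name match does not pick up other metrics that merely share the same prefix.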

Common causes

| Cause | Fix |
| --- | --- |
| GPU memory exhausted | Reduce concurrent sessions or upgrade tier |
| CPU thermal throttling | Check airflow; verify ambient temperature < 35 °C |
| Too many concurrent requests | Reduce RPM limits on virtual keys |
| Large context windows | Reduce max_tokens or use a smaller model |
| OPA policy evaluation slow | Check bundle size; large data documents slow eval |
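To judge whether GPU memory is close to exhaustion, `nvidia-smi` can emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below turns that output into a percentage per GPU; `gpu_mem_percent` is a hypothetical helper, and what counts as "exhausted" for a given tier is left to the operator.

```shell
# Hypothetical helper: read "index, used_MiB, total_MiB" CSV rows on stdin
# and print the percentage of memory in use for each GPU (rounded down).
gpu_mem_percent() {
  awk -F', *' 'NF >= 3 && $3 > 0 {
    printf "GPU %s: %d%% memory used\n", $1, ($2 * 100) / $3
  }'
}

# Usage on an appliance with an NVIDIA GPU:
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_mem_percent
```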

Authentication failures

Symptoms

  • Users see "401 Unauthorized" or "403 Forbidden" errors.
  • Virtual key operations return authentication errors.
  • SSO redirect fails or loops.

Diagnosis

# Verify virtual key is active (prints the response body, then the HTTP status)
curl -s -w "\nHTTP %{http_code}\n" -H "Authorization: Bearer $KEY" \
  https://appliance.example.com/v1/models
 
# Check key status in portal: Keys > search for the key label
 
# Check OPA decision for a specific key
curl -s -X POST http://127.0.0.1:8181/v1/data/operayde/virtual_keys/allow \
  -H "Content-Type: application/json" \
  -d '{"input":{"action":"gateway.chat","params":{"virtual_key":{"revoked":false}}}}' | jq .

Common causes

| Cause | Fix |
| --- | --- |
| Key revoked | Create a new key in the portal |
| Key expired | Create a new key with a later expiry |
| RPM/TPD budget exhausted | Wait for the limit to reset or increase limits |
| Model not in allow list | Edit the key to add the requested model |
| Clock skew > 30 seconds | Fix NTP on the appliance |
| OIDC token expired | Sign out and sign in again |
| IdP group not mapped | Configure group mapping in portal settings |
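If you have a copy of the OIDC token that was rejected, its expiry can be read directly, because OIDC ID tokens are JWTs and carry an `exp` claim. The following is a rough sketch that decodes the payload without verifying the signature, so use it for diagnosis only. The function names are hypothetical, GNU `base64` is assumed, and the `sed` extraction is a simplification that expects a top-level numeric `exp`.

```shell
# Hypothetical helper: print the `exp` claim (epoch seconds) of a JWT.
# Does NOT verify the signature.
jwt_exp() {
  # The payload is the second dot-separated segment, base64url without padding.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that base64url strips.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d 2>/dev/null |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Seconds until the token expires (negative if it has already expired).
jwt_seconds_left() {
  exp=$(jwt_exp "$1")
  [ -n "$exp" ] || { echo "no exp claim found" >&2; return 1; }
  echo $(( exp - $(date +%s) ))
}

# Usage, with the token in a variable named TOKEN (an assumed name):
#   jwt_seconds_left "$TOKEN"
```

A small negative result, within the 30-second skew limit, points at clock drift rather than a genuinely expired session.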

Billing discrepancies

Symptoms

  • Invoice total does not match expected usage.
  • Usage dashboard shows different numbers than the invoice.
  • Missing or duplicate line items.

Diagnosis

  1. Compare time ranges — the invoice covers a fixed billing period, whereas the usage dashboard shows near-real-time data, so confirm that you are comparing the same dates.
  2. Check aggregation delay — usage data from appliances is batched and may have up to a 15-minute delay.
  3. Verify key attribution — usage is attributed to the key that made the request. If a key was reassigned mid-period, usage splits across both owners.

Resolution

  • For discrepancies > 5%, open a support ticket with the invoice ID, the affected date range, and the usage you expected.
  • Operayde support can run a reconciliation report that compares appliance-side metering with central aggregation.
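Whether a difference crosses the 5% mark can be worked out before opening a ticket. The sketch below assumes the percentage is measured relative to the expected amount, which this page does not specify; the function names are hypothetical and the amounts can be in any consistent unit (currency, tokens, requests).

```shell
# Hypothetical helper: absolute percentage difference between the invoiced
# and the expected amount, relative to the expected amount (an assumption).
discrepancy_pct() {
  awk -v inv="$1" -v exp_amt="$2" 'BEGIN {
    if (exp_amt == 0) { print "expected amount must be non-zero" > "/dev/stderr"; exit 1 }
    d = (inv - exp_amt) / exp_amt * 100
    if (d < 0) d = -d
    printf "%.1f\n", d
  }'
}

# Succeeds when the discrepancy exceeds the 5% threshold mentioned above.
exceeds_threshold() {
  pct=$(discrepancy_pct "$1" "$2") || return 2
  awk -v p="$pct" 'BEGIN { exit !(p > 5) }'
}

# Usage: invoiced 1100, expected 1000
#   discrepancy_pct 1100 1000                                  # prints 10.0
#   exceeds_threshold 1100 1000 && echo "open a support ticket"
```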

Diagnostic commands

Appliance health

# Overall health
curl -s https://appliance.example.com/v1/health | jq .
 
# Detailed metrics (localhost only)
curl -s http://127.0.0.1:9090/metrics | grep operayde_
 
# Gateway logs (last 100 lines)
journalctl -u operayde-gateway -n 100 --no-pager
 
# Audit emitter logs
journalctl -u operayde-audit-emitter -n 100 --no-pager
 
# Tunnel agent logs (central plane connectivity)
journalctl -u operayde-tunnel-agent -n 100 --no-pager

OPA policy debugging

# Check which bundle is loaded
curl -s http://127.0.0.1:8181/v1/policies | jq '.result[].id'
 
# Evaluate a test decision
curl -s -X POST http://127.0.0.1:8181/v1/data/operayde/rbac/tenant/allow \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "principal": {
        "realm": "tenant:YOUR_TENANT_ID",
        "groups": ["tenant-admin"]
      },
      "action": "config.list-virtual-keys",
      "params": {
        "tenant_id": "YOUR_TENANT_ID"
      }
    }
  }' | jq .

Network diagnostics

# Test central plane connectivity
curl -sS -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
  https://ops.<region>.operayde.com/healthz/live
 
# Test DNS resolution
dig appliance.<your-domain>
 
# Check open connections
ss -tnp | grep operayde

Log locations

| Log | Location | Rotation |
| --- | --- | --- |
| Gateway | journalctl -u operayde-gateway | systemd journal, 500 MB max |
| Audit emitter | journalctl -u operayde-audit-emitter | systemd journal, 500 MB max |
| Tunnel agent | journalctl -u operayde-tunnel-agent | systemd journal, 500 MB max |
| OPA | journalctl -u operayde-opa | systemd journal, 200 MB max |
| System | /var/log/syslog | logrotate, 7 days |
| Inference engine | /var/log/operayde/inference.log | logrotate, 1 GB max, 3 rotations |
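Because a full disk is one of the listed causes of an offline appliance, it can be worth checking how full the volume holding these logs is. This is a minimal sketch: the function names are hypothetical, the 90% default is an arbitrary example rather than a documented limit, and `/var/log` is assumed to be the relevant mount.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the use
# percentage (without the % sign) of the first filesystem row.
df_use_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds when usage of the given path is at or above the threshold.
# Defaults: path /var/log, threshold 90 (an arbitrary example value).
disk_nearly_full() {
  used=$(df -P "${1:-/var/log}" | df_use_percent)
  [ -n "$used" ] && [ "$used" -ge "${2:-90}" ]
}

# Usage on the appliance console:
#   disk_nearly_full /var/log 90 && echo "log volume is nearly full"
#   journalctl --disk-usage    # space currently used by the systemd journal
```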

Escalation path

| Severity | Description | Response time | Channel |
| --- | --- | --- | --- |
| P1 — Critical | Appliance down, data loss risk, security incident | 1 hour | Emergency hotline + portal ticket |
| P2 — High | Degraded performance, partial outage, auth failure | 4 hours | Portal ticket |
| P3 — Medium | Non-critical bug, billing question, config assistance | 1 business day | Portal ticket or email |
| P4 — Low | Feature request, documentation question | 3 business days | Portal ticket or email |

How to open a ticket

  1. Go to Settings > Support in the portal.
  2. Select the severity level.
  3. Describe the issue with as much detail as possible.
  4. Attach relevant logs (use the Collect diagnostics button to generate a support bundle automatically).

Collecting a support bundle

From the portal: Appliances > [your appliance] > Collect diagnostics.

This generates a tarball containing:

  • Gateway, audit, and tunnel agent logs (last 24 hours)
  • OPA bundle metadata (not the policies themselves)
  • System resource usage snapshots
  • Network connectivity test results
  • Anonymised configuration (no secrets or keys)

The bundle is uploaded to Operayde support automatically. You can also download it for your own review.