
Monitoring — logs and sync health

A healthy Polaris Express install has two log streams worth watching: the server stream (Pino on stdout from the web container) and the device stream (structured logs shipped from iOS through Sync runs). This runbook covers day-to-day inspection, what “healthy” looks like, and which signals matter.

  • Daily, briefly — glance at error counts in the admin console.
  • After a release — tail logs for 10–15 minutes following a docker compose up -d.
  • On a user report — pull the affected device’s log timeline.
  • Weekly — check device_logs table size and retention.

Two producers, one wire format (OTelLogRecord):

  • web (server)
    • Pino → stdout → docker logs
  • iOS app (device)
    • swift-log → JSONL ring buffer
    • POST /api/devices/me/state/sync
    • Postgres device_logs
    • SSE /api/admin/devices/{id}/logs-stream
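Both producers emit the same record shape. An illustrative record follows (field names match the OTel log data model and the attributes used elsewhere in this runbook; the values are made up):

```json
{
  "observed_timestamp": "2024-05-14T09:31:02.114Z",
  "severity_number": 17,
  "severity_text": "ERROR",
  "body": "sync flush failed",
  "attributes": {
    "category": "sync",
    "req_id": "abc-123",
    "duration_ms": 1523
  }
}
```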

Both sides scrub PII (emails, JWTs, bearer tokens, E.164 phone numbers, Authorization / Cookie / card_* keys) before anything hits disk or wire. You do not need to add a scrubber.

Tail the server stream:

```sh
docker compose logs -f --tail=200 web
```

Pino emits JSON. For human-readable output, pipe through pino-pretty:

```sh
docker compose logs -f --no-log-prefix --tail=200 web | docker run -i --rm node:20 \
  npx -y pino-pretty
```

(--no-log-prefix drops the container-name prefix so each line reaches pino-pretty as bare JSON.)
| severity_text | Pino name | When to care |
| --- | --- | --- |
| TRACE / DEBUG | trace / debug | Only when LOG_LEVEL is set below info. Noisy. |
| INFO | info | Normal traffic. Sample, don't read. |
| WARN | warn | Worth a scan. Often recoverable. |
| ERROR | error | Page-worthy if sustained. Always investigate. |
| FATAL | fatal | Process is dying. Restart loops likely. |
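The severity_number thresholds used in the jq filters come from the OTel log data model, where each named level opens a numeric range: TRACE=1, DEBUG=5, INFO=9, WARN=13, ERROR=17, FATAL=21. A minimal sketch of the mapping:

```python
# OTel log data model: first severity_number of each named range.
OTEL_SEVERITY = {
    "trace": 1,
    "debug": 5,
    "info": 9,
    "warn": 13,
    "error": 17,
    "fatal": 21,
}

def at_least(record: dict, level: str) -> bool:
    """True when a record is at `level` or worse (ERROR and FATAL for "error")."""
    return record["severity_number"] >= OTEL_SEVERITY[level]
```

This is why a filter on severity_number >= 17 catches ERROR and FATAL but nothing below.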
Useful jq filters (--no-log-prefix strips the container-name prefix so jq sees bare JSON):

```sh
# Errors and worse, last 1000 lines
docker compose logs --no-log-prefix --tail=1000 web | \
  jq 'select(.severity_number >= 17)'

# Anything tagged with a specific request id
docker compose logs --no-log-prefix --tail=5000 web | \
  jq 'select(.attributes.req_id == "abc-123")'

# Slow handlers
docker compose logs --no-log-prefix --tail=5000 web | \
  jq 'select(.attributes.duration_ms > 1000)'
```

iOS devices buffer logs locally in a 5 MB ring and flush up to 100 records on each Sync run. The server inserts them into device_logs keyed by (device_id, seq) and they appear in the admin console.
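The buffering behaviour can be sketched as follows. This is illustrative Python (the real client is Swift, and the LogRing name is hypothetical); what matters is the byte-bounded ring and the per-device monotonic seq:

```python
from collections import deque

class LogRing:
    """Byte-bounded ring buffer: oldest records are dropped once the cap is hit."""

    def __init__(self, max_bytes: int = 5 * 1024 * 1024):
        self.records = deque()   # (seq, line) pairs, oldest first
        self.used = 0
        self.max_bytes = max_bytes
        self.next_seq = 1        # seq is monotonic per device, never reused

    def append(self, line: str) -> None:
        self.records.append((self.next_seq, line))
        self.next_seq += 1
        self.used += len(line)
        while self.used > self.max_bytes:        # ring behaviour: evict oldest
            _, dropped = self.records.popleft()
            self.used -= len(dropped)

    def take_batch(self, limit: int = 100) -> list:
        """Drain up to `limit` records, oldest first, for one sync flush."""
        batch = []
        while self.records and len(batch) < limit:
            seq, line = self.records.popleft()
            self.used -= len(line)
            batch.append((seq, line))
        return batch
```

Because each record carries its seq, the server's (device_id, seq) key makes re-delivery after a failed flush idempotent.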

To browse a device's logs in the console:

  1. Sign in to the admin console.
  2. Open Devices → [device].
  3. Scroll to the Logs card.
  4. Use the severity, category, and time-range filters at the top.
  5. Click Live tail to open the SSE stream.

The card paginates with keyset pagination (beforeSeq / afterSeq), so jumping deep into history is cheap.
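The same beforeSeq cursor works for scripted exports. A sketch of the paging loop, where fetch_page stands in for whatever HTTP call you make against the logs endpoint:

```python
def iter_history(fetch_page, page_size=100):
    """Walk a device's log history newest → oldest with keyset pagination.

    fetch_page(before_seq, limit) must return records sorted by seq
    descending, strictly older than before_seq (None = start at the tip).
    """
    before_seq = None
    while True:
        page = fetch_page(before_seq, page_size)
        if not page:
            return
        yield from page
        before_seq = page[-1]["seq"]  # cursor: oldest seq seen so far
```

Unlike OFFSET pagination, every page is a single index-range scan on (device_id, seq), so page 500 costs the same as page 1.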

For scripted checks:

```sh
# Most recent 100 errors+ for a device
curl -s --cookie "$ADMIN_COOKIE" \
  "https://admin.example.com/api/admin/devices/$DEVICE_ID/logs?severity=ERROR&limit=100" \
  | jq '.logs[] | {ts: .observed_timestamp, msg: .body, cat: .attributes.category}'
```

```sh
# Live tail with curl (Ctrl-C to stop)
curl -N --cookie "$ADMIN_COOKIE" \
  "https://admin.example.com/api/admin/devices/$DEVICE_ID/logs-stream"
```

Device logs only arrive if devices are syncing. Watch these signals.

  • Each managed device produces at least one log batch per 15 minutes during business hours.
  • device_logs.observed_ts for the latest row per device is within the device’s expected Adaptive Cadence window (default ≤ 5 min for active devices).
  • Server-side device_logs_size_alarm cron has not fired in the last 24 h.

Find devices that haven’t synced in an hour:

```sql
SELECT
  d.id,
  d.name,
  MAX(dl.observed_ts) AS last_log   -- NULL when the device has never logged
FROM devices d
LEFT JOIN device_logs dl ON dl.device_id = d.id
GROUP BY d.id, d.name
HAVING COALESCE(MAX(dl.observed_ts), '1970-01-01'::timestamptz)
     < now() - interval '1 hour'
ORDER BY last_log NULLS FIRST;
```

Causes, in rough order of likelihood:

  1. Device is offline / app backgrounded for too long.
  2. App was force-quit; logs are queued but won’t flush until next foreground.
  3. Cert pinning or DNS broke after a hostname change.
  4. The device’s auth token expired and the user hasn’t signed back in.

Failed POST /api/devices/me/state/sync requests will show up in the server log stream:

```sh
docker compose logs --no-log-prefix --tail=10000 web | \
  jq 'select(.req.url == "/api/devices/me/state/sync" and .res.statusCode >= 400)'
```

A spike of 401s usually means an auth-token rotation issue; 409s usually mean an idempotency-key collision (benign — the client should retry).
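When you suspect a spike, a quick per-code tally is more useful than scrolling. A sketch that counts sync-route status codes in a JSONL capture (field names match the pino-http-style req/res attributes used above):

```python
import json
from collections import Counter

def sync_status_histogram(lines):
    """Count response codes for the sync endpoint in a JSONL log stream."""
    counts = Counter()
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise (startup banners, etc.)
        req, res = rec.get("req", {}), rec.get("res", {})
        if req.get("url") == "/api/devices/me/state/sync":
            counts[res.get("statusCode")] += 1
    return counts
```

Feed it the saved output of docker compose logs and look for a dominant 401 or 409 bucket.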

| Variable | Default | Required | Source | Notes |
| --- | --- | --- | --- | --- |
| LOG_LEVEL | info | no | web/.env | Pino level: trace, debug, info, warn, error, fatal. |
| LOG_FORMAT | json | no | web/.env | Set to pretty only in dev; JSON is what shippers need. |
| DEVICE_LOG_RETENTION_DAYS | 7 | no | web/.env | Pruned every 6 h by device_logs_retention_prune. |
| DEVICE_LOG_SIZE_ALARM_BYTES | 1073741824 | no | web/.env | 1 GB. Alarmed by the daily cron when device_logs exceeds this. |
| SSE_MAX_CONNECTIONS | 100 | no | web/.env | Cap on concurrent /logs-stream subscribers. |
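Assembled into web/.env, the defaults above look like this (only set what you need to change):

```sh
# web/.env — logging-related knobs (values shown are the defaults)
LOG_LEVEL=info
LOG_FORMAT=json
DEVICE_LOG_RETENTION_DAYS=7
# 1 GB
DEVICE_LOG_SIZE_ALARM_BYTES=1073741824
SSE_MAX_CONNECTIONS=100
```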

After a fresh install or upgrade, run through this checklist.

  1. Server is logging JSON

    ```sh
    docker compose logs --no-log-prefix --tail=5 web | jq '.severity_text' | sort -u
    ```

    Expect a non-empty list ("INFO" at minimum). If jq complains about parse errors, you have LOG_FORMAT set to pretty — fine in dev, wrong in prod.

  2. Device logs are landing

    ```sql
    SELECT count(*), max(observed_ts) FROM device_logs;
    ```

    Within 15 minutes of devices being online, count should be nonzero and max(observed_ts) should be recent.

  3. Admin console can read them

    Open Devices → [any device] → Logs. You should see rows. Click Live tail, watch for a heartbeat or any incoming record.

  4. Retention cron is registered

    ```sh
    docker compose logs web | grep -i 'device_logs_retention_prune'
    ```

    You should see the cron register at boot and fire every 6 h.

No device logs are landing. Check the sync endpoint for 4xx/5xx in the server log:

```sh
docker compose logs --no-log-prefix web | jq 'select(.req.url | test("state/sync"))'
```

If you see only successes but device_logs stays empty, the client build may predate the log-shipping change (the optional logs field in the sync envelope). Update the iOS app.

Live tail shows nothing despite new rows appearing. You’re probably behind a proxy that buffers responses. Disable buffering for /api/admin/devices/*/logs-stream:

```nginx
location ~ ^/api/admin/devices/.+/logs-stream$ {
  proxy_buffering off;
  proxy_cache off;
  proxy_read_timeout 1h;
  proxy_pass http://web;
}
```

Also confirm you aren't running multiple web replicas.

device_logs table is enormous. Two likely causes:

  1. Retention cron failed to run. Check docker compose logs web | grep device_logs_retention_prune.

  2. A misbehaving device is spamming logs. Find it:

    ```sql
    SELECT device_id, count(*) AS n FROM device_logs
    WHERE observed_ts > now() - interval '1 day'
    GROUP BY device_id ORDER BY n DESC LIMIT 10;
    ```

    If one device dominates, inspect its log stream for a tight error loop and consider remote-disabling it via the admin console until you can patch the app.

Pino is logging objects as [Object]. You’re piping through a pretty formatter that doesn’t expand nested attributes. Use raw JSON + jq instead.

This runbook only performs reads. The only mutation paths are:

  • Changing LOG_LEVEL — restart web to revert.
  • Manual TRUNCATE device_logs — irreversible. Take a pg_dump of the table first if you’re not certain.

When device_logs exceeds ~5 GB sustained, or you want full-text search across iOS and server logs together, swap the sink to Grafana Loki (or VictoriaLogs) without changing the wire format. See the log-format reference for the migration outline.

  • Backups — make sure device_logs and the rest of Postgres are getting snapshotted.
  • Upgrades — what to watch in the log stream during and after a deploy.