
Monitoring — logs and sync health

A healthy Polaris Express install has two log streams worth watching: the server stream (Pino on stdout from the web container) and the device stream (structured logs shipped from iOS through Sync runs). This runbook covers day-to-day inspection, what “healthy” looks like, and which signals matter.

  • Daily, briefly — glance at error counts in the admin console.
  • After a release — tail logs for 10–15 minutes following a docker compose up -d.
  • On a user report — pull the affected device’s log timeline.
  • Weekly — check device_logs table size and retention.

Two producers, one wire format (OTelLogRecord):

  • web (server)
    • Pino → stdout → docker logs
  • iOS app (device)
    • swift-log → JSONL ring buffer
    • POST /api/devices/me/state/sync
    • Postgres device_logs
    • SSE /api/admin/devices/{id}/logs-stream
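Both producers emit the same record shape. An illustrative record follows (field names match the OTel log data model and the attributes used elsewhere in this runbook; the values are made up):

```json
{
  "observed_timestamp": "2024-05-14T09:31:02.114Z",
  "severity_number": 17,
  "severity_text": "ERROR",
  "body": "sync flush failed",
  "attributes": {
    "category": "sync",
    "req_id": "abc-123",
    "duration_ms": 1523
  }
}
```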

Both sides scrub PII (emails, JWTs, bearer tokens, E.164 phone numbers, Authorization / Cookie / card_* keys) before anything hits disk or wire. You do not need to add a scrubber.

Tail the server stream:

```sh
docker compose logs -f --tail=200 web
```

Pino emits JSON. For human-readable output, pipe through pino-pretty:

```sh
docker compose logs -f --no-log-prefix --tail=200 web | docker run -i --rm node:20 \
  npx -y pino-pretty
```

(--no-log-prefix drops the container-name prefix so each line reaches pino-pretty as bare JSON.)
| severity_text | Pino name | When to care |
| --- | --- | --- |
| TRACE / DEBUG | trace / debug | Only when LOG_LEVEL is set below info. Noisy. |
| INFO | info | Normal traffic. Sample, don't read. |
| WARN | warn | Worth a scan. Often recoverable. |
| ERROR | error | Page-worthy if sustained. Always investigate. |
| FATAL | fatal | Process is dying. Restart loops likely. |
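The severity_number thresholds used in the jq filters come from the OTel log data model, where each named level opens a numeric range: TRACE=1, DEBUG=5, INFO=9, WARN=13, ERROR=17, FATAL=21. A minimal sketch of the mapping:

```python
# OTel log data model: first severity_number of each named range.
OTEL_SEVERITY = {
    "trace": 1,
    "debug": 5,
    "info": 9,
    "warn": 13,
    "error": 17,
    "fatal": 21,
}

def at_least(record: dict, level: str) -> bool:
    """True when a record is at `level` or worse (ERROR and FATAL for "error")."""
    return record["severity_number"] >= OTEL_SEVERITY[level]
```

This is why a filter on severity_number >= 17 catches ERROR and FATAL but nothing below.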
Useful jq filters (--no-log-prefix strips the container-name prefix so jq sees bare JSON):

```sh
# Errors and worse, last 1000 lines
docker compose logs --no-log-prefix --tail=1000 web | \
  jq 'select(.severity_number >= 17)'

# Anything tagged with a specific request id
docker compose logs --no-log-prefix --tail=5000 web | \
  jq 'select(.attributes.req_id == "abc-123")'

# Slow handlers
docker compose logs --no-log-prefix --tail=5000 web | \
  jq 'select(.attributes.duration_ms > 1000)'
```

iOS devices buffer logs locally in a 5 MB ring and flush up to 100 records on each Sync run. The server inserts them into device_logs keyed by (device_id, seq) and they appear in the admin console.
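The buffering behaviour can be sketched as follows. This is illustrative Python (the real client is Swift, and the LogRing name is hypothetical); what matters is the byte-bounded ring and the per-device monotonic seq:

```python
from collections import deque

class LogRing:
    """Byte-bounded ring buffer: oldest records are dropped once the cap is hit."""

    def __init__(self, max_bytes: int = 5 * 1024 * 1024):
        self.records = deque()   # (seq, line) pairs, oldest first
        self.used = 0
        self.max_bytes = max_bytes
        self.next_seq = 1        # seq is monotonic per device, never reused

    def append(self, line: str) -> None:
        self.records.append((self.next_seq, line))
        self.next_seq += 1
        self.used += len(line)
        while self.used > self.max_bytes:        # ring behaviour: evict oldest
            _, dropped = self.records.popleft()
            self.used -= len(dropped)

    def take_batch(self, limit: int = 100) -> list:
        """Drain up to `limit` records, oldest first, for one sync flush."""
        batch = []
        while self.records and len(batch) < limit:
            seq, line = self.records.popleft()
            self.used -= len(line)
            batch.append((seq, line))
        return batch
```

Because each record carries its seq, the server's (device_id, seq) key makes re-delivery after a failed flush idempotent.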

To browse a device's logs in the console:

  1. Sign in to the admin console.
  2. Open Devices → [device].
  3. Scroll to the Logs card.
  4. Use the severity, category, and time-range filters at the top.
  5. Click Live tail to open the SSE stream.

The card paginates with keyset pagination (beforeSeq / afterSeq), so jumping deep into history is cheap.
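The same beforeSeq cursor works for scripted exports. A sketch of the paging loop, where fetch_page stands in for whatever HTTP call you make against the logs endpoint:

```python
def iter_history(fetch_page, page_size=100):
    """Walk a device's log history newest → oldest with keyset pagination.

    fetch_page(before_seq, limit) must return records sorted by seq
    descending, strictly older than before_seq (None = start at the tip).
    """
    before_seq = None
    while True:
        page = fetch_page(before_seq, page_size)
        if not page:
            return
        yield from page
        before_seq = page[-1]["seq"]  # cursor: oldest seq seen so far
```

Unlike OFFSET pagination, every page is a single index-range scan on (device_id, seq), so page 500 costs the same as page 1.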

For scripted checks:

```sh
# Most recent 100 errors+ for a device
curl -s --cookie "$ADMIN_COOKIE" \
  "https://admin.example.com/api/admin/devices/$DEVICE_ID/logs?severity=ERROR&limit=100" \
  | jq '.logs[] | {ts: .observed_timestamp, msg: .body, cat: .attributes.category}'
```

```sh
# Live tail with curl (Ctrl-C to stop)
curl -N --cookie "$ADMIN_COOKIE" \
  "https://admin.example.com/api/admin/devices/$DEVICE_ID/logs-stream"
```

Device logs only arrive if devices are syncing. Watch these signals.

  • Each managed device produces at least one log batch per 15 minutes during business hours.
  • device_logs.observed_ts for the latest row per device is within the device’s expected Adaptive Cadence window (default ≤ 5 min for active devices).
  • Server-side device_logs_size_alarm cron has not fired in the last 24 h.

Find devices that haven’t synced in an hour:

```sql
SELECT
  d.id,
  d.name,
  MAX(dl.observed_ts) AS last_log   -- NULL when the device has never logged
FROM devices d
LEFT JOIN device_logs dl ON dl.device_id = d.id
GROUP BY d.id, d.name
HAVING COALESCE(MAX(dl.observed_ts), '1970-01-01'::timestamptz)
     < now() - interval '1 hour'
ORDER BY last_log NULLS FIRST;
```

Causes, in rough order of likelihood:

  1. Device is offline / app backgrounded for too long.
  2. App was force-quit; logs are queued but won’t flush until next foreground.
  3. Cert pinning or DNS broke after a hostname change.
  4. The device’s auth token expired and the user hasn’t signed back in.

Failed POST /api/devices/me/state/sync requests will show up in the server log stream:

```sh
docker compose logs --no-log-prefix --tail=10000 web | \
  jq 'select(.req.url == "/api/devices/me/state/sync" and .res.statusCode >= 400)'
```

A spike of 401s usually means an auth-token rotation issue; 409s usually mean an idempotency-key collision (benign — the client should retry).
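When you suspect a spike, a quick per-code tally is more useful than scrolling. A sketch that counts sync-route status codes in a JSONL capture (field names match the pino-http-style req/res attributes used above):

```python
import json
from collections import Counter

def sync_status_histogram(lines):
    """Count response codes for the sync endpoint in a JSONL log stream."""
    counts = Counter()
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise (startup banners, etc.)
        req, res = rec.get("req", {}), rec.get("res", {})
        if req.get("url") == "/api/devices/me/state/sync":
            counts[res.get("statusCode")] += 1
    return counts
```

Feed it the saved output of docker compose logs and look for a dominant 401 or 409 bucket.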

| Variable | Default | Required | Source | Notes |
| --- | --- | --- | --- | --- |
| LOG_LEVEL | info | no | web/.env | Pino level: trace, debug, info, warn, error, fatal. |
| LOG_FORMAT | json | no | web/.env | Set to pretty only in dev; JSON is what shippers need. |
| DEVICE_LOG_RETENTION_DAYS | 7 | no | web/.env | Pruned every 6 h by device_logs_retention_prune. |
| DEVICE_LOG_SIZE_ALARM_BYTES | 1073741824 | no | web/.env | 1 GB. Alarmed by the daily cron when device_logs exceeds this. |
| SSE_MAX_CONNECTIONS | 100 | no | web/.env | Cap on concurrent /logs-stream subscribers. |
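Assembled into web/.env, the defaults above look like this (only set what you need to change):

```sh
# web/.env — logging-related knobs (values shown are the defaults)
LOG_LEVEL=info
LOG_FORMAT=json
DEVICE_LOG_RETENTION_DAYS=7
# 1 GB
DEVICE_LOG_SIZE_ALARM_BYTES=1073741824
SSE_MAX_CONNECTIONS=100
```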

After a fresh install or upgrade, run through this checklist.

  1. Server is logging JSON

    ```sh
    docker compose logs --no-log-prefix --tail=5 web | jq '.severity_text' | sort -u
    ```

    Expect a non-empty list ("INFO" at minimum). If jq complains about parse errors, you have LOG_FORMAT set to pretty — fine in dev, wrong in prod.

  2. Device logs are landing

    ```sql
    SELECT count(*), max(observed_ts) FROM device_logs;
    ```

    Within 15 minutes of devices being online, count should be nonzero and max(observed_ts) should be recent.

  3. Admin console can read them

    Open Devices → [any device] → Logs. You should see rows. Click Live tail, watch for a heartbeat or any incoming record.

  4. Retention cron is registered

    ```sh
    docker compose logs web | grep -i 'device_logs_retention_prune'
    ```

    You should see the cron register at boot and fire every 6 h.

No device logs are landing. Check the sync endpoint for 4xx/5xx in the server log:

```sh
docker compose logs --no-log-prefix web | jq 'select(.req.url | test("state/sync"))'
```

If you see only successes but device_logs stays empty, the client build may predate the log-shipping change (the optional logs field in the sync envelope). Update the iOS app.

Live tail shows nothing despite new rows appearing. You’re probably behind a proxy that buffers responses. Disable buffering for /api/admin/devices/*/logs-stream:

```nginx
location ~ ^/api/admin/devices/.+/logs-stream$ {
  proxy_buffering off;
  proxy_cache off;
  proxy_read_timeout 1h;
  proxy_pass http://web;
}
```

Also confirm you aren't running multiple web replicas.

device_logs table is enormous. Two likely causes:

  1. Retention cron failed to run. Check docker compose logs web | grep device_logs_retention_prune.

  2. A misbehaving device is spamming logs. Find it:

    ```sql
    SELECT device_id, count(*) AS n FROM device_logs
    WHERE observed_ts > now() - interval '1 day'
    GROUP BY device_id ORDER BY n DESC LIMIT 10;
    ```

    If one device dominates, inspect its log stream for a tight error loop and consider remote-disabling it via the admin console until you can patch the app.

Pino is logging objects as [Object]. You’re piping through a pretty formatter that doesn’t expand nested attributes. Use raw JSON + jq instead.

This runbook only performs reads. The only mutation paths are:

  • Changing LOG_LEVEL — restart web to revert.
  • Manual TRUNCATE device_logs — irreversible. Take a pg_dump of the table first if you’re not certain.

When device_logs exceeds ~5 GB sustained, or you want full-text search across iOS and server logs together, swap the sink to Grafana Loki (or VictoriaLogs) without changing the wire format. See the log-format reference for the migration outline.

  • Backups — make sure device_logs and the rest of Postgres are getting snapshotted.
  • Upgrades — what to watch in the log stream during and after a deploy.