Operator Runbook

This is a minimal operational checklist for keeping Blueprint services healthy and safe in production.

Daily Checks

Verify the Blueprint Manager process is running and connected to its RPC and WS endpoints.
Confirm service heartbeats are progressing (no sustained gaps).
Review job error rates and retry spikes.
Check disk usage for cache + data directories.
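The disk usage check can be scripted. The following is a minimal sketch in Python, not part of the Blueprint tooling: the function names and the 85% threshold are illustrative assumptions, and the directory paths must be replaced with the cache and data directories actually used by your deployment.

```python
import shutil


def disk_usage_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def over_threshold(paths, threshold=0.85):
    """Return (path, fraction) pairs whose filesystem usage is at or above `threshold`.

    The 0.85 default is an assumed alert level, not a documented Blueprint value.
    """
    flagged = []
    for path in paths:
        fraction = disk_usage_fraction(path)
        if fraction >= threshold:
            flagged.append((path, fraction))
    return flagged


if __name__ == "__main__":
    # Hypothetical path: substitute your real cache and data directories.
    for path, fraction in over_threshold(["."]):
        print(f"WARNING: {path} is {fraction:.0%} full")
```

Running this from cron or a monitoring agent once a day is usually enough; alert earlier if artifact downloads can fill the disk quickly.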

Key Signals to Watch

Heartbeat drift: late or missing heartbeats can trigger QoS degradation.
Job queue backlog: growing queues indicate capacity pressure.
RPC latency: slow RPCs lead to missed service events.
Crash loops: repeated restarts usually imply config or artifact issues.
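As an illustration of the heartbeat drift signal, the sketch below flags gaps between consecutive heartbeats that exceed the expected interval by some tolerance. It assumes you already have heartbeat timestamps (in seconds) from your own logs or metrics; the function name, the input format, and the 1.5x tolerance are assumptions made for this example, not Blueprint APIs or defaults.

```python
def find_heartbeat_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (previous, current, gap) tuples for late heartbeats.

    A heartbeat counts as late when the time since the previous one is
    greater than expected_interval * tolerance. All values are in seconds.
    """
    limit = expected_interval * tolerance
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap > limit:
            gaps.append((previous, current, gap))
    return gaps


if __name__ == "__main__":
    # Heartbeats expected every 10 s; one is missing between t=20 and t=50.
    print(find_heartbeat_gaps([0, 10, 20, 50, 60], expected_interval=10))
```

A single late heartbeat is often noise; sustained or repeated gaps are the signal worth alerting on.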

Incident Response

Pause new work by stopping the manager.
Capture logs + recent job failures for root-cause analysis.
Restore service with a known-good config and pinned artifact versions.
Run a small validation job before resuming full traffic.

Capacity Planning

Reserve headroom for spikes in service requests and simulations.
Size storage for artifacts + per-service data.
Isolate noisy workloads onto separate hosts when possible.
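One way to make "reserve headroom" concrete is to size the fleet so that peak load fits within only a portion of each host's capacity. The sketch below is a back-of-the-envelope calculation; the function name, the 30% default headroom, and the abstract load units are assumptions for illustration and should be replaced with measurements from your own services.

```python
import math


def hosts_needed(peak_load: float, per_host_capacity: float, headroom: float = 0.3) -> int:
    """Return the number of hosts required to serve `peak_load`.

    Each host is planned to run at no more than (1 - headroom) of its
    capacity, leaving the remainder free for spikes. `headroom` is a
    fraction in [0, 1).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    if per_host_capacity <= 0:
        raise ValueError("per_host_capacity must be positive")
    usable_capacity = per_host_capacity * (1 - headroom)
    return math.ceil(peak_load / usable_capacity)


if __name__ == "__main__":
    # 100 units of peak load, 20 units per host, 30% kept in reserve.
    print(hosts_needed(100, 20, headroom=0.3))
```

With these example numbers each host is planned for 14 units of load, so 100 units require 8 hosts rather than the 5 that a zero-headroom calculation would suggest.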

Security Hygiene

Keep keystores isolated and use least-privilege access.
Rotate operator keys on a regular schedule.
Use separate RPC credentials per environment.
