controller-runtime: Professional
1. What "running an operator in production" means
You're not just running a Go program; you're running a privileged control-plane component on a shared Kubernetes cluster. That changes the engineering job:
- Availability. The operator is the only thing that reconciles its CRs. If it's down, those resources stop converging — even if the rest of the cluster is healthy.
- Blast radius. A buggy reconciler can update thousands of objects in a tight loop. Rate limits, scoping, and rollback strategies aren't optional.
- Multi-tenancy. RBAC, namespace scoping, and webhook safety controls determine what the operator can break.
- Observability. "Did the operator do the right thing?" is the only question that ever matters in an incident, and the answer must be evident from metrics and logs.
The rest of this file is the production checklist.
2. Leader election
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
	LeaderElection:                true,
	LeaderElectionID:              "widget-operator.example.com",
	LeaderElectionNamespace:       "widget-system",
	LeaderElectionResourceLock:    "leases",
	LeaseDuration:                 ptr.To(15 * time.Second),
	RenewDeadline:                 ptr.To(10 * time.Second),
	RetryPeriod:                   ptr.To(2 * time.Second),
	LeaderElectionReleaseOnCancel: true,
})
The defaults are reasonable. The settings to think about:
| Knob | Effect |
|---|---|
| LeaseDuration | How long another candidate must wait before assuming leadership |
| RenewDeadline | If the leader can't renew in this window, it gives up leadership and stops processing |
| RetryPeriod | How often candidates poll the lease |
| LeaderElectionReleaseOnCancel: true | On graceful shutdown, hand off immediately; without it, failover waits LeaseDuration |
Always enable leader election when running > 1 replica. Without it, two managers process the same events and fight over writes — exactly the split-brain you wanted to avoid.
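3. Health and readiness probes
// Both registrations return an error; a manager without probes should not start.
if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
	setupLog.Error(err, "unable to set up health check")
	os.Exit(1)
}
if err := mgr.AddReadyzCheck("informers", func(req *http.Request) error {
	if !mgr.GetCache().WaitForCacheSync(req.Context()) {
		return errors.New("cache not synced")
	}
	return nil
}); err != nil {
	setupLog.Error(err, "unable to set up ready check")
	os.Exit(1)
}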
The contract for a Kubernetes-aware operator:
- Liveness: always succeed unless the process is wedged. Failing liveness restarts the pod — only use for true deadlocks.
- Readiness: succeed once the cache is synced and we're the leader (or running non-leader-elected). A non-leader pod is running but not ready in this model: the Service won't send traffic, and kubectl rollout status waits for it.
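The readiness rule above (cache synced and leadership held) can be sketched as a single check function. This is a minimal standalone illustration using only the standard library: `readyCheck` is a hypothetical helper, and the `elected` channel and `synced` callback are stand-ins for what a real manager would supply (for example the channel returned by `mgr.Elected()`, which is closed once the manager becomes leader or when leader election is disabled).

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// readyCheck returns a probe function in the func(*http.Request) error shape
// that AddReadyzCheck accepts. Both inputs are stand-ins for manager state.
func readyCheck(elected <-chan struct{}, synced func() bool) func(*http.Request) error {
	return func(_ *http.Request) error {
		if !synced() {
			return errors.New("cache not synced")
		}
		select {
		case <-elected:
			return nil // channel closed: this replica holds the lease
		default:
			return errors.New("not the leader")
		}
	}
}

func main() {
	elected := make(chan struct{})
	check := readyCheck(elected, func() bool { return true })
	fmt.Println("before election:", check(nil))
	close(elected) // a manager closes its channel once leadership is acquired
	fmt.Println("after election:", check(nil))
}
```

The non-blocking `select` matters: a probe handler that waits for leadership would hang every readiness request on the standby replica instead of failing it quickly.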
In the Deployment:
livenessProbe:
  httpGet: { path: /healthz, port: 8081 }
  initialDelaySeconds: 15
readinessProbe:
  httpGet: { path: /readyz, port: 8081 }
  periodSeconds: 5
4. Metrics and dashboards
The manager exposes Prometheus metrics on metricsserver.Options.BindAddress. The non-negotiable panels for an on-call dashboard:
| Panel | Metric (example) | Alert when |
|---|---|---|
| Reconcile error rate | rate(controller_runtime_reconcile_errors_total[5m]) | Sustained > 0.1/s |
| Reconcile latency p99 | histogram_quantile(0.99, sum by (le) (rate(controller_runtime_reconcile_time_seconds_bucket[5m]))) | > 1s |
| Workqueue depth | workqueue_depth{name="widget-controller"} | Climbing without plateau |
| Workqueue oldest unfinished | workqueue_unfinished_work_seconds{name=...} | > 60s |
| Leader status | leader_election_master_status | No leader for > 30s |
| API server QPS | rest_client_requests_total (rate) | Higher than your client-go QPS budget |
The last one is sneaky: a misconfigured controller can overwhelm its own API server. The rest.Config carries QPS and Burst; ctrl.GetConfig sets them to 20 and 30 when they are unset (client-go's own defaults are 5 and 10). For a single operator pod that's enough; if you run many controllers, raise these and keep an eye on the metric.
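5. Structured logging
import (
	"go.uber.org/zap/zapcore"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)
// Production mode (Development: false) emits JSON; timestamps use RFC 3339.
ctrl.SetLogger(zap.New(zap.UseFlagOptions(&zap.Options{
	Development: false,
	TimeEncoder: zapcore.RFC3339TimeEncoder,
})))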
Production log discipline:
- Info for state transitions ("reconciled", "finalizer added", "deleted"). Skip "starting reconcile" lines; there are too many.
- Error only for actionable errors. A NotFound you swallowed is not an error. A failed Update after retries is.
- Always include the object key via ctrl.LoggerFrom(ctx); the manager injects controller=..., name=..., namespace=....
- No per-reconcile logs without correlation IDs. Status-spam logs hide the line you actually want to see.
6. RBAC scoping in practice
The naïve operator has a ClusterRole granting * on all the kinds it touches. The production operator scopes by:
- Namespace: if your CRs only exist in one namespace, switch the manager to namespaced mode (DefaultNamespaces) and grant a Role instead of a ClusterRole.
- Verbs: never grant delete if you don't delete. Never grant * if you can enumerate.
- Resources: secrets is the highest-value subset; grant it only where necessary, ideally with resourceNames if you only need specific Secrets.
- Sub-resources: pods/exec, pods/portforward, */scale are separately grantable. Don't ask for them unless used.
- apiGroups: ["apps.example.com"]
  resources: ["widgets"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["apps.example.com"]
  resources: ["widgets/status", "widgets/finalizers"]
  verbs: ["update", "patch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
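Run kubectl auth can-i --as=system:serviceaccount:widget-system:widget-operator list secrets -A after deploying to verify what the operator can actually do. The --as value must name the ServiceAccount the Deployment runs under (widget-operator in the manifest in section 9).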
7. Multi-namespace patterns
Three deployment topologies:
| Topology | When |
|---|---|
| One cluster-wide operator | Cluster-scoped CRDs or shared platform feature |
| One operator per namespace (no leader election across) | Strict tenant isolation; each tenant gets its own pod |
| One operator, namespaced cache, cluster-wide CRDs | Most common middle ground — operator only touches enumerated namespaces |
For the third, the cache config:
Cache: cache.Options{
	DefaultNamespaces: map[string]cache.Config{
		"tenant-a": {},
		"tenant-b": {},
	},
},
The manager then lists and watches only those namespaces. The namespace list is fixed when the cache starts, so adding a namespace means changing the configuration and restarting the pod.
8. Graceful shutdown
ctx := ctrl.SetupSignalHandler()
if err := mgr.Start(ctx); err != nil {
	setupLog.Error(err, "manager exited")
	os.Exit(1)
}
SetupSignalHandler installs handlers for SIGINT and SIGTERM. On signal:
- The returned context is cancelled.
- Each Runnable's context is cancelled with it.
- Controllers stop pulling from their work-queues but complete in-flight reconciles.
- The webhook server stops accepting connections after a drain period.
- Leader election is released (with LeaderElectionReleaseOnCancel: true).
- mgr.Start returns.
Kubernetes' terminationGracePeriodSeconds (default 30s) must be longer than your slowest reconcile, or the pod is SIGKILLed mid-write and a half-applied state ships. For a controller that talks to external systems, 60–120s is reasonable; document it.
9. Deployment manifest checklist
A minimal but real Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: widget-operator
  namespace: widget-system
spec:
  replicas: 2
  selector:
    matchLabels: { app: widget-operator }
  template:
    metadata:
      labels: { app: widget-operator }
    spec:
      serviceAccountName: widget-operator
      terminationGracePeriodSeconds: 60
      securityContext:
        runAsNonRoot: true
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: manager
          image: ghcr.io/example/widget-operator:v0.4.2
          args: ["--leader-elect", "--metrics-bind-address=:8080"]
          ports:
            - { containerPort: 8080, name: metrics }
            - { containerPort: 8081, name: probe }
            - { containerPort: 9443, name: webhook }
          resources:
            requests: { cpu: 100m, memory: 256Mi }
            limits: { cpu: 1, memory: 512Mi }
          env:
            - name: GOMEMLIMIT
              value: "450MiB"
          livenessProbe: { httpGet: { path: /healthz, port: probe } }
          readinessProbe: { httpGet: { path: /readyz, port: probe } }
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities: { drop: ["ALL"] }
Three details:
- replicas: 2 with leader election. Exactly one is active; the other is hot standby.
- GOMEMLIMIT matches the container limit minus headroom; see the memory-management module.
- Read-only root + dropped caps. An operator never needs to write the filesystem; if it does, you have a different problem.
10. Webhooks in production
Validating webhooks should:
- Fail closed for important checks (failurePolicy: Fail). A degraded webhook blocking creates is better than letting through invalid resources.
- Fail open for advisory checks (failurePolicy: Ignore). A label policy that occasionally lets a deploy through is better than an outage.
- Be fast: < 100ms p99. The API server calls webhooks synchronously on every admission request for the matching kinds.
- Be idempotent — they may be called multiple times for one create due to retries.
- Never depend on the webhook's own controller being ready — if the webhook gates resources the controller needs, you've built a deadlock.
Mutating webhooks have one extra rule: never change fields the user owns. Set defaults, add labels, write annotations. Don't override spec.replicas.
11. CRD lifecycle and conversion
Your CRDs are part of the operator deployment. Two strategies:
- Operator installs CRDs. The operator's Helm chart includes the CRD YAML; controller-gen generates it. Simple but couples upgrades.
- CRDs installed separately. A platform team installs them out-of-band; the operator only assumes they exist. Better for shared clusters.
For multi-version CRDs (you bumped v1alpha1 to v1), implement a conversion webhook:
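// Register the webhook for Widget; Complete returns an error that must be checked.
err := ctrl.NewWebhookManagedBy(mgr).For(&v1.Widget{}).Complete()
// In v1 (the hub): the marker method that satisfies conversion.Hub.
func (*Widget) Hub() {}
// In v1alpha1 (a spoke):
func (src *Widget) ConvertTo(dstRaw conversion.Hub) error { ... }
func (dst *Widget) ConvertFrom(srcRaw conversion.Hub) error { ... }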
Pick one hub version that knows the full schema; spokes convert to/from it. The API server invokes the webhook when a client asks for an object in a different version than is stored.
12. Error budgets for reconcilers
A useful SLO for an operator is "fraction of reconciles that converged in one step". Approximated as:
For most controllers, healthy is > 0.95. A drop means either:
- The cluster genuinely has a lot of churn (look at watched-kind change rate).
- Your reconcile is wrong and keeps not converging (look at top requeue reasons in logs).
Error budgets aren't just dashboarding — they're the contract that lets you say "this PR slowed the operator down by 12% of budget, hold the release".
13. Upgrade and rollback safety
Operator upgrades are control-plane upgrades. Two practical rules:
- The new operator must be safe with old CRs. If you renamed .spec.foo to .spec.bar, the upgrade must read both. Otherwise existing resources break during the rollout.
- The old operator must be safe with new CRs. During the rollout, both versions are running. The old version sees new resources and will likely fail to reconcile them. That's fine, but it must not panic or corrupt state.
In practice that means never deleting or renaming a CRD field in place. Add the new field, deprecate the old one, and remove it only in a later major API version. Use webhooks to enforce defaults so old objects work after the new operator starts.
14. Observability: events, logs, traces
| Signal | Use for |
|---|---|
| EventRecorder | User-facing lifecycle messages; show up in kubectl describe |
| Structured logs | Operator debugging; include reconcile key always |
| Metrics | Aggregate behavior; SLO tracking; alerting |
| Traces | Latency attribution across the reconcile + downstream API calls (for example, an OpenTelemetry span opened at the start of Reconcile) |
A real operator emits at least one Kubernetes Event per state transition on every CR — that's what end-users actually read.
15. Summary
Running an operator in production is leader election plus health probes plus scoped RBAC plus a tight observability loop on top of the reconcile pattern. Bound the blast radius with namespaced caches and label selectors. Treat webhooks as critical infrastructure and benchmark them. Build dashboards on controller_runtime_reconcile_* and workqueue_*, alert on error rates and oldest-unfinished-work. Never roll out a CRD change that breaks the previous operator version mid-rollout. The reconcile loop is simple; the operational discipline around it is what makes the operator boring, which is the goal.
Further reading
- Operator best practices (SDK): https://sdk.operatorframework.io/docs/best-practices/
- Kubernetes leader election design: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-machinery/leader-election.md
- Server-side apply field management: https://kubernetes.io/docs/reference/using-api/server-side-apply/#managers
- Webhook configuration: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
- CRD versioning: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/