Diagnostic Endpoints — Junior Level¶
Topic: Diagnostic Endpoints Roadmap
Focus: What a diagnostic endpoint is. Liveness vs readiness. Your first /healthz and /metrics. Why a running service exposes URLs that nobody outside the team is ever supposed to call.
Table of Contents¶
- Introduction
- Prerequisites
- Glossary
- Core Concepts
- Real-World Analogies
- Mental Models
- The First Toolkit
- Code Examples
- Health vs Metrics — Don't Confuse Them
- Use Cases
- Coding Patterns
- Clean Code
- Best Practices
- Edge Cases & Pitfalls
- Common Mistakes
- Tricky Points
- Test Yourself
- Tricky Questions
- Cheat Sheet
- Summary
- What You Can Build
- Further Reading
- Related Topics
- Diagrams & Visual Aids
Introduction¶
Focus: What is a diagnostic endpoint, and why does almost every backend service have a /healthz URL you've never typed into a browser?
A diagnostic endpoint is a URL (or RPC) that a running service exposes about itself rather than about your business data. /healthz doesn't return an order or a user — it returns "am I alive?". /metrics doesn't return a product page — it returns counts and timings of everything the process has done. /version returns the exact build that's running right now. These endpoints exist so that other software — a load balancer, Kubernetes, a Prometheus scraper, an on-call engineer with curl — can ask a live process questions without stopping it, attaching a debugger, or reading its source.
Here's the mental shift to make early: most of the code you write is for users. Diagnostic endpoints are for operators — the people and machines that keep your service running. A user never sees /healthz. But if /healthz is wrong, the user sees an outage. The endpoint that nobody looks at on a good day is the one everybody depends on during a bad one.
This page covers the two endpoints you'll meet first and use forever: health/readiness checks (the contract between your process and whatever is routing traffic to it) and /metrics (the numbers a monitoring system reads). We'll build a /healthz in Go, Python, Node, and Java/Spring, explain why liveness and readiness are two different questions, and explain why you must never put the database in a liveness check. The next level (middle.md) wires these into Kubernetes probes and adds profiling endpoints. senior.md covers the trade-offs and security. professional.md covers running this safely across a whole fleet.
🎓 Why this matters for a junior: The first time a deploy "succeeds" but the service serves errors, it's almost always a health-check bug — you reported "ready" before you actually were, or "alive" when you were wedged. Getting these two checks right is one of the highest-leverage things a junior can learn, because every service you ever touch has them, and most of them are subtly wrong.
Prerequisites¶
What you should know before reading this:
- Required: How to write and run a small HTTP server in at least one language (Go, Python/Flask/FastAPI, Node/Express, Java/Spring).
- Required: What an HTTP request, an HTTP status code (200, 503), and a URL path are.
- Required: The difference between a process (your running program) and the machine/container it runs on.
- Helpful: A rough idea of what a load balancer does — it sits in front of N copies of your service and sends each request to one of them.
- Helpful: A rough idea of what Kubernetes is: software that runs many copies of your container and restarts the ones that look unhealthy. See ../../../DevOps/.
- Helpful: Exposure to ../metrics/junior.md. This roadmap is about the endpoints; metrics is about the numbers those endpoints expose.
Glossary¶
| Term | Definition |
|---|---|
| Diagnostic endpoint | A URL/RPC a service exposes about its own state (health, metrics, version, profiles) rather than about business data. |
| Health check | An endpoint that answers "is this process okay?" with a status code. The umbrella term. |
| Liveness | "Is the process alive and not wedged?" If this fails, the right action is to restart the process. |
| Readiness | "Is the process ready to receive traffic right now?" If this fails, the right action is to stop sending it requests (but don't restart). |
| Startup probe | "Has the process finished booting?" Used for slow-starting apps so liveness doesn't kill them mid-boot. |
| Probe | The act of calling a health endpoint on a schedule. Kubernetes and load balancers "probe" your service. |
| /metrics | The conventional path where a service exposes counters/gauges/histograms in Prometheus text format. |
| Prometheus | A monitoring system that periodically scrapes (HTTP GETs) every service's /metrics and stores the numbers. |
| Scrape | One HTTP GET of /metrics by a monitoring system. |
| Load balancer (LB) | Routes incoming requests across multiple instances; uses readiness to decide which instances are eligible. |
| 200 OK / 503 Service Unavailable | The two status codes health endpoints overwhelmingly use: 200 = healthy, 503 = not. |
| Actuator | Spring Boot's built-in set of diagnostic endpoints (/actuator/health, /actuator/metrics, …). |
| expvar | Go's standard-library package that exposes runtime variables as JSON at /debug/vars. |
| pprof | Go's profiling endpoints at /debug/pprof/* (a middle.md topic, named here for vocabulary). |
| Admin port | A separate network port for diagnostic endpoints, so they aren't reachable from the public internet. |
Core Concepts¶
1. A Diagnostic Endpoint Talks About the Service, Not the Data¶
The handler for /orders returns orders. The handler for /healthz returns the service's opinion of itself. This is the defining distinction. When you see a path like /health, /ready, /metrics, /version, /debug/*, /actuator/* — you're looking at the control and introspection surface, not the product. Different audience, different rules, different security posture.
2. Liveness and Readiness Answer Two Different Questions¶
This is the single most important idea on this page, and the one juniors most often get wrong.
- Liveness = "Am I broken in a way only a restart can fix?" Examples of a real liveness failure: a deadlock where every request hangs forever, a corrupted in-memory state, an event loop that stopped turning. The orchestrator's response: kill and restart me.
- Readiness = "Should traffic come to me right now?" Examples of a real readiness failure: I'm still loading a 2 GB model into memory, my database connection pool is temporarily empty, I'm draining before shutdown. The orchestrator's response: leave me running, just stop routing to me.
If you swap them, disaster follows. Put "is the database reachable?" in your liveness check, and a 30-second database blip will make Kubernetes restart every single one of your pods at once — turning a recoverable hiccup into a full outage. (More on this exact failure in senior.md.)
3. The Status Code Is the Answer¶
Health endpoints communicate through the HTTP status code, not the body. By convention 200 means healthy and 503 means not. (Kubernetes HTTP probes count any status from 200 through 399 as a pass and anything else as a failure.) The body is for humans reading curl output and can say why, but the machine probing you only reads the code. Get the code right first.
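As a minimal sketch of how a prober reads a response, assuming Kubernetes-style HTTP probe semantics (any status from 200 up to, but not including, 400 is a pass). The function name `probe_passed` is made up for illustration:

```python
def probe_passed(status_code: int, body: str = "") -> bool:
    """Decide pass/fail the way an HTTP prober does: from the status code alone.

    The body parameter exists only to make the point that it is ignored.
    Assumption: Kubernetes-style semantics, 200 <= code < 400 is a pass.
    """
    return 200 <= status_code < 400


print(probe_passed(200, "ok"))                # True
print(probe_passed(500, '{"status": "ok"}'))  # False: the friendly JSON is never read
```

The second call is the trap to remember: a reassuring body cannot rescue a failing status code.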
4. Cheap Checks Beat Thorough Checks¶
A health check runs constantly — every few seconds, on every instance, forever. If your check does real work (queries the DB, calls another service), you've built a tiny load test that runs 24/7 and can amplify outages. A liveness check should be nearly free: "is my process responding to HTTP at all?" often just returns 200 ok and that's correct.
5. /metrics Is a Snapshot, Not a Log¶
/metrics exposes current values of counters and gauges — http_requests_total 48213, goroutines 142. It is read on a schedule by a monitoring system, which stores the time series. You don't call /metrics to "log an event"; you increment a counter in your code, and /metrics reports the running total whenever it's scraped. See ../metrics/junior.md for the signals themselves.
6. These Endpoints Leak Internals — Treat Them Carefully¶
/metrics reveals your traffic volumes and error rates. /version reveals your exact build (and thus its known CVEs). /debug/pprof can dump memory. None of this should be reachable by the public internet. Even as a junior, internalize: diagnostic endpoints are not public endpoints. senior.md and professional.md go deep on this; for now, know that they belong behind auth, on a separate port, or both.
Real-World Analogies¶
| Concept | Real-World Analogy |
|---|---|
| Diagnostic endpoint | The diagnostic port (OBD-II) under your car's dashboard — not for driving, but for the mechanic's scanner. |
| Liveness check | A pulse. No pulse → resuscitate (restart). |
| Readiness check | A "this register is closed" sign at a checkout lane. The cashier is fine; just don't queue here right now. |
| Startup probe | A shop's "opening soon" sign while staff set up — don't judge them as dead before they've unlocked the doors. |
| Putting the DB in a liveness check | Declaring yourself clinically dead because the building's WiFi is down. An overreaction with fatal consequences. |
| /metrics | A car's dashboard gauges (speed, RPM, fuel), read at a glance, continuously. |
| Prometheus scraping | A nurse doing rounds every 15 minutes recording everyone's vitals onto a chart. |
| Admin port | A staff-only door at the back of the shop. Customers use the front; operations happen out of sight. |
| /version | The "best before / batch number" stamp: it tells you exactly which production run this unit came from. |
Mental Models¶
1. The Service Has Two Doors¶
Picture every service as a building with two doors. The front door (your public port, e.g. :8080) is for customers — it serves /orders, /login, /search. The back door (your admin port, e.g. :9090) is for staff — /healthz, /metrics, /debug/*. Customers should never find the back door, and the back door should never be on the same street as the front. When you wire diagnostics, you're building and locking the back door.
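The two-door picture can be sketched with nothing but Python's standard library: two HTTP servers in one process, each serving its own route table. This is an illustration under stated assumptions, not a production layout; `make_handler`, `start`, and both route tables are hypothetical names, and the bodies are placeholders.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(routes):
    """Build a handler class that serves only the given path -> body routes."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = routes.get(self.path)
            if body is None:
                self.send_error(404)  # this path does not exist behind this door
                return
            payload = body.encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep the demo quiet

    return Handler


def start(host, port, routes):
    """Start a server on a background thread and return it."""
    server = ThreadingHTTPServer((host, port), make_handler(routes))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Front door: business routes only. Back door: diagnostics only.
PUBLIC_ROUTES = {"/orders": "[]"}
ADMIN_ROUTES = {"/healthz": "ok", "/metrics": "# metrics would be rendered here\n"}
```

In a real service you might call `start("0.0.0.0", 8080, PUBLIC_ROUTES)` and `start("127.0.0.1", 9090, ADMIN_ROUTES)`. Binding the admin server to localhost or an internal interface is what keeps the back door off the public street.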
2. Liveness Is "Restart Me," Readiness Is "Skip Me"¶
Don't memorize definitions — memorize the action each one triggers. Ask of any failure: "Do I want to be restarted, or do I just want to be skipped?"
- Want a restart → it's a liveness concern.
- Want to be skipped temporarily → it's a readiness concern.
If you can't honestly say "restarting fixes this," it does not belong in liveness.
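The rule of thumb is small enough to write down as code. The sketch below classifies the failure examples used on this page; `probe_for` and `FAILURE_ACTIONS` are illustrative names, not part of any library.

```python
# What the orchestrator does when each probe fails.
FAILURE_ACTIONS = {
    "liveness": "restart the process",
    "readiness": "stop routing to it, keep it running",
}


def probe_for(restart_fixes_it: bool) -> str:
    """Only failures that a restart can fix belong in liveness."""
    return "liveness" if restart_fixes_it else "readiness"


# Failure modes taken from this page, tagged with "would a restart fix it?"
examples = {
    "deadlock, every request hangs": True,
    "still loading a 2 GB model": False,
    "database pool temporarily empty": False,
    "draining before shutdown": False,
}

for failure, restart_fixes_it in examples.items():
    probe = probe_for(restart_fixes_it)
    print(f"{failure}: {probe} -> {FAILURE_ACTIONS[probe]}")
```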
3. The Probe Is a Question Asked Forever¶
Your health endpoint is not called once. It's called every few seconds for the entire life of the deployment — across every instance. So design it as something that's cheap to ask a million times, not something thorough you'd run once. Every expensive thing you put in it, you pay for continuously.
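To put rough numbers on "asked forever", here is a back-of-the-envelope calculation. The fleet size, probe period, and per-check cost are made-up inputs for illustration, not measurements.

```python
SECONDS_PER_DAY = 86_400


def probes_per_day(instances: int, period_seconds: float) -> int:
    """How many times one probe type fires across the fleet in a day."""
    return int(instances * SECONDS_PER_DAY / period_seconds)


def daily_cost_seconds(instances: int, period_seconds: float, check_cost_ms: float) -> float:
    """Total time spent inside the check per day, fleet-wide."""
    return probes_per_day(instances, period_seconds) * check_cost_ms / 1000


# Hypothetical fleet: 50 instances, probed every 5 seconds.
print(probes_per_day(50, 5))          # 864000
# If each check ran a 20 ms database query:
print(daily_cost_seconds(50, 5, 20))  # 17280.0 seconds, i.e. 4.8 hours of DB time per day
```

A handler that returns a constant makes the second number effectively zero, which is the argument for keeping the check cheap.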
4. Metrics Are Cumulative; You Read the Difference¶
A Prometheus counter only goes up. http_requests_total is 0 at boot and climbs forever. The useful number — "requests per second right now" — comes from the monitoring system subtracting two scrapes. Your job is just to count honestly and expose the total; the math happens upstream.
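The subtraction a monitoring system performs can be sketched as follows. This is a deliberately simplified model: real systems such as Prometheus also extrapolate across the query window, and `counter_rate` is a hypothetical helper name.

```python
def counter_rate(previous: float, current: float, seconds: float) -> float:
    """Per-second rate between two scrapes of a cumulative counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if current >= previous:
        increase = current - previous
    else:
        # A counter only goes up, so a drop means the process restarted
        # and began counting again from 0: everything in `current` is new.
        increase = current
    return increase / seconds


# Two scrapes 15 seconds apart: 48213, then 48513.
print(counter_rate(48213, 48513, 15))  # 20.0 requests per second
# The instance restarted between scrapes and has counted 30 requests since.
print(counter_rate(48213, 30, 15))     # 2.0
```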
The First Toolkit¶
Your day-one diagnostic toolkit is small:
- A /healthz endpoint that returns 200 ok. (The z is a Google convention to avoid clashing with a real /health business page.)
- A /readyz endpoint that returns 200 only once startup is finished and the dependencies you truly need are usable.
- A /metrics endpoint using your language's Prometheus client library.
- A /version (or /buildinfo) endpoint returning the git SHA and build time, so you can answer "did the new version actually deploy?" in one curl.
- curl, your client for all of the above. curl -i localhost:8080/healthz shows you the status code (-i prints headers).
That's the whole starter kit. Everything else in this roadmap — pprof, heap dumps, runtime toggles, continuous profiling — is depth on top of these four endpoints.
Code Examples¶
The same four endpoints — /healthz, /readyz, /metrics, /version — in four ecosystems. Read them side by side; the shape is identical everywhere.
Go — net/http, expvar, and Prometheus¶
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync/atomic"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// ready is flipped to true once startup finishes. Atomic so the probe
// goroutine and the startup goroutine don't race on it.
// (atomic.Bool needs Go 1.19 or newer.)
var ready atomic.Bool

var buildInfo = map[string]string{
	"version":   "1.4.2",
	"gitSHA":    "a1b2c3d",
	"buildTime": "2026-06-11T09:00:00Z",
}

func main() {
	mux := http.NewServeMux()

	// LIVENESS: dirt cheap. If the process can answer HTTP, it's alive.
	// Never touch the DB or downstreams here.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})

	// READINESS: 200 only once we've finished booting (and, if you truly
	// need them, only when required dependencies are usable).
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "still starting", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ready"))
	})

	// VERSION: which build is actually running?
	mux.HandleFunc("/version", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(buildInfo)
	})

	// METRICS: Prometheus exposition format.
	mux.Handle("/metrics", promhttp.Handler())

	go warmUp() // simulate slow startup

	// ListenAndServe only returns on failure (for example, port already in use).
	log.Fatal(http.ListenAndServe(":8080", mux))
}

func warmUp() {
	// ... load config, prime caches, open the DB pool ...
	ready.Store(true) // now and only now do we accept traffic
}
Go's standard library also ships expvar: importing it registers /debug/vars, a JSON blob of runtime counters (memory stats, GC, plus anything you publish). It's the zero-dependency cousin of /metrics:
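import "expvar" // importing it registers /debug/vars on http.DefaultServeMux

var ordersProcessed = expvar.NewInt("orders_processed")

// The server above uses its own mux, not the default one, so mount the handler yourself:
//   mux.Handle("/debug/vars", expvar.Handler())
// later: ordersProcessed.Add(1)
// curl localhost:8080/debug/vars -> {"cmdline": [...], "memstats": {...}, "orders_processed": 42}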
Python — Flask + prometheus_client¶
# pip install flask prometheus_client
from flask import Flask, Response
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST
import threading

app = Flask(__name__)
_ready = threading.Event()  # set once startup completes
BUILD = {"version": "1.4.2", "git_sha": "a1b2c3d", "build_time": "2026-06-11T09:00:00Z"}

requests_total = Counter("http_requests_total", "Total HTTP requests", ["path"])


@app.get("/healthz")  # LIVENESS: cheap, no dependencies
def healthz():
    return "ok", 200


@app.get("/readyz")  # READINESS: only after warm-up
def readyz():
    if not _ready.is_set():
        return "still starting", 503
    return "ready", 200


@app.get("/version")
def version():
    return BUILD, 200


@app.get("/metrics")  # Prometheus exposition
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)


def warm_up():
    # ... prime caches, open the DB pool ...
    _ready.set()


if __name__ == "__main__":
    threading.Thread(target=warm_up, daemon=True).start()
    app.run(host="0.0.0.0", port=8080)
Node.js — Express + prom-client¶
// npm i express prom-client
const express = require("express");
const client = require("prom-client");

const app = express();
let ready = false; // flipped true after warm-up
const BUILD = { version: "1.4.2", gitSha: "a1b2c3d", buildTime: "2026-06-11T09:00:00Z" };

const registry = new client.Registry();
client.collectDefaultMetrics({ register: registry }); // event-loop lag, heap, etc.

app.get("/healthz", (_req, res) => res.status(200).send("ok")); // liveness

app.get("/readyz", (_req, res) => // readiness
  ready ? res.status(200).send("ready") : res.status(503).send("starting"));

app.get("/version", (_req, res) => res.json(BUILD));

app.get("/metrics", async (_req, res) => { // metrics
  res.set("Content-Type", registry.contentType);
  res.send(await registry.metrics());
});

app.listen(8080, () => {
  setTimeout(() => { ready = true; }, 3000); // simulate slow warm-up
});
Java — Spring Boot Actuator (almost no code)¶
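The JVM's answer is Actuator: add one dependency and you get /actuator/health, /actuator/info, /actuator/metrics, and more without writing handlers. Recent Spring Boot versions expose only health over HTTP by default; the others appear once you list them in management.endpoints.web.exposure.include, as in the properties further down.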
<!-- pom.xml -->
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<!-- For /metrics in Prometheus format: -->
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
# application.properties
# Liveness vs readiness as SEPARATE health groups (Spring supports this directly):
management.endpoint.health.probes.enabled=true
management.health.livenessstate.enabled=true
management.health.readinessstate.enabled=true
# Prometheus scrape endpoint at /actuator/prometheus
management.endpoints.web.exposure.include=health,info,metrics,prometheus
# Put diagnostics on a SEPARATE port (see senior.md on why this matters):
management.server.port=9090
Now curl localhost:9090/actuator/health/liveness and .../readiness return {"status":"UP"} separately, and /actuator/prometheus is your /metrics. You wrote zero handler code.
Health vs Metrics — Don't Confuse Them¶
Juniors routinely blur these. They answer different questions for different consumers:
| | Health/readiness | /metrics |
|---|---|---|
| Question it answers | "Should I get traffic / be restarted?" | "How much / how fast / how many?" |
| Consumer | Load balancer, Kubernetes | Prometheus, dashboards, alerts |
| Response | A status code (200/503) | A block of numbers |
| Called when | Every few seconds, per instance | Every scrape interval (e.g. 15s) |
| What failure means | "Take action on this instance" | Nothing — it's just data |
| Should it be cheap? | Yes, extremely | Yes (don't compute on scrape) |
A health check that returns rich metrics is over-engineered. A /metrics endpoint that load balancers probe for routing is misused. Keep the lanes separate.
Use Cases¶
| Situation | Endpoint you reach for |
|---|---|
| Kubernetes needs to know when to restart a wedged pod. | Liveness (/healthz). |
| The load balancer needs to know which instances can take traffic. | Readiness (/readyz). |
| Your app loads a huge model and takes 40s to boot. | Startup probe (so liveness doesn't kill it mid-boot). |
| You want a dashboard of request rate and error rate. | /metrics scraped by Prometheus. |
| "Did the hotfix actually deploy, or is the old version still running?" | /version. |
| You're debugging locally and want quick runtime counters with zero deps (Go). | /debug/vars (expvar). |
| On-call needs to confirm a service is up during an incident. | curl the liveness endpoint. |
Coding Patterns¶
Pattern 1 — Separate Liveness From Readiness From Day One¶
mux.HandleFunc("/healthz", liveness) // restart me if this fails
mux.HandleFunc("/readyz", readiness) // skip me if this fails
Even if both return 200 today, having two endpoints means you can evolve them independently without re-plumbing your probes later.
Pattern 2 — The Readiness Flag¶
A single boolean/event flipped at the end of startup is the simplest correct readiness signal. Default it to not ready so you never accept traffic before you're done booting.
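A dependency-free sketch of the readiness flag, using only Python's standard library instead of a web framework. `ready`, `ProbeHandler`, and `serve_in_background` are illustrative names. The flag starts unset, so the service reports not ready until something explicitly flips it.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ready = threading.Event()  # starts unset, so the default is NOT ready


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self._reply(200, "ok")  # liveness: a constant
        elif self.path == "/readyz":
            if ready.is_set():
                self._reply(200, "ready")
            else:
                self._reply(503, "still starting")
        else:
            self._reply(404, "not found")

    def _reply(self, status, text):
        payload = text.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # don't write a log line for every probe


def serve_in_background(port=0):
    """Start the probe server on a daemon thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Call `ready.set()` as the last step of warm-up, and `ready.clear()` when shutdown begins, so the same flag also covers draining.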
Pattern 3 — Liveness Returns a Constant¶
The cheapest possible handler. If the process can run this, it's alive enough to keep. Resist the urge to "make it more useful" by adding checks — that's how you accidentally turn a DB blip into a restart storm.
Pattern 4 — Echo the Build So You Can Trust the Deploy¶
mux.HandleFunc("/version", func(w http.ResponseWriter, r *http.Request) {
json.NewEncoder(w).Encode(buildInfo) // version, gitSHA, buildTime
})
The first question in many incidents is "what's actually running?" A /version endpoint answers it in seconds instead of guessing from CI logs.
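The same pattern in Python, as a sketch that assumes the build pipeline injects the values as environment variables. The variable names `APP_VERSION`, `GIT_SHA`, and `BUILD_TIME` are hypothetical; use whatever your build system actually sets.

```python
import json
import os


def build_info(env=None) -> dict:
    """Collect build metadata, falling back to "unknown" when a value is missing."""
    env = os.environ if env is None else env
    return {
        "version": env.get("APP_VERSION", "unknown"),
        "git_sha": env.get("GIT_SHA", "unknown"),
        "build_time": env.get("BUILD_TIME", "unknown"),
    }


def version_body(env=None) -> str:
    """JSON body for a /version handler."""
    return json.dumps(build_info(env), sort_keys=True)


print(version_body({"APP_VERSION": "1.4.2", "GIT_SHA": "a1b2c3d"}))
```

Reporting "unknown" rather than failing keeps /version usable in local development, where nothing has been injected. Only whitelisted keys are returned, so the endpoint never dumps the whole environment.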
Clean Code¶
- Use the conventional paths: /healthz, /readyz, /metrics, /version. Operators and tooling expect them; don't invent /are-you-ok.
- Default readiness to false. A service that's "ready" before it finished booting is worse than one that's slow to come up.
- Keep liveness free of I/O. No DB, no downstream calls, no disk.
- Don't log on every health probe: they fire every few seconds and will drown your logs. (See ../logging/junior.md.)
- Put diagnostic handlers in one obvious place (diagnostics.go, health.py) so the next person finds them instantly.
- Never expose secrets via /version or /metrics: no DB passwords, no API keys, no full config dumps. (Big topic in senior.md.)
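One way to keep probes out of the logs is a standard `logging.Filter` attached to whichever logger writes your access lines. This is a sketch: the class name `DropProbeLogs` is made up, and it assumes the access-log message contains the request path.

```python
import logging

PROBE_PATHS = ("/healthz", "/readyz")


class DropProbeLogs(logging.Filter):
    """Drop log records whose message mentions a health-probe path."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        # Returning False tells the logging machinery to discard the record.
        return not any(path in message for path in PROBE_PATHS)
```

With Flask's development server, for example, access lines go through the `werkzeug` logger, so `logging.getLogger("werkzeug").addFilter(DropProbeLogs())` would silence them there; other servers use other logger names.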
Best Practices¶
- Two endpoints, two questions. Liveness ("restart me") and readiness ("skip me") are distinct. Implement both.
- Liveness must not depend on anything but the process itself. No databases, no caches, no other services.
- Readiness may check required dependencies — sparingly. Only the ones without which you genuinely cannot serve a single request, and even then, prefer "is the pool open?" over "run a query."
- Make checks cheap. They run forever, on every instance.
- Use a real metrics library, not hand-rolled string concatenation: prometheus_client, prom-client, Micrometer, client_golang. They get the format and escaping right.
- Expose /version. One curl should tell you the exact running build.
- Don't put diagnostics on the public internet. Even before you learn the full security story, default to a separate port or localhost binding.
- Test your health logic. A health check that always returns 200 even when the process is broken is worse than none: it lies confidently.
Edge Cases & Pitfalls¶
- The "always 200" liveness that lies. If your handler returns 200 no matter what, a wedged process is never restarted. (But don't overcorrect; see the DB-in-liveness trap below.)
- The database in the liveness check. The classic catastrophe: a brief DB outage makes liveness fail on every pod, the orchestrator restarts them all simultaneously, and now you have zero capacity plus a thundering herd of reconnects. DB belongs (cautiously) in readiness, never liveness.
- Readiness that never flips back. If you set ready = true and never set it false during shutdown, the LB keeps sending requests to a draining pod. (Graceful shutdown is a middle.md topic.)
- Forgetting the startup window. A 40-second boot plus a liveness probe that starts at second 5 means the orchestrator kills the pod before it ever comes up, producing an infinite crash loop. Use a startup probe or a generous initial delay.
- Logging every probe. Thousands of GET /healthz 200 lines per hour bury the logs that matter.
- /metrics doing work on scrape. Computing expensive values inside the metrics handler means every scrape (every 15s) runs that work. Update metrics as events happen; just report on scrape.
- Health endpoint on the public port. Now anyone on the internet can probe your internals and infer your deploy times and traffic.
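The startup-window pitfall is easy to check with arithmetic. The sketch below uses a simplified model of a liveness probe: the first probe fires after the initial delay, then one per period, and the container is killed once the failure threshold is reached. The 40-second boot and 5-second start come from the bullet above; the 10-second period and threshold of 3 are assumptions (they match Kubernetes' documented defaults for periodSeconds and failureThreshold). Probe timeouts and scheduling jitter are ignored.

```python
def seconds_until_liveness_kill(initial_delay: float, period: float, failure_threshold: int) -> float:
    """Time after container start at which the Nth consecutive failing probe lands."""
    return initial_delay + (failure_threshold - 1) * period


def survives_boot(boot_seconds: float, initial_delay: float, period: float, failure_threshold: int) -> bool:
    """True if the app is up before the probe failures reach the threshold."""
    return boot_seconds < seconds_until_liveness_kill(initial_delay, period, failure_threshold)


# Probes fail at 5 s, 15 s, 25 s: killed at 25 s, long before the 40 s boot ends.
print(seconds_until_liveness_kill(5, 10, 3))  # 25
print(survives_boot(40, 5, 10, 3))            # False -> crash loop
# A generous initial delay means the first probe arrives after boot has finished.
print(survives_boot(40, 45, 10, 3))           # True
```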
Common Mistakes¶
- Treating liveness and readiness as the same thing. They trigger opposite actions (restart vs skip). Conflating them turns recoverable issues into outages.
- Putting downstream dependencies in liveness. A dependency blip should never restart your process.
- Returning 200 for everything in readiness, so traffic arrives before the service can serve it and users get errors during every deploy.
- Hand-writing the Prometheus format. Miss one newline or escape and the whole scrape fails. Use the library.
- Exposing diagnostics publicly. /metrics and /version on the open internet hand attackers a reconnaissance map.
- No /version endpoint, so "is the fix live?" becomes a 20-minute archaeology dig through CI.
- Health checks that block. A slow check (full DB query) can time out and be read as a failure, restarting a perfectly fine pod.
- Logging or allocating heavily inside the probe handler. Multiply by "every few seconds, forever" and it adds up.
Tricky Points¶
- /healthz returning 200 does not mean "the service works." It means "the process answers HTTP." A liveness pass is a low bar by design; that's the point. Don't read more into a green liveness than it claims.
- Readiness failing is normal, not an error. During startup and shutdown, returning 503 from readiness is the correct behavior. Don't alert on it the way you'd alert on a 500.
- A counter that resets to 0 looks like a problem but is usually a restart. Prometheus counters reset when the process restarts; monitoring systems handle this, but seeing http_requests_total drop to 0 means "this instance just restarted," not "we lost data."
- The status code matters more than the body. A probe reading your endpoint cares about 200 vs 503. A pretty JSON body with "status": "ok" and an HTTP 500 will be read as unhealthy; the machine never reads your JSON.
- /debug/vars (expvar) is registered on http.DefaultServeMux just by importing the package in Go (even a blank import). It's easy to expose it accidentally if your public port serves the default mux. Know what your imports register.
- "Ready" and "live" can disagree, and that's healthy. A draining pod is live (don't restart it) but not ready (don't send it traffic). The two endpoints existing separately is what lets you express that.
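The last point can be written out as a small table in code. This is a sketch of the lifecycle described on this page (boot, warming up, serving, draining); the `Phase` enum and both function names are illustrative.

```python
from enum import Enum


class Phase(Enum):
    BOOT = "boot"
    WARMING_UP = "warming up"
    SERVING = "serving"
    DRAINING = "draining"


def liveness_status(phase: Phase) -> int:
    """The process answers HTTP in every phase, so liveness always passes."""
    return 200


def readiness_status(phase: Phase) -> int:
    """Only the serving phase should receive traffic."""
    return 200 if phase is Phase.SERVING else 503


for phase in Phase:
    print(f"{phase.value:<11} live={liveness_status(phase)} ready={readiness_status(phase)}")
```

Three of the four phases are live but not ready, which is why a single combined endpoint cannot describe them.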
Test Yourself¶
No answers — for your own honest assessment.
- In your own words, what action does a liveness failure trigger? A readiness failure? Give one realistic example of each that is not in this page.
- Add /healthz, /readyz, /metrics, and /version to a small service in your language of choice. Verify each with curl -i.
- Make /readyz return 503 for the first 5 seconds after boot, then 200. Watch it flip with a curl loop.
- Explain why putting a database query in your liveness check could turn a 20-second DB hiccup into a multi-minute outage.
- Hit /metrics twice, 10 seconds apart, while sending some traffic. Find a counter that went up. What's the per-second rate?
- Take any service you work on. Find its health endpoints. Are liveness and readiness actually separate, or is one endpoint doing both? Is the DB in the liveness path?
- (Go) Blank-import expvar, hit /debug/vars, and read the memstats block. What's the live heap size?
Tricky Questions¶
Q1: Your /healthz returns 200 but users are getting errors. How is that possible?
Liveness only proves the process answers HTTP. It says nothing about whether the business logic works — a bad config, a broken downstream, or a bug can leave the process "alive" but serving errors. Liveness is intentionally a low bar. Use readiness, metrics (error rate), and traces to catch functional failures; don't expect liveness to.
Q2: Why not just put the database check in liveness so a broken DB connection restarts the pod?
Because restarting won't fix a DB outage — the DB is the problem, not your process. Worse, the DB blip would fail liveness on every pod simultaneously, so the orchestrator restarts your entire fleet at once, and they all slam the recovering DB with reconnects. You've converted a transient dependency issue into a self-inflicted total outage. Dependency health belongs in readiness (cautiously), never liveness.
Q3: A teammate's readiness check always returns 200, even during startup. What breaks?
Traffic gets routed to the instance before it's finished booting (caches cold, pools empty, config unloaded), so the first wave of users hit errors or timeouts on every single deploy. Readiness must return 503 until warm-up completes, and should default to not ready.
Q4: Should /metrics be reachable from the public internet?
No. It exposes your request volumes, error rates, latencies, and often internal endpoint names — a reconnaissance gift to an attacker and a privacy leak about your traffic. Bind it to a separate admin port, localhost, or behind auth. This is covered in depth in senior.md and professional.md.
Q5: What's the difference between /metrics and a log line?
/metrics exposes aggregated current state (totals, gauges) read on a schedule by a monitoring system; it never grows unbounded and isn't per-event. A log line is one discrete event written when something happens. You'd increment a counter and maybe log — they serve different questions. See ../logging/junior.md and ../metrics/junior.md.
Q6: Why the z in /healthz and /readyz?
It's a Google-originated convention to avoid colliding with a real application route called /health and to signal "this is an internal/ops endpoint." It's just a naming habit — /health and /ready are equally fine; the behavior is what matters.
Cheat Sheet¶
┌─────────────────────────── DIAGNOSTIC ENDPOINTS — JUNIOR CHEAT SHEET ───────────────────────────┐
│ │
│ THE FOUR YOU START WITH │
│ /healthz liveness → 200 = "alive" fail ⇒ RESTART me │
│ /readyz readiness → 200 = "send traffic" fail ⇒ SKIP me (don't restart) │
│ /metrics Prometheus → block of numbers read every ~15s by the scraper │
│ /version build info → git SHA + build time "did the deploy land?" │
│ │
│ LIVENESS vs READINESS (memorize the ACTION, not the words) │
│ "Restarting fixes this" → LIVENESS │
│ "Just stop routing to me" → READINESS │
│ DB / downstream in LIVENESS → blip restarts the WHOLE fleet. NEVER. │
│ │
│ GOLDEN RULES │
│ • Status code IS the answer (200 / 503). The body is for humans. │
│ • Liveness must be cheap and dependency-free. │
│ • Readiness defaults to FALSE; flips true after warm-up. │
│ • Update metrics on events; only REPORT on scrape. │
│ • Diagnostics are NOT public. Separate port / localhost / auth. │
│ │
│ CURL IT │
│ curl -i localhost:8080/healthz # -i shows the status code │
│ curl localhost:9090/metrics # admin port in real deploys │
│ │
│ PER-ECOSYSTEM │
│ Go net/http + promhttp ; expvar → /debug/vars │
│ Python prometheus_client + Flask/FastAPI │
│ Node prom-client + express │
│ Java Spring Actuator (/actuator/health/{liveness,readiness}, /actuator/prometheus) │
│ │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
Summary¶
- A diagnostic endpoint exposes the service's view of itself — health, metrics, version — for operators and machines, not users.
- Liveness answers "should I be restarted?"; readiness answers "should I get traffic right now?" They trigger opposite actions. Implement both, separately.
- Never put a database or downstream dependency in a liveness check — a dependency blip would restart your whole fleet and turn a hiccup into an outage.
- The HTTP status code is the answer (200/503). The body is for humans.
- Liveness must be cheap and dependency-free; readiness may sparingly check required dependencies and should default to not ready.
- /metrics exposes cumulative numbers read on a schedule by a monitoring system; use a real client library, and update counters on events rather than computing on scrape.
- /version answers "what's actually running?" in one curl.
- Go ships expvar (/debug/vars); Spring ships Actuator with first-class liveness/readiness groups; Python and Node use prometheus_client / prom-client.
- Diagnostic endpoints leak internals and are not public. Default to a separate admin port or localhost even before you learn the full security story (senior.md, professional.md).
What You Can Build¶
- A "health-check starter" for your language: a single file exposing /healthz, /readyz, /metrics, /version that you can drop into any new service in two minutes.
- A readiness simulator: a service that's "not ready" for a configurable N seconds after boot, so you can watch a curl loop flip from 503 to 200 and feel what readiness gating does.
- A /version enricher: wire your build system (Go -ldflags, Maven resource filtering, npm build step) to inject the real git SHA and build time at compile time, and expose it.
- A "two doors" demo: one service listening on :8080 (public) and :9090 (admin), with business routes only on the first and diagnostics only on the second. Prove with curl that /metrics is unreachable on :8080.
- A probe-logger killer: middleware that suppresses access logs for /healthz and /readyz so your logs stay readable.
Further Reading¶
- Specs & conventions
  - Kubernetes, "Configure Liveness, Readiness and Startup Probes": https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
  - Prometheus exposition format: https://prometheus.io/docs/instrumenting/exposition_formats/
  - Google "kubernetes/community" health-check conventions (the origin of the z in /healthz).
- Library docs (read once, refer often)
  - Go expvar: https://pkg.go.dev/expvar · Go net/http/pprof: https://pkg.go.dev/net/http/pprof
  - prometheus_client (Python): https://github.com/prometheus/client_python
  - prom-client (Node): https://github.com/siimon/prom-client
  - Spring Boot Actuator: https://docs.spring.io/spring-boot/docs/current/reference/html/actuator.html
- Books
  - Site Reliability Engineering (Google): the health-checking and probing chapters.
Related Topics¶
- Next level up: middle.md — implementing health correctly, pprof/Actuator usage, expvar, log-level toggles, wiring k8s probes.
- Senior level: senior.md — readiness/liveness semantics, cascading failures, probe storms, separate admin port, on-demand profiling in prod.
- Professional level: professional.md — safe live profiling, dumps without OOM, fleet-wide standardization, authz, abuse/DoS of debug endpoints.
- Interview prep: interview.md — questions you'll be asked about health checks and diagnostic endpoints.
- Practice: tasks.md — hands-on labs at each level.
Sibling diagnostic topics:
- Metrics — Junior: the signals /metrics exposes (this topic is the endpoint).
- Logging — Junior: why you must not log every health probe.
- Debugging — Junior: diagnostic endpoints are debugging tools you ship with the service.
Cross-roadmap links:
- DevOps: load balancers and Kubernetes consume your health endpoints.
- container-orchestration and high-availability-patterns skills: probe wiring and failover.
Diagrams & Visual Aids¶
The Two Doors¶
┌────────────────────────────────────────┐
public internet ────► │ :8080 FRONT DOOR (customers) │
│ /orders /login /search │
│ │
ops / k8s / LB ────► │ :9090 BACK DOOR (operators) — locked │
│ /healthz /readyz /metrics /version│
└────────────────────────────────────────┘
Liveness vs Readiness → Opposite Actions¶
probe /healthz (LIVENESS) probe /readyz (READINESS)
│ │
200 │ 503 200 │ 503
▼ ▼ ▼ ▼
keep RESTART route STOP routing
running the pod traffic (but keep running)
Startup → Ready → Draining¶
boot ─────────────► warming up ─────────► serving ─────────► draining ───► exit
live: yes yes yes yes
ready: no no YES no ◄── still live!
└ don't route ──┘ └ route ─┘ └ stop routing, finish in-flight