The SPACE Framework — Interview Questions

A SPACE interview rarely asks "name the five dimensions" and stops. It asks "an exec wants to measure productivity by commits — what do you say?" and then watches whether you can explain why a single activity number is worse than no number, defend perceptual metrics as real data, and design a measurement a team would actually trust. This page is the question bank, with model answers and a note on what each question is really probing.

Table of Contents

How to Use This Page
Theme 1 — What SPACE Is and Why It Exists
Theme 2 — The Five Dimensions
Theme 3 — The Guidance: How to Apply SPACE
Theme 4 — SPACE vs DORA
Theme 5 — DevEx: Operationalizing SPACE
Theme 6 — Scenario and Judgment
Theme 7 — Measurement Validity
Rapid-Fire Round
Red Flags and Green Flags
Summary
Further Reading
Related Topics

How to Use This Page

Each question carries three things: Q (the prompt), what the interviewer is really testing, and A (a model answer at the depth a strong candidate gives). Don't memorize the answers — internalize the moves they keep returning to:

- productivity is multidimensional (one number always omits something that then gets sacrificed)
- activity is not output, output is not outcome (commits measure typing, not value)
- perceptual data is real data (a developer's report of friction is a measurement, not an opinion to discount)
- measure the team, not the individual (the unit of software delivery is the team; ranking people corrupts the signal)

Nearly every question in this bank is one of those four moves wearing a costume. The candidates who do well are the ones who name the failure mode — Goodhart, the McNamara fallacy, gaming — before reaching for a metric. SPACE is not a dashboard you install; it's a framework for choosing what to measure so the measurement doesn't backfire.

Theme 1 — What SPACE Is and Why It Exists

Q1.1 — What is the SPACE framework, in one breath, and what problem was it built to solve?

Testing: Whether you can state the thesis — productivity is multidimensional — rather than just list five words.

A. SPACE is a framework for measuring developer productivity that insists productivity is multidimensional and can't be captured by any single metric. It was published in 2021 by Nicole Forsgren, Margaret-Anne Storey, and colleagues, and the acronym names five dimensions you should sample across: Satisfaction & wellbeing, Performance, Activity, Communication & collaboration, and Efficiency & flow. The problem it was built to solve is the industry's habit of reaching for one convenient number — lines of code, commits, story points, velocity — and treating it as "productivity." Every such number is a proxy that captures one slice and silently omits the rest, so optimizing it degrades the parts you didn't measure. SPACE's core move is: pick metrics from multiple dimensions so no single proxy can be gamed into the ground.

Q1.2 — Why does a single productivity metric always fail? Make the argument, don't just assert it.

Testing: Whether you understand the mechanism of single-metric failure, not just that it's frowned upon.

A. Two reinforcing reasons. First, productivity is genuinely multidimensional: a developer who ships fast but burns out, or whose "output" is rework, or who blocks four teammates to look busy, is not productive in any useful sense. Any single number measures one axis and is blind to the others, so it can rise while the thing you actually care about falls. Second, Goodhart's law: once a metric becomes a target, people optimize the metric, not the underlying goal. Measure commits, and you get commit-splitting. Measure lines, and you get verbose code while deleting code is penalized. Measure velocity, and you get point inflation. The single metric doesn't just fail to capture productivity; it actively distorts behavior toward whatever it does capture. SPACE's defense is to triangulate: a metric is much harder to game when it's checked against metrics from other dimensions that would move the wrong way if you gamed it.

Q1.3 — Name the five dimensions and give a one-line gloss of each.

Testing: Baseline knowledge — but the gloss reveals whether you actually understand each one.

A.

- S (Satisfaction & wellbeing): how fulfilled, healthy, and unburned-out developers are; includes whether they'd recommend their team and whether the tooling frustrates them.
- P (Performance): the outcome of the work. Did it do its job? Quality, reliability, customer impact. Outcomes, not output.
- A (Activity): counts of actions, such as commits, PRs, builds, deploys, and code reviews completed. The easy-to-measure, easy-to-misuse dimension.
- C (Communication & collaboration): how well work and knowledge flow across people, including discoverability of docs, review quality, onboarding speed, and how integrated the team is.
- E (Efficiency & flow): the ability to make progress with minimal friction, interruption, and delay, reflected in uninterrupted focus time, handoff wait times, and value-stream flow.

The thing to convey is that Activity is the only purely mechanical one — the other four are about quality, experience, and outcomes — which is exactly why naïve programs over-index on Activity and miss the point.

Q1.4 — Is SPACE a set of metrics you can adopt off the shelf?

Testing: Whether you mistake a framework for a dashboard. A very common misconception.

A. No — and this is the most common misunderstanding. SPACE is not a list of metrics; it's a framework for choosing your own metrics. The five dimensions are categories to sample across, and the paper deliberately gives example metrics rather than prescribed ones, because the right metrics depend on your context, your goals, and what you can measure without distorting behavior. Treating SPACE as "track these five numbers" recreates exactly the single-dimensional, gameable dashboard it was written to prevent. The correct use is: decide what question you're trying to answer, then pick a handful of metrics spanning at least three dimensions, mixing measurement types, scoped to the team.

Theme 2 — The Five Dimensions

Q2.1 — Why is Activity called "the trap"? Walk through how it goes wrong.

Testing: The single most important nuance in the framework — that the easiest dimension to measure is the most dangerous.

A. Activity is the trap because it is the easiest to measure and the least meaningful in isolation. Commits, PRs, lines, and deploys are all sitting right there in the version-control and CI systems — no survey, no instrumentation — so a team under pressure to "show productivity" grabs them first. The failure is that activity measures motion, not progress. Twenty commits can be one feature or twenty rebases of nothing; a thousand lines can be value or boilerplate that the next person deletes. Worse, the moment Activity becomes a target, it's trivially gameable: split one commit into ten, pad PRs, avoid the high-value refactor because it shows as negative lines. And it punishes exactly the work you want — deleting code, mentoring, thinking before typing — none of which generate activity counts. So Activity isn't useless; it's a context signal that's only safe when balanced by Performance (did the activity produce a good outcome?) and Satisfaction (is the activity sustainable?). Used alone, it's the textbook Goodhart disaster.

Q2.2 — Give two or three concrete example metrics for each dimension.

Testing: Whether your knowledge is concrete or hand-wavy.

A.

- Satisfaction & wellbeing: developer satisfaction (eNPS-style "would you recommend your team"), burnout risk, satisfaction with tooling and the development environment, retention.
- Performance: change failure rate, reliability/uptime against SLOs, customer-reported quality, whether shipped features achieved their intended outcome.
- Activity: number of commits/PRs, code-review volume, deployment frequency (as a count), build/test counts, documentation written.
- Communication & collaboration: PR review thoroughness and turnaround, quality and discoverability of documentation, onboarding time for a new hire, how well work is integrated across the team, network-of-collaboration metrics.
- Efficiency & flow: amount of uninterrupted focus time, number of handoffs in a process, wait time between stages, perceived ease of getting work done, value-stream flow efficiency (active time ÷ total lead time).

Note that several of these — satisfaction, perceived ease, review quality — are perceptual, gathered by survey, and that's by design.
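The flow-efficiency ratio in the last bullet (active time ÷ total lead time) is simple enough to sketch directly. The function name and the example figures below are illustrative assumptions, not taken from the SPACE paper.

```python
from datetime import timedelta


def flow_efficiency(active: timedelta, lead: timedelta) -> float:
    """Fraction of total lead time spent actively working (0.0 to 1.0)."""
    if lead <= timedelta(0):
        raise ValueError("lead time must be positive")
    if active < timedelta(0) or active > lead:
        raise ValueError("active time must be between zero and lead time")
    # Dividing one timedelta by another yields a plain float.
    return active / lead


# Hypothetical work item: 6 hours of hands-on work inside a 48-hour lead time.
example = flow_efficiency(timedelta(hours=6), timedelta(hours=48))
print(f"{example:.1%}")  # 12.5%
```

A low ratio like this one points at waiting (queues, handoffs, reviews) rather than at how fast anyone types, which is why it sits under Efficiency & flow and not Activity.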

Q2.3 — Why is Performance the hardest dimension to measure directly, and what do teams do about it?

Testing: Whether you grasp the output-vs-outcome distinction and the attribution problem.

A. Performance is hard because it's about outcomes, not output, and outcomes are several causal steps removed from the engineer's keyboard. You can count what a developer produced (output: PRs, features), but performance asks whether what they produced did its job — was it reliable, did customers adopt it, did revenue or retention move? Three problems follow. First, attribution: a feature's business outcome depends on product, design, market timing, and luck, not just code quality, so crediting it to one engineer or even one team is dubious. Second, lag: the outcome often isn't visible for weeks or quarters after the code ships. Third, the temptation to substitute output for outcome because output is countable — which collapses Performance back into Activity. What teams do: use proxy outcome signals that are closer to the team's control — change failure rate, reliability against SLOs, defect escape rate, quality as perceived by users — while staying honest that these are proxies, and pairing them with the harder business-outcome conversation rather than pretending a commit count answers it.

Q2.4 — A team wants to add wellbeing to its metrics but says "you can't measure feelings." How do you respond?

Testing: Whether you'll defend perceptual measurement instead of caving to the "soft = unmeasurable" reflex.

A. You absolutely can measure it, and the data is real. Satisfaction and burnout are measured the way the rest of the world measures subjective states — with validated survey instruments, asked consistently, tracked as a trend, not a one-off. eNPS-style questions ("would you recommend your team as a place to work?") and burnout inventories are predictive: low satisfaction and high burnout lead, with a lag, to attrition and falling delivery performance. So wellbeing isn't a fuzzy nice-to-have; it's an early-warning leading indicator for outcomes you definitely care about. The honest caveat is that perceptual data must be collected well — anonymized, consistent wording, trended over time, never tied to individual identity or compensation — or it stops being honest. But "you can't measure feelings" is the McNamara fallacy in miniature: declaring the thing that's hard to count unimportant. The fix is to count it properly, not to drop it.
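As a rough illustration of "asked consistently, tracked as a trend," here is a minimal sketch of an eNPS-style score computed per quarter for one team. It uses the conventional Net Promoter arithmetic (share of 9-10 answers minus share of 0-6 answers); the minimum-response threshold of 5 and the sample scores are assumptions made for the example.

```python
from typing import Optional


def enps(scores: list[int], min_responses: int = 5) -> Optional[float]:
    """eNPS-style score: % promoters (9-10) minus % detractors (0-6).

    Returns None when there are too few responses to report without
    risking anonymity (the default threshold of 5 is an assumption).
    """
    if len(scores) < min_responses:
        return None
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


# Hypothetical answers (0-10 scale) from one team, two quarters in a row.
quarters = {
    "Q1": [9, 10, 8, 7, 9, 6, 10, 8],
    "Q2": [8, 9, 7, 6, 7, 5, 9, 6],
}
trend = {quarter: enps(answers) for quarter, answers in quarters.items()}
print(trend)  # {'Q1': 37.5, 'Q2': -12.5}
```

The drop between quarters is the signal. Either number on its own says little, and suppressing small samples keeps the result from pointing at individuals.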

Theme 3 — The Guidance: How to Apply SPACE

Q3.1 — What's the headline guidance for using SPACE well? Give the rules.

Testing: Whether you know the framework's actual prescriptions, not just its vocabulary.

A. Three rules, and they're the whole point of the framework:

1. Pick metrics from at least three dimensions. Never measure one; the cross-check between dimensions is what makes the metric set hard to game and honest about tradeoffs. If Activity goes up while Satisfaction goes down, you've learned something a single number would have hidden.
2. Combine different types of metric: perceptual (surveys: how people feel about the work), system (telemetry: data from the tools, like deploy frequency or build time), and workflow (process: how work moves, like wait times and handoffs). Triangulating across types guards against any one type's blind spots: telemetry misses how it feels, surveys miss what's actually happening in the system.
3. Measure at the team level, not the individual. Software is built by teams; individual metrics invite ranking, gaming, and the destruction of collaboration. SPACE is explicit that these metrics are for understanding and improving systems and teams, not for evaluating people.
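The three rules can be read as a checklist to run against any proposed metric set. The sketch below is one hypothetical way to encode that checklist. The `Metric` class, the dimension and type labels, and the classification of each example metric are all assumptions for illustration, not part of SPACE itself.

```python
from dataclasses import dataclass

DIMENSIONS = {"satisfaction", "performance", "activity", "communication", "efficiency"}
TYPES = {"perceptual", "system", "workflow"}


@dataclass(frozen=True)
class Metric:
    name: str
    dimension: str  # one of DIMENSIONS
    kind: str       # one of TYPES
    scope: str      # "team" or "individual"


def check_metric_set(metrics: list[Metric]) -> list[str]:
    """Return a list of rule violations; an empty list means the set passes."""
    problems = []
    for m in metrics:
        if m.dimension not in DIMENSIONS or m.kind not in TYPES:
            problems.append(f"{m.name}: unknown dimension or type")
    # Rule 1: at least three dimensions.
    if len({m.dimension for m in metrics}) < 3:
        problems.append("fewer than three dimensions covered")
    # Rule 2: more than one metric type.
    if len({m.kind for m in metrics}) < 2:
        problems.append("only one metric type; mix perceptual, system, workflow")
    # Rule 3: team level only.
    problems += [f"{m.name}: scoped to individuals" for m in metrics if m.scope != "team"]
    return problems


commits_only = [Metric("commits per developer", "activity", "system", "individual")]
balanced = [
    Metric("change failure rate", "performance", "system", "team"),
    Metric("lead time for changes", "efficiency", "workflow", "team"),
    Metric("developer satisfaction", "satisfaction", "perceptual", "team"),
]
print(check_metric_set(commits_only))  # three violations
print(check_metric_set(balanced))      # []
```

A passing check says only that the set has the right shape. Whether the metrics answer the question you started from is still a judgment call.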

Q3.2 — Why "at least three dimensions"? What does the third one buy you over two?

Testing: Whether you understand the anti-gaming logic, not just the rule.

A. Each additional dimension closes a gaming loophole and exposes a tradeoff the others can't see. With one dimension you're fully exposed to Goodhart — optimize it, distort everything else invisibly. With two, you catch the obvious tension (e.g., Activity up but Performance flat means the activity was busywork), but a clever optimization can still satisfy both while wrecking a third. Adding the third — typically a human dimension like Satisfaction or Flow — catches the most insidious failure: hitting your throughput and quality numbers by burning people out or destroying focus time, which looks great this quarter and collapses next quarter as attrition and slowdown arrive. Three dimensions, deliberately chosen to be in tension, make it very hard to "win" the metrics without actually improving, because gaming one almost always shows up as a regression in another. That mutual-constraint property is the entire defense.
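The mutual-constraint idea can be sketched as a small check: given the period-over-period change in each metric, list every pair where one metric improved while another regressed. The metric names, the deltas, and the function itself are hypothetical. A flagged pair is a prompt to ask "why?", not evidence of gaming.

```python
def tension_pairs(
    deltas: dict[str, float], higher_is_better: dict[str, bool]
) -> list[tuple[str, str]]:
    """Pair each metric that improved with each metric that regressed.

    deltas holds the change since the previous period, in each metric's
    own units; higher_is_better records which direction counts as good.
    """

    def improved(name: str) -> bool:
        return deltas[name] > 0 if higher_is_better[name] else deltas[name] < 0

    def regressed(name: str) -> bool:
        return deltas[name] < 0 if higher_is_better[name] else deltas[name] > 0

    ups = [name for name in deltas if improved(name)]
    downs = [name for name in deltas if regressed(name)]
    return [(up, down) for up in ups for down in downs]


# Hypothetical quarter: throughput is up, but quality and morale both slipped.
deltas = {"deploys_per_week": 4.0, "change_failure_rate": 0.05, "satisfaction": -0.6}
direction = {"deploys_per_week": True, "change_failure_rate": False, "satisfaction": True}
print(tension_pairs(deltas, direction))
```

With only the first two metrics in the set, the satisfaction regression would never surface, which is the loophole the third dimension closes.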

Q3.3 — Why is "team, not individual" so emphatic in SPACE, and what breaks when you ignore it?

Testing: Whether you understand the structural reasons, not just "it feels bad to rank people."

A. Several structural reasons, all pointing the same way. The unit of delivery is the team — value ships through collaboration, review, and integration, so most of what produces good outcomes happens between people and isn't attributable to one. Individual metrics destroy the collaboration they depend on: if I'm ranked on my commits, helping you review, mentoring a junior, or doing the unglamorous reliability work all become irrational — they help the team and hurt my number. They're trivially gameable at the individual level because one person controls their own commit count. And they punish the highest-value contributors, who often produce fewer but more leveraged artifacts (a senior who deletes 2,000 lines or unblocks four people scores terribly on activity). What breaks when you ignore this is the team's actual productivity: people optimize their personal numbers, stop helping each other, and the system metrics may even look fine while the team rots. SPACE is unambiguous that this is the wrong use of the data.

Q3.4 — Distinguish perceptual, system, and workflow metrics with an example of each, and say why you need all three.

Testing: Whether you can apply the metric-type taxonomy concretely.

A.

- Perceptual: how people experience the work, gathered by survey. Example: "How satisfied are you with the time it takes to get a code review?" Captures lived friction that no log records.
- System: telemetry emitted by the tools. Example: deployment frequency, mean build time, automated test pass rate. Objective and continuous, but blind to how it feels and to work that happens off-system.
- Workflow: how work moves through the process. Example: wait time between "PR opened" and "review started," number of handoffs, flow efficiency. Captures structural delay between the stages.

You need all three because each has a blind spot the others cover. System data says deploys are frequent but can't tell you developers dread every release. Surveys say morale is low but can't tell you which stage is the bottleneck. Workflow data finds the bottleneck but not whether fixing it would actually relieve the felt pain. Triangulating the three turns three partial pictures into one trustworthy one — which is exactly why SPACE tells you to combine types, not just dimensions.
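To make the workflow example concrete, here is a minimal sketch that computes the median wait between "PR opened" and "review started" for a team. The event layout (dicts with `opened` and `review_started` timestamps) is an assumption; real data would come from whatever your code-hosting tool exports.

```python
from datetime import datetime
from statistics import median


def review_wait_hours(events: list[dict]) -> float:
    """Median hours between 'PR opened' and 'review started' across PRs."""
    waits = [
        (e["review_started"] - e["opened"]).total_seconds() / 3600
        for e in events
        if e.get("review_started") is not None  # skip PRs not yet reviewed
    ]
    if not waits:
        raise ValueError("no reviewed PRs in the sample")
    return median(waits)


# Hypothetical team-level sample; no author field is kept on purpose.
prs = [
    {"opened": datetime(2024, 3, 4, 9, 0), "review_started": datetime(2024, 3, 4, 11, 0)},
    {"opened": datetime(2024, 3, 4, 10, 0), "review_started": datetime(2024, 3, 5, 10, 0)},
    {"opened": datetime(2024, 3, 5, 14, 0), "review_started": datetime(2024, 3, 5, 20, 0)},
    {"opened": datetime(2024, 3, 6, 9, 0), "review_started": None},
]
wait = review_wait_hours(prs)
print(wait)  # 6.0
```

Skipping PRs that are still waiting biases the number downward, so a fuller version would report those separately. The median is used here so a single stuck PR does not dominate the figure.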

Theme 4 — SPACE vs DORA

Q4.1 — How do SPACE and DORA relate? Are they competitors?

Testing: The most common confusion in this whole area — whether you see them as complementary, with different scopes.

A. They're complementary, not competing, and they operate at different scopes. DORA measures software delivery performance: four key metrics (deployment frequency, lead time for changes, change failure rate, failed-deployment recovery time) that quantify how well an organization ships and operates software. It's tightly focused on the delivery pipeline and its outcomes. SPACE is the broader framework for developer productivity as a whole, of which delivery is only one part. It explicitly includes dimensions DORA doesn't touch, namely Satisfaction & wellbeing and Communication & collaboration (the human side), and it widens Efficiency & flow beyond pipeline lead time to focus time and friction. The clean way to say it: DORA tells you how well you deliver; SPACE tells you how productive and healthy your engineering is, of which delivery is one part. They share authorship and DNA (Nicole Forsgren is central to both), and they nest neatly: the DORA metrics are excellent candidates for SPACE's Performance, Activity, and Efficiency & flow dimensions, and a mature program runs DORA for delivery while using SPACE to make sure it isn't hitting delivery numbers by burning people out.
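For readers who have not computed the four DORA metrics before, the sketch below shows one simple way to derive them from deployment records. The `Deployment` record shape, the use of medians, and the sample data are assumptions; DORA's own surveys and tooling define these measures with more nuance.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    recovered_at: Optional[datetime] = None


def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def four_keys(deploys: list[Deployment], period_days: int) -> dict[str, Optional[float]]:
    """Team-level delivery metrics for one reporting period."""
    if not deploys:
        raise ValueError("no deployments in the period")
    failures = [d for d in deploys if d.failed]
    recoveries = [_hours(d.deployed_at, d.recovered_at) for d in failures if d.recovered_at]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_hours": median(_hours(d.committed_at, d.deployed_at) for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        # None (not zero) when nothing failed, so "no data" is not read as "instant recovery".
        "median_recovery_hours": median(recoveries) if recoveries else None,
    }


# Hypothetical two-day window with four deployments, one of which failed.
sample = [
    Deployment(datetime(2024, 3, 4, 9), datetime(2024, 3, 4, 11)),
    Deployment(datetime(2024, 3, 4, 10), datetime(2024, 3, 4, 14)),
    Deployment(datetime(2024, 3, 5, 9), datetime(2024, 3, 5, 15), failed=True,
               recovered_at=datetime(2024, 3, 5, 16, 30)),
    Deployment(datetime(2024, 3, 5, 10), datetime(2024, 3, 5, 18)),
]
keys = four_keys(sample, period_days=2)
print(keys)
```

All four numbers come from system data alone, which is the point of the comparison: nothing here says how the team feels about shipping.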

Q4.2 — Where exactly do the DORA metrics land inside the SPACE dimensions?

Testing: Whether your "they're complementary" answer is concrete or just a slogan.

A. They map cleanly. Deployment frequency is an Activity count (and an organizational throughput signal). Lead time for changes is Efficiency & flow — it's literally the flow time from commit to production. Change failure rate is Performance — an outcome/quality measure of whether what you shipped worked. Failed-deployment recovery time is Performance too (reliability/resilience outcome). What's striking is what DORA doesn't cover in SPACE terms: there's no DORA metric for Satisfaction & wellbeing and none for Communication & collaboration. That gap is precisely the argument for using both — DORA gives you a battle-tested, benchmarked read on delivery (three of five dimensions, partially), and SPACE reminds you to also instrument the human dimensions DORA is silent on, so you don't optimize delivery at the cost of the people doing it.
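The mapping in this answer is small enough to write down as data, which makes the coverage gap explicit. The dictionary below restates the answer's own mapping; only the variable names are invented.

```python
SPACE = [
    "Satisfaction & wellbeing",
    "Performance",
    "Activity",
    "Communication & collaboration",
    "Efficiency & flow",
]

# The mapping as stated in the answer above.
DORA_TO_SPACE = {
    "deployment frequency": "Activity",
    "lead time for changes": "Efficiency & flow",
    "change failure rate": "Performance",
    "failed-deployment recovery time": "Performance",
}

covered = set(DORA_TO_SPACE.values())
gaps = [dimension for dimension in SPACE if dimension not in covered]
print(gaps)  # ['Satisfaction & wellbeing', 'Communication & collaboration']
```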

Q4.3 — When would you reach for DORA, and when for SPACE?

Testing: Practical judgment about scope and purpose.

A. Reach for DORA when the question is specifically "how good is our software delivery, and how do we compare?" — it's narrow, well-benchmarked, mostly system-measurable, and gives you industry reference points (elite/high/medium/low). It's the right starting instrument for a team improving its delivery pipeline because it's concrete and hard to argue with. Reach for SPACE when the question is the broader one — "is our engineering organization actually productive and sustainable?" — which forces you to look beyond delivery to wellbeing, collaboration, and flow. In practice you don't choose: you use DORA as the delivery-performance core and embed it inside a SPACE-shaped program so you're also watching the dimensions DORA can't see. A common and sensible rollout is to start with DORA (tractable, benchmarked), then broaden to SPACE once leadership wants a fuller, less gameable picture.

Q4.4 — Both DORA and SPACE come largely from Nicole Forsgren. Does that mean SPACE supersedes DORA?

Testing: Whether you understand the historical and conceptual relationship rather than assuming "newer = replacement."

A. No — newer doesn't mean replacement. DORA (popularized by Accelerate, 2018) and SPACE (2021) come from overlapping research lineages, and SPACE is best read as a generalization that contextualizes DORA, not an obsolescence of it. DORA remains the gold standard for the specific job of measuring and benchmarking delivery performance, with years of cross-industry data behind it. SPACE zooms out and says "delivery is one dimension of productivity; here's the full set, and here are the rules for measuring any of them without backfiring." So the relationship is nested, not sequential: SPACE is the wider lens; DORA is the sharpest instrument for one important region inside that lens. A team that drops DORA "because SPACE is newer" loses the benchmarked delivery signal and gains nothing.

Theme 5 — DevEx: Operationalizing SPACE

Q5.1 — What is DevEx (Developer Experience) and how does it relate to SPACE?

Testing: Whether you know DevEx is the operational, lived-experience layer beneath SPACE's measurement framing.

A. DevEx is the framework — from Abi Noda, Margaret-Anne Storey, Nicole Forsgren, and Michaela Greiler (2023) — that describes what actually shapes developers' day-to-day experience and therefore their productivity, organized around three core factors: feedback loops, cognitive load, and flow state. Where SPACE tells you how to measure productivity across dimensions, DevEx tells you what to improve to move those measurements — it's the operational, lived-experience layer. The connection is direct: improving the three DevEx factors moves SPACE's dimensions. Faster feedback loops improve Efficiency & flow and Satisfaction; lower cognitive load improves Flow and Performance; protected flow state improves Efficiency, Satisfaction, and ultimately delivery. So DevEx and SPACE are partners: SPACE is the measurement framework, DevEx is the diagnostic-and-intervention framework that explains why the numbers are what they are and what to do about them.

Q5.2 — Explain feedback loops, cognitive load, and flow state, and why each matters for productivity.

Testing: Real understanding of the three DevEx factors, not just naming them.

A.

- Feedback loops: the speed and quality of responses to a developer's actions, such as how fast tests run, how long CI takes, how quickly a code review comes back, and how soon you learn a change worked in production. Slow loops are the silent productivity killer because they force context switching: you start something else while waiting, then pay the cost of reloading the original context. Tightening loops is often the highest-ROI DevEx investment.
- Cognitive load: the mental effort required to get work done, meaning how much you must hold in your head to make a change. It's inflated by complex codebases, poor or missing documentation, confusing tooling, and unclear ownership. High cognitive load slows everything and burns people out; reducing it (clear docs, good abstractions, golden paths) is a force multiplier.
- Flow state: the deep-focus condition where productive work actually happens. It requires uninterrupted time; it's destroyed by frequent meetings, interruptions, and the fragmentation that slow feedback loops cause. Protecting flow (focus blocks, async communication, fewer handoffs) directly raises both output and satisfaction.

Each matters because together they explain the mechanism behind productivity: friction in any of the three shows up as lower flow, more frustration, and worse delivery — which is exactly what SPACE measures.

Q5.3 — A leader asks, "We measured DevEx — now what do we actually change?" What's your answer?

Testing: Whether you can turn the DevEx model into concrete interventions.

A. You translate the three factors into concrete levers. For feedback loops: cut CI time (parallelize, cache, split test suites), make local test runs fast, set and enforce a code-review SLA so PRs don't rot, and shorten the path from merge to production signal. For cognitive load: invest in documentation discoverability, build "golden paths"/paved roads so the common case is trivial, clarify service ownership, and pay down the complexity that forces people to hold too much in their heads. For flow state: protect focus time with no-meeting blocks, push communication async, reduce handoffs and wait states in the workflow, and resist interruption-driven culture. The discipline is to let the measurements point at the worst factor first — if surveys say the build is the top frustration and workflow data confirms long CI waits, you've got a prioritized, evidence-backed target rather than a guess. DevEx is valuable precisely because it converts "developers seem unhappy" into a specific, fixable bottleneck.
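The "worst factor first, but only when corroborated" discipline can be sketched as a tiny selection rule. Everything here is hypothetical: the survey shares, the corroboration flags, and the function name are illustrative, and in practice deciding whether telemetry corroborates a complaint is itself a judgment.

```python
from typing import Optional


def pick_target(frustration: dict[str, float], corroborated: dict[str, bool]) -> Optional[str]:
    """Choose the DevEx factor to fix next.

    frustration: share of survey respondents naming the factor as their
    top frustration. corroborated: whether system or workflow data backs
    the complaint. Returns None when nothing is corroborated yet.
    """
    candidates = [factor for factor in frustration if corroborated.get(factor, False)]
    if not candidates:
        return None
    return max(candidates, key=lambda factor: frustration[factor])


# Hypothetical results: the build tops the survey and CI wait data agrees.
survey = {"feedback loops": 0.52, "cognitive load": 0.31, "flow state": 0.17}
backed_by_data = {"feedback loops": True, "cognitive load": False, "flow state": True}
target = pick_target(survey, backed_by_data)
print(target)  # feedback loops
```

An uncorroborated complaint is not dismissed by this rule. It is a cue to go and find the missing system or workflow data before committing engineering time.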

Q5.4 — How does DevEx avoid the trap of being just another vanity dashboard?

Testing: Whether you carry the anti-gaming, perceptual-validity thinking into DevEx too.

A. By being grounded in perceptual data tied to specific, fixable friction, and by being used diagnostically rather than as a scoreboard. DevEx surveys ask developers directly about feedback loops, cognitive load, and flow — lived experience, which is the most reliable read on actual day-to-day productivity and is predictive of outcomes. It avoids vanity by (1) combining that perceptual data with system and workflow data (the SPACE metric-type rule again), so a complaint is corroborated by telemetry before you act; (2) staying at the team/system level, never ranking individuals; and (3) closing the loop — you measure, you fix the worst friction, you re-measure to confirm it moved. The failure mode to avoid is treating a DevEx score as a number to maximize for its own sake; the correct use is as a diagnostic that points at what to fix next. Used that way, it's the opposite of vanity — it's the most actionable layer in the whole stack.

Theme 6 — Scenario and Judgment

Q6.1 — An executive says, "I want to measure developer productivity by commits per developer." What do you say?

Testing: The flagship scenario — can you push back firmly, correctly, and constructively, without being dismissive of the underlying need?

A. I'd start by honoring the real goal — understanding and improving productivity is legitimate — and then explain why commits specifically will actively work against it. Commits are pure Activity: they measure typing, not value. Three concrete failures. First, it's instantly gameable — split one commit into ten and your "productivity" multiplies with zero added value; the moment people know it's measured, the number decouples from reality (Goodhart). Second, it punishes your best work and your best people — deleting dead code, mentoring, careful design, and unblocking teammates all generate few or zero commits while being high-leverage; a senior who removes 2,000 lines scores as your least productive engineer. Third, it's individual, which corrodes the collaboration that actually drives delivery. Then I'd offer the constructive alternative: SPACE — pick a handful of metrics across at least three dimensions (e.g., delivery throughput and change failure rate and developer satisfaction), mix perceptual with system data, and scope it to teams. That gives a picture that's hard to game and tied to outcomes, instead of a single number that will be gamed by Friday and will quietly degrade quality and morale.

Q6.2 — Then how would you measure a team's productivity well? Give me your actual approach.

Testing: Whether you can design a real, defensible measurement program, not just critique the bad one.

A. I'd build a small SPACE-shaped scorecard, deliberately spanning dimensions and types, scoped to the team:

1. Start from the question. What do we actually want to improve: delivery speed, quality, sustainability? That decides the metrics.
2. Sample at least three dimensions in tension. A defensible default: Performance (change failure rate / reliability against SLOs), Efficiency & flow (lead time for changes, or focus-time/handoff data), and Satisfaction & wellbeing (developer satisfaction and burnout via survey). Optionally add a Communication signal (review turnaround). I deliberately include a human dimension so we can't "win" by burning people out.
3. Mix metric types: system telemetry (deploy frequency, build time), workflow (wait times), and perceptual (surveys). Triangulate.
4. Keep it team-level and improvement-oriented. Never rank individuals; the data is for the team to see its own system and improve it. I'd be explicit, loudly, that it's not for performance reviews.
5. Treat metrics as conversation-starters, not verdicts. A number that moves prompts "why?", not an automatic reward or punishment.
6. Re-measure and iterate. Watch trends, kill metrics that get gamed, and confirm interventions actually moved the perceptual numbers.

That's an honest, hard-to-game, outcome-linked program — the opposite of a single activity count.

Q6.3 — You discover your team's activity dashboard is being used by management to rank and stack people. What do you do?

Testing: Ethical and organizational judgment under pressure — the situation SPACE most explicitly warns against.

A. This is the exact misuse SPACE warns against, and it's actively harmful, so I'd act. First, name the harm concretely to the people doing it: ranking on activity metrics is statistically meaningless (activity isn't output or outcome), it's trivially gamed, and — most damaging — it destroys the collaboration the team depends on, because helping a teammate now lowers your own rank. It will reliably produce worse outcomes while looking precise, and it punishes exactly the senior, high-leverage people you most want to keep. Second, offer the correct alternative, not just a complaint: move to team-level SPACE metrics used for improvement, and decouple measurement entirely from individual evaluation. Third, if the ranking continues anyway, treat the dashboard itself as the liability: I'd push to stop exposing per-individual activity data, because a metric that's being weaponized is worse than no metric — its existence guarantees the misuse. Throughout, I'd frame it not as "be nice to engineers" but as "this measurement is invalid for the purpose it's being used for and will degrade the very thing you're trying to improve." That's an argument leaders can actually hear.

Q6.4 — Leadership wants "one number" for an executive dashboard. They won't accept five. How do you handle the constraint?

Testing: Pragmatism — can you serve a real organizational constraint without betraying the framework?

A. I'd push back on the framing while meeting the underlying need, because the underlying need (a quick read for executives) is legitimate even though "one raw metric" is the wrong solution. Options, in order of preference. First, negotiate to a small panel: three or four metrics across dimensions fit on one slide and are barely more cognitive load than one number, while being vastly more honest; most "we need one number" requests actually mean "we need it to fit on a slide." Second, if they truly insist on a single headline, make it one that's hard to game: lead the dashboard with delivery performance, but require it to be shown alongside guardrail metrics (change failure rate, developer satisfaction) so the headline can't be celebrated while a guardrail collapses. Third and non-negotiable: whatever the headline, I'd refuse to make it an individual activity metric, and I'd attach a one-line caveat that it's a team improvement signal, not a ranking tool. The goal is to give executives a fast read without handing them a single gameable proxy that will distort the whole org.

Q6.5 — A team's velocity (story points) is up 40% this quarter. The VP is delighted. What's your read?

Testing: Whether you instinctively triangulate and distrust a lone activity-flavored metric that's "good news."

A. I'd be skeptical, precisely because it's a single number moving in the convenient direction. Velocity is essentially an Activity measure dressed as output, and it's one of the most gameable metrics there is — point inflation, splitting stories, sandbagging estimates, all raise velocity with zero added value, and the incentive to do so grows the moment leadership celebrates it (Goodhart, live). So 40% up tells me almost nothing on its own; I'd immediately triangulate against other dimensions. Did Performance hold — change failure rate, defect escape, reliability? Did Satisfaction hold, or did the team sprint into burnout? Did Flow/Efficiency improve or did people just work longer? If velocity is up and quality held and satisfaction held, that's a genuine improvement worth understanding and sustaining. If velocity is up while change failure rate climbed and the team is exhausted, the "win" is borrowed against next quarter. The senior move is to refuse to read a lone activity-flavored metric as good news, and to ask what the other dimensions did before celebrating.

Theme 7 — Measurement Validity¶

Q7.1 — Are perceptual (survey) metrics actually valid? Defend or refute.¶

Testing: The deepest conceptual point in the framework — whether you treat developer self-report as real data.

A. They're valid — and arguing otherwise misunderstands measurement. Perceptual data is real data. Subjective states like satisfaction, perceived productivity, and felt friction are measured the way every serious field measures subjective states: with validated instruments, asked consistently, analyzed as trends. Crucially, they're often the most predictive signal available: a developer's perception of their own productivity correlates strongly with — and frequently leads — objective outcomes, and self-reported burnout predicts attrition and delivery decline before the system metrics turn. There's also a coverage argument: perceptual metrics capture things no log can — "the build is so slow I dread shipping," "I can never find the docs I need" — which are exactly the productivity bottlenecks. The legitimate caveats are about method, not validity: surveys must be well-designed (validated questions, consistent wording), anonymized, trended rather than snapshotted, and never tied to individual compensation. Done right, perceptual metrics are not a soft consolation for what you can't instrument — they're frequently the strongest and earliest signal you have.

Q7.2 — What is the McNamara fallacy, and how does it apply to developer-productivity measurement?¶

Testing: Whether you know the canonical failure mode for soft dimensions and can apply it.

A. The McNamara fallacy is the error of making decisions based only on what's easily quantifiable, while progressively dismissing what's hard to measure — first ignoring the unmeasured, then declaring it unimportant, then assuming it doesn't exist. It's named for Robert McNamara's reliance on body counts in Vietnam while the decisive but unquantified factors went unweighted. In developer-productivity terms it's pervasive and dangerous: it's exactly why organizations grab commits, lines, and velocity — they're easy to count — and ignore satisfaction, collaboration quality, and flow because those are "soft." The fallacy then completes itself: "we can't put a number on developer happiness, so let's not factor it in," which slides into "so it must not matter." SPACE is in large part a direct rebuttal to the McNamara fallacy: it insists the hard-to-measure human dimensions (Satisfaction, Communication, Flow) are first-class, and that the right response to "this is hard to measure" is to measure it well — with perceptual instruments — not to drop it. Naming this fallacy unprompted in an interview signals you understand the framework's philosophical core, not just its acronym.

Q7.3 — How do you run developer surveys so the data is trustworthy?¶

Testing: Practical rigor — whether you can make perceptual measurement actually work.

A. The validity of perceptual data lives or dies on method, so: use validated, consistent questions — don't reword them quarter to quarter, or you can't compare trends; reuse instruments with research behind them (DevEx, SPACE example questions) rather than inventing wording. Anonymize, and make the anonymity credible — if people fear attribution, they answer politically and the data is worthless. Track trends, not snapshots — a single reading is noise; the signal is the direction over time and the deltas after you change something. Keep it short and regular to avoid fatigue — a focused recurring pulse beats an annual monster survey. Never tie results to individual evaluation or compensation — the instant a survey affects someone's review, it stops measuring reality and starts measuring what people think you want to hear (Goodhart again, on the survey itself). And close the loop: show the team what changed because of their answers, or response rates and honesty both collapse. Done this way, surveys are a rigorous instrument; done carelessly, they're theater.
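Two of these rules, credible anonymity and trends over snapshots, translate directly into how responses are aggregated. The Python sketch below assumes 1 to 5 Likert answers to one question asked with identical wording each period; the minimum group size of 5 and the sample data are hypothetical, not a standard.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

MIN_RESPONSES = 5  # assumed anonymity threshold; pick one your team actually trusts


def team_score(responses: List[int]) -> Optional[float]:
    """Mean of 1-5 Likert answers, or None when too few people answered to stay anonymous."""
    if len(responses) < MIN_RESPONSES:
        return None  # suppress rather than report a value that could identify people
    return round(float(mean(responses)), 2)


def trend(
    responses_by_period: Dict[str, List[int]],
) -> List[Tuple[str, Optional[float], Optional[float]]]:
    """Return (period, team score, delta vs. last reported score) in period order."""
    rows = []
    last_reported: Optional[float] = None
    for period, responses in responses_by_period.items():
        score = team_score(responses)
        if score is None or last_reported is None:
            delta = None
        else:
            delta = round(score - last_reported, 2)
        rows.append((period, score, delta))
        if score is not None:
            last_reported = score
    return rows


# Hypothetical pulse results for one question; no respondent identifiers are stored.
pulse = {
    "2024-Q1": [3, 3, 4, 2, 3, 3],
    "2024-Q2": [4, 3],              # too few answers: suppressed, not reported
    "2024-Q3": [4, 4, 3, 4, 4, 5],
}
for row in trend(pulse):
    print(row)
```

The input holds only team-level answer lists, so nothing in the pipeline can be joined back to an individual's review, and a suppressed period is skipped when computing the next delta rather than treated as zero.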

Q7.4 — Even a perfect SPACE program can be gamed. Where are the residual risks, and how do you mitigate them?¶

Testing: Intellectual honesty — whether you'll acknowledge the framework's limits rather than oversell it.

A. Yes — no metric system is immune, and pretending otherwise is a red flag. Residual risks: (1) the cross-dimensional defense weakens if your dimensions aren't truly in tension — pick three metrics that all move together and you've rebuilt a single-dimensional system in disguise; mitigate by deliberately including a guardrail/human dimension that would regress under gaming. (2) Surveys can be gamed or go political if anonymity or decoupling from evaluation slips; mitigate with the method discipline above. (3) Metrics drift into targets over time even when introduced as diagnostics — the org starts chasing the number; mitigate by periodically reviewing whether each metric is still informative or now gamed, and retiring the gamed ones. (4) Local optimization — a team improves its own SPACE numbers in ways that hurt the broader system; mitigate by including some cross-team/flow signals. The honest framing is that SPACE reduces gameability dramatically by triangulating dimensions and types, but the real safeguard is cultural: using the data for learning and improvement rather than reward and punishment. The framework makes gaming hard; the culture makes it pointless. A candidate who says "SPACE solves Goodhart" doesn't understand Goodhart; one who says "SPACE mitigates it and culture finishes the job" does.
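Risk (1) can be screened for, at least crudely: if the metrics on a panel have historically moved in lockstep, they carry one dimension's worth of information. The Python sketch below flags highly correlated pairs using a hand-written Pearson correlation; the 0.9 threshold, the metric names, and the sample series are hypothetical, and with only a handful of periods the result is a prompt for discussion, not a statistical finding.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def redundant_pairs(
    series: Dict[str, Sequence[float]], threshold: float = 0.9
) -> List[Tuple[str, str, float]]:
    """Pairs of metrics whose history is so correlated that they act as one signal."""
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(series.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 2)))
    return flagged


# Hypothetical team-level history over six periods.
history = {
    "commits": [100, 120, 130, 150, 170, 180],
    "pull_requests": [20, 25, 26, 31, 34, 37],
    "satisfaction_score": [4.0, 4.1, 3.6, 4.2, 3.5, 3.9],
}
print(redundant_pairs(history))
```

History can only reveal redundancy. Whether a guardrail would actually regress under gaming is a judgment about incentives, which is why the answer ends on culture rather than on a better formula.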

Rapid-Fire Round¶

Short questions to check breadth. One or two sentences each.

Q: What does SPACE stand for? A: Satisfaction & wellbeing, Performance, Activity, Communication & collaboration, Efficiency & flow.
Q: Who created it and when? A: Nicole Forsgren, Margaret-Anne Storey, and colleagues, in a 2021 ACM Queue article.
Q: The single most important rule? A: Use metrics from at least three dimensions — never a single number.
Q: Why is Activity dangerous? A: It's the easiest to measure and the most gameable; it counts motion, not progress or outcomes.
Q: Output vs outcome in one line? A: Output is what you produced (PRs, features); outcome is whether it did its job — Performance is about outcomes.
Q: The three metric types? A: Perceptual (surveys), system (telemetry), workflow (process) — combine them.
Q: Team or individual? A: Team — individual metrics invite ranking, gaming, and the death of collaboration.
Q: How does SPACE relate to DORA? A: Complementary — DORA measures delivery; SPACE is the broader productivity framework that also covers wellbeing and flow.
Q: Where do DORA's metrics fit in SPACE? A: Deployment frequency = Activity, lead time = Efficiency/flow, change failure rate & recovery = Performance.
Q: The three DevEx core factors? A: Feedback loops, cognitive load, flow state.
Q: DevEx vs SPACE in one line? A: SPACE is how to measure productivity; DevEx is what to improve to move it.
Q: Is perceptual data valid? A: Yes — it's real, often the most predictive signal, and captures friction no log can.
Q: What's the McNamara fallacy? A: Deciding only on what's easily measured and dismissing the hard-to-measure as unimportant — the trap SPACE rebuts.
Q: Why does a single metric fail? A: Productivity is multidimensional, and Goodhart's law means a lone target gets gamed instead of achieved.
Q: One-line answer to "measure productivity by commits"? A: No — that's gameable Activity that punishes your best people; use SPACE across three-plus dimensions instead.
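Several of these answers (at least three dimensions, mixed metric types, team level) work as a checklist that a proposed panel either passes or fails. The Python sketch below encodes that checklist. The `Metric` class, the short dimension labels, the "at least two metric types" threshold, and the classification of the DORA metrics as system-type are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence

DIMENSIONS = {"satisfaction", "performance", "activity", "communication", "efficiency"}
METRIC_TYPES = {"perceptual", "system", "workflow"}


@dataclass(frozen=True)
class Metric:
    name: str
    dimension: str         # one of DIMENSIONS
    metric_type: str       # one of METRIC_TYPES
    unit_of_analysis: str  # "team" or "individual"


def panel_problems(panel: Sequence[Metric]) -> List[str]:
    """Return the reasons a panel breaks the guidance; an empty list means it passes."""
    problems = []
    for m in panel:
        if m.dimension not in DIMENSIONS:
            problems.append(f"{m.name}: unknown dimension '{m.dimension}'")
        if m.metric_type not in METRIC_TYPES:
            problems.append(f"{m.name}: unknown metric type '{m.metric_type}'")
        if m.unit_of_analysis != "team":
            problems.append(f"{m.name}: measured per individual, not per team")

    dimensions = {m.dimension for m in panel} & DIMENSIONS
    if len(dimensions) < 3:
        problems.append(f"{len(dimensions)} dimension(s) covered; use at least three")

    types = {m.metric_type for m in panel} & METRIC_TYPES
    if len(types) < 2:
        problems.append(f"{len(types)} metric type(s) used; mix perceptual, system, workflow")
    return problems


# DORA metrics placed into SPACE dimensions as in the rapid-fire answer above.
dora_only = [
    Metric("deployment_frequency", "activity", "system", "team"),
    Metric("lead_time_for_changes", "efficiency", "system", "team"),
    Metric("change_failure_rate", "performance", "system", "team"),
]
survey = Metric("developer_satisfaction", "satisfaction", "perceptual", "team")
commit_ranking = [Metric("commits_per_engineer", "activity", "system", "individual")]

print(panel_problems(dora_only))
print(panel_problems(dora_only + [survey]))
print(panel_problems(commit_ranking))
```

Under these assumptions a DORA-only panel covers three dimensions but fails on metric types, which mirrors the point that SPACE adds the perceptual, human side that delivery telemetry alone omits.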

Red Flags and Green Flags¶

What interviewers infer from how you answer, not just whether you're right.

Red flags:
- Reciting the five letters but unable to explain why one metric fails (no Goodhart, no multidimensionality).
- Treating SPACE as an off-the-shelf dashboard: "track these five numbers."
- Defending commits/lines/velocity as productivity measures, or not flagging Activity as the trap.
- Dismissing surveys as "soft" / "you can't measure feelings", which is committing the McNamara fallacy live.
- Endorsing individual ranking, or not objecting when a scenario weaponizes it.
- Calling SPACE a replacement for DORA, or unable to relate the two.
- Claiming SPACE "solves" gaming with no acknowledgment of residual risk or culture.

Green flags:
- Naming the failure mode (Goodhart, the McNamara fallacy, gaming) before reaching for a metric.
- Defending perceptual data as real, predictive, and methodologically rigorous.
- Insisting on three-plus dimensions in tension, and mixing metric types, unprompted.
- Holding the team-not-individual line and explaining the structural reasons.
- Framing DORA and SPACE as nested and complementary, mapping DORA into SPACE concretely.
- Using DevEx (feedback loops, cognitive load, flow) to explain what to fix, not just what to measure.
- Pushing back on "one number" constructively by offering a small guardrailed panel instead.
- Admitting the limits and pointing to culture as the real safeguard.

Summary¶

The bank reduces to four moves, repeated in costumes: productivity is multidimensional, activity ≠ output ≠ outcome, perceptual data is real data, and measure teams, not individuals. Name the failure mode first; the metric follows.
SPACE = Satisfaction & wellbeing, Performance, Activity, Communication & collaboration, Efficiency & flow — a framework for choosing metrics, not a fixed dashboard. The guidance: pick ≥3 dimensions in tension, combine perceptual/system/workflow types, and stay at the team level.
Activity is the trap (easy to measure, gameable, punishes high-leverage work); Performance is the hardest (outcomes, not output — attribution and lag make it slippery, so teams use proxies honestly).
SPACE vs DORA: complementary, nested. DORA measures delivery and is well-benchmarked; SPACE is the broader productivity lens that adds the human dimensions DORA omits (wellbeing, collaboration). DORA's four metrics map into SPACE's Activity, Efficiency/flow, and Performance.
DevEx (feedback loops, cognitive load, flow state) operationalizes SPACE — it explains what to improve to move the numbers, grounded in perceptual data used diagnostically.
Validity: perceptual metrics are valid, often the most predictive, and capture friction no telemetry can — the right answer to "hard to measure" is to measure it well (rebutting the McNamara fallacy), not to drop it. No system is un-gameable; SPACE reduces gaming via triangulation, and a learning-not-ranking culture finishes the job.