Taint & Data-Flow Analysis — Senior Level¶

Roadmap: Static Analysis → Taint & Data-Flow Analysis

Lattices, monotone frameworks, the precision/scalability frontier, and why no tool can be both sound and complete.

Table of Contents¶

Introduction
Prerequisites
Glossary
Core Concept 1 — The Monotone Framework: Lattice + Transfer + Fixpoint
Core Concept 2 — A Worked One-Step Lattice Example
Core Concept 3 — Forward/Backward, May/Must, and Taint's Place in the Grid
Core Concept 4 — The Precision Dimensions and Their Cost
Core Concept 5 — Soundness vs Completeness and Rice's Theorem
Core Concept 6 — CodeQL as a Data-Flow Engine
Core Concept 7 — Interprocedural Mechanics: Call Graphs, Summaries, k-CFA
Core Concept 8 — The Engines Compared
Real-World Examples
Mental Models
Common Mistakes
Test Yourself
Cheat Sheet
Summary
Further Reading
Related Topics

Introduction¶

Focus: the formal theory — lattices, monotone transfer functions, fixpoint iteration — and the engineering consequences: the precision dimensions, the soundness/completeness impossibility, and how CodeQL turns all of it into queryable facts.

A senior owns the rule set and must reason about what the tool can and cannot prove. That requires the actual theory, not the metaphor. Data-flow analysis is an instance of abstract interpretation: instead of running the program on concrete values, you run it on abstract values drawn from a lattice, using monotone transfer functions, and iterate to a fixpoint that over-approximates every possible execution. Once you have that frame, every practical knob — flow-, path-, context-, field-sensitivity — and every limitation — Rice's theorem, soundness vs completeness — falls into place.

Prerequisites¶

The middle tier: CFG, transfer functions, intra/interprocedural, summaries, Semgrep taint mode.
Comfort with basic discrete math: partial orders, sets, monotonicity.
Exposure to at least one SAST tool's query/rule language (CodeQL or Semgrep).
Helpful: the Formal Methods & Verification section for the proof-theoretic neighbors.

Glossary¶

Term	Meaning
Lattice	A partially ordered set where every pair of elements has a least upper bound (join, ⊔) and greatest lower bound (meet, ⊓).
Join (⊔)	The merge operator at CFG confluence points; ⊔ = union for may-analyses.
Monotone	A transfer function `f` such that `x ⊑ y ⇒ f(x) ⊑ f(y)` — more info in, no less info out.
Fixpoint	A solution `X` with `F(X) = X`; iteration converges to the least fixpoint for over-approximation.
Abstract interpretation	Executing a program over abstract values that soundly over-approximate concrete behavior.
Flow-sensitive	Distinguishes facts at different program points (respects statement order).
Path-sensitive	Distinguishes facts per execution path / branch condition.
Context-sensitive	Distinguishes a function's behavior per call site / calling context.
Field-sensitive	Distinguishes taint per object field rather than per whole object.
Sound	No false negatives — every real bug of the class is reported (over-approximation).
Complete	No false positives — every reported bug is real (under-approximation).
Soundy	Pragmatically sound except for a documented, well-understood set of unsound features.
k-CFA	Context-sensitive call analysis bounded to the last `k` call sites.

Core Concept 1 — The Monotone Framework: Lattice + Transfer + Fixpoint¶

A monotone data-flow framework has four pieces:

A lattice L of abstract facts, with a partial order ⊑ ("at least as much information as") and a join ⊔ for merging facts at confluence points. For taint, the lattice over a single variable is the two-point lattice {Clean ⊑ Tainted} with ⊔ = "tainted if either is tainted". For the whole program state, L is the powerset of (variable, tainted) facts, ordered by ⊆.
Transfer functions f_B : L → L, one per basic block, each monotone: feeding in more taint never produces less taint out. Monotonicity is what guarantees the iteration terminates on a finite-height lattice.
A direction. Forward analyses compute out[B] = f_B(in[B]), in[B] = ⊔ out[P] over predecessors P. Backward analyses reverse it.
Fixpoint iteration. Initialize, then repeatedly apply transfer functions and joins until nothing changes. The result is the least fixpoint — the smallest set of facts consistent with the equations, which is exactly the most precise sound over-approximation the abstraction allows.

  repeat:
    for each block B:
      in[B]  = ⊔ { out[P] : P → B }        # join over predecessors
      out[B] = transfer_B(in[B])           # apply transfer function
  until no in[]/out[] changed               # fixpoint reached

Termination is not an accident: monotone transfer functions over a finite-height lattice can only increase the fact set, which is bounded, so iteration must stabilize. This is the theoretical bedrock under every taint tool, the Dragon Book's classical analyses, and abstract interpretation in general.

Core Concept 2 — A Worked One-Step Lattice Example¶

Take the merge-point CFG from the middle tier. Lattice element = set of tainted variables; ⊑ = ⊆; ⊔ = ∪.

B0:  x = req.args["q"]      transfer: in ∪ {x}            (source adds x)
B1:  x = sanitize(x)        transfer: in \ {x}            (sanitizer removes x)
B2:  db.execute(x)          transfer: identity; CHECK x ∈ in → finding

Iterate (one variable, so the lattice has height 1 → converges in one pass):

  out[B0] = {} ∪ {x}            = {x}                  # x tainted after source
  in[B1]  = out[B0]            = {x}
  out[B1] = {x} \ {x}          = {}                    # sanitizer cleared x
  in[B2]  = out[B0] ⊔ out[B1] = {x} ∪ {} = {x}         # JOIN over both paths
  out[B2] = {x}
  ──────────────────────────────────────────────
  CHECK at B2:  x ∈ in[B2] = {x}  →  FINDING (the unsanitized false-branch path)

The join at B2 is the crux: the false branch (skipping B1) contributes {x}, the sanitized branch contributes {}, and {x} ⊔ {} = {x}. The analysis reports the sink because some path reaches it tainted. This is correct — that path is genuinely exploitable — and it is the mechanical reason "sanitize on one branch" never satisfies a sound taint analysis. To clear the finding, the sanitizer must dominate the sink (lie on every path to it).

Core Concept 3 — Forward/Backward, May/Must, and Taint's Place in the Grid¶

Two orthogonal axes classify any data-flow analysis:

	May (⊔ = ∪, "on some path")	Must (⊔ = ∩, "on all paths")
Forward	Taint, reaching definitions	available expressions
Backward	live variables (may-live)	very-busy expressions

Forward because taint moves with execution: a value is tainted at point p if it was tainted at some predecessor and not cleaned. (Live variables, by contrast, are backward — usefulness flows from future uses back to definitions.)
May because the safe-but-imprecise default for security is "report if tainted on any path." Using ∩ (must) would only report values tainted on every path — fewer alarms, but missing the single dangerous branch. May-analysis trades false positives for the guarantee of catching the bug. This choice — ⊔ = ∪ — is the soundness bias made concrete in the merge operator.

Knowing taint is "forward + may" instantly tells you its merge behavior, its convergence direction, and why it errs toward over-reporting.

Core Concept 4 — The Precision Dimensions and Their Cost¶

Every taint engine sits somewhere on four precision axes. Each one moves you toward fewer false positives and toward worse scalability. The whole craft is choosing where to sit.

Dimension	Imprecise (cheap)	Precise (costly)	What it fixes
Flow-sensitivity	ignores statement order	respects order; facts per program point	`x` cleaned then reused — flow-insensitive sees it as "ever tainted"
Path-sensitivity	merges branches	tracks branch conditions (often via SMT)	"tainted only when `if (unsafe)`" — kills infeasible-path FPs
Context-sensitivity	one summary for all callers	per-call-site behavior	a helper tainted by one caller marked tainted for all
Field-sensitivity	whole object tainted if any field is	per-field taint	tainting `user.name` shouldn't taint `user.id`

        precision  ───────────────────────────────►  fewer false positives
            │
   scalability │  ◄───────────────────────────────  whole-program sound analysis
            │      does NOT scale; tools approximate everywhere

Why whole-program sound analysis doesn't scale. Full path-sensitivity is exponential in branches; full context-sensitivity is exponential in call depth; precise field-, array-, and alias-sensitivity compound it. A truly sound, fully precise, whole-program analysis is intractable on real codebases. So every production tool bounds something: k-limited context (k-CFA), access-path-limited field sensitivity, no path-sensitivity, heuristic alias analysis. The art is spending precision only where it buys signal — usually flow- and field-sensitivity (cheap, high payoff) and limited context-sensitivity, while skipping general path-sensitivity.

Core Concept 5 — Soundness vs Completeness and Rice's Theorem¶

The two words that govern what any analysis can promise:

Sound = no false negatives. Every real vulnerability of the class is reported. Achieved by over-approximating: when unsure, assume tainted. Cost: false positives.
Complete = no false positives. Every reported finding is a genuine bug. Achieved by under-approximating: only report when certain. Cost: false negatives (missed bugs).

  reality:   { actual vulnerabilities }
  SOUND tool:      reports ⊇ reality     (catches all, plus extras = false positives)
  COMPLETE tool:   reports ⊆ reality     (all real, but misses some = false negatives)

You cannot have both, decidably. Rice's theorem says any non-trivial semantic property of programs (and "is this value attacker-controlled at this sink on some real execution?" is one) is undecidable. No algorithm can answer it exactly for all programs. Therefore every taint tool must approximate — and approximation means trading false positives against false negatives. There is no precise tool, only differently-biased ones.

How real tools choose. Most security SAST leans toward soundness (catch the bug) but isn't fully sound, because true soundness over reflection/eval/native code would make it unusably noisy. They land in the pragmatic middle nicknamed "soundy": sound except for an explicit, documented list of swept-under features (reflection, dynamic dispatch, some library calls, implicit flows). A senior must know which unsound assumptions a given tool makes, because those are exactly the bugs it will silently miss. "Soundy" is honesty about the gap, not a cure for it.

Core Concept 6 — CodeQL as a Data-Flow Engine¶

CodeQL compiles your codebase into a relational database of program facts — AST nodes, the CFG, the call graph, and a precomputed data-flow graph — then lets you run declarative queries (a Datalog-like language) over it. Taint tracking is a library on top: you extend a configuration declaring sources, sinks, and additional taint steps/sanitizers, and CodeQL computes interprocedural reachability for you.

/**
 * @name User input flows into SQL execution
 * @kind path-problem
 * @problem.severity error
 */
import python
import semmle.python.dataflow.new.TaintTracking
import semmle.python.dataflow.new.DataFlow

module SqlInjectionConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    // any read of an untrusted request attribute
    exists(Attribute a |
      a = source.asExpr() and
      a.getObject().(Name).getId() = "request" and
      a.getName() in ["args", "form", "values", "GET", "POST"]
    )
  }

  predicate isSink(DataFlow::Node sink) {
    // first argument of db.execute(...)
    exists(Call c |
      c.getFunc().(Attribute).getName() = "execute" and
      sink.asExpr() = c.getArg(0)
    )
  }

  predicate isBarrier(DataFlow::Node node) {
    // a sanitizer: int() cast or our vetted parameterizer
    exists(Call c |
      c.getFunc().(Name).getId() in ["int", "parameterize"] and
      node.asExpr() = c
    )
  }
}

module SqlFlow = TaintTracking::Global<SqlInjectionConfig>;

from SqlFlow::PathNode src, SqlFlow::PathNode sink
where SqlFlow::flowPath(src, sink)
select sink.getNode(), src, sink, "Untrusted data reaches SQL execution."

The three predicates — isSource, isSink, isBarrier (the sanitizer) — are exactly the junior-tier triad, now backed by a global, interprocedural, field-sensitive data-flow engine. @kind path-problem makes CodeQL emit the full witness path from source to sink, which is what turns a finding into something a reviewer can act on. Writing and maintaining these queries — and the framework models that feed them — is the org capability covered in professional.md.

Core Concept 7 — Interprocedural Mechanics: Call Graphs, Summaries, k-CFA¶

Interprocedural taint needs a call graph (who calls whom). Building it precisely is itself undecidable in the presence of dynamic dispatch and function values, so tools approximate (class-hierarchy analysis, rapid type analysis, points-to-based). Holes in the call graph = missed flows.

Two ways to compute flow across the graph:

Summary-based (functional approach). Compute, per function, a relation from input taint to output taint (return value, mutated parameters, mutated globals). Reuse at every call site. This is the IFDS/IDE framework (Reps–Horwitz–Sagiv) reduced to graph reachability — the reason CodeQL and Infer scale.
Cloning / call-strings (k-CFA). Distinguish calling contexts by the last k call sites. k=0 is context-insensitive (cheap, FPs); larger k is more precise and exponentially costlier. Most tools use small k or access-path-based context.

The tension: context-insensitivity merges callers, so one tainted caller of a shared utility poisons every caller's verdict (false positives); context-sensitivity fixes it at real cost. This is why a shared logging/formatting helper is a notorious false-positive generator in cheap analyses.

Core Concept 8 — The Engines Compared¶

Tool	Analysis style	What it's strong at
CodeQL	Whole-program, interprocedural, field-sensitive data flow over a code DB	Deep taint, custom queries, path witnesses; the reference for serious SAST
Semgrep (taint mode)	Pattern-driven, mostly intraprocedural + shallow interprocedural	Fast, readable rules, great for org-specific patterns; less deep
Infer	Separation-logic / abstract interpretation, compositional summaries	Null-deref, resource/memory leaks, concurrency; bi-abduction at scale
Pysa (Pyre)	Interprocedural taint for Python with model files	Python taint at Meta scale; explicit source/sink model `.pysa` files
Checker Framework	Pluggable types — taint as a type qualifier (`@Tainted`/`@Untainted`)	Compile-time, sound within the type discipline; developer-annotated

Note the spectrum: from type-based (Checker Framework: taint is a type, checked soundly but requiring annotations) through data-flow (CodeQL, Pysa) to pattern + flow (Semgrep) to separation logic (Infer). Choosing among them is a professional-tier cost/signal decision; understanding what class of analysis each performs is the senior prerequisite for that decision.

Real-World Examples¶

The infeasible-path false positive. A flow-insensitive analysis reported taint that, on any real execution, was guarded by a prior if (validated) return. Only path-sensitivity (or a barrier model of the validator) clears it — and the team chose to model the validator rather than pay for path-sensitivity.
Context-insensitivity poisoning a logger. A central format(msg) helper, called once with a tainted value, was marked taint-propagating everywhere; the report listed 200 "findings" all flowing through format. Adding context-sensitivity (or a precise summary that didn't route taint to the log sink) collapsed it to the one real case.
Soundy gap exploited. A deserialization gadget (pickle.loads) reintroduced attacker-controlled objects the analysis treated as clean — the documented unsound corner. The fix was a model declaring pickle.loads a source.
Type-based prevention. A Java service annotated its query API parameters @Untainted; the Checker Framework rejected, at compile time, any code passing a @Tainted value — soundness within the discipline, paid for in annotation effort.

Mental Models¶

Abstract interpretation = running the program in a fog. You replace concrete values with abstract ones (Clean/Tainted) and accept that the fog makes you over-cautious — that over-caution is soundness.
The lattice is the vocabulary; the transfer functions are the grammar; the fixpoint is the only sentence the program can say about itself.
Precision is a budget, not a goal. Spend it where false positives actually hurt; flow- and field-sensitivity are usually the best value.
Rice's theorem is the speed limit. Any tool promising "no false positives and no false negatives" is lying or solving a trivial sub-problem.
"Soundy" is a confession. Read the tool's list of unsound assumptions; that list is your exact false-negative surface.

Common Mistakes¶

Believing a tool is "sound." Almost none are, fully. Find and trust the documented soundy gaps instead of assuming completeness.
Buying path-sensitivity you don't need. It's the most expensive axis; a barrier model of the guard often clears the same FPs far cheaper.
Ignoring call-graph holes. Dynamic dispatch and function values silently drop edges; your interprocedural analysis is only as good as its call graph.
Confusing field-insensitivity FPs for real bugs. "Whole object tainted" can fire on a clean field; check whether the specific field carries taint.
Treating the join operator as a detail. ⊔ = ∪ vs ∩ is the difference between a may and a must analysis — it changes what the tool means by a finding.

Test Yourself¶

State the four components of a monotone data-flow framework and why monotonicity guarantees termination.
Run the one-step lattice example and explain why the join at B2 yields a finding.
Place taint in the forward/backward × may/must grid and justify each axis.
Rank the four precision dimensions by cost and give a false positive each one fixes.
Define sound and complete in terms of false negatives/positives; state why Rice's theorem forbids both.
What does "soundy" mean, and how do you find a given tool's unsound surface?
In the CodeQL config, which predicate is the sanitizer, and what does @kind path-problem buy you?
Why does context-insensitivity turn shared helpers into false-positive factories?

Cheat Sheet¶

MONOTONE FRAMEWORK   lattice L (⊑, ⊔) + monotone transfer f_B + direction + fixpoint
  termination       monotone f over finite-height lattice ⇒ iteration stabilizes
  result            LEAST fixpoint = most precise SOUND over-approximation

TAINT'S CLASS        forward + may  → merge with ⊔ = ∪  → tainted on ANY path
ONE-STEP            out=in∪{x} at source; in\{x} at sanitizer; check x∈in at sink
                    sanitizer must DOMINATE the sink to clear a finding

PRECISION AXES (cheap→precise, each ↑ cost ↓ false positives)
  flow-sensitive    respects statement order        (best value)
  field-sensitive   per-field taint                  (best value)
  context-sensitive per-call-site (k-CFA)            (moderate cost)
  path-sensitive    per branch condition / SMT       (most expensive)

SOUND  = no false negatives (over-approx, has FPs)
COMPLETE = no false positives (under-approx, has FNs)
RICE'S THEOREM ⇒ can't have both decidably ⇒ all tools approximate
SOUNDY = sound except a documented list (reflection/eval/native/implicit) ← your FN surface

CODEQL   code → relational DB → queries; isSource/isSink/isBarrier; path-problem = witness
ENGINES  CodeQL deep DF · Semgrep pattern+shallow DF · Infer sep-logic · Pysa Python DF · Checker Framework pluggable types

Summary¶

Data-flow analysis is abstract interpretation in the monotone framework: a lattice of facts, monotone transfer functions, a direction, and a fixpoint that over-approximates all executions. Taint is its forward, may instance — the join is union, so a value tainted on any path is reported, and a sanitizer clears a finding only when it dominates the sink. Real tools live on a four-axis precision frontier (flow-, path-, context-, field-sensitivity); buying precision means losing scalability, and whole-program sound-and-precise analysis is intractable, so everyone approximates. Rice's theorem makes the soundness/completeness trade-off mandatory: no tool is both, decidably, and most security SAST is soundy — sound except a documented gap that is its false-negative surface. CodeQL operationalizes all of this as queries over a code database with isSource/isSink/isBarrier, while Semgrep, Infer, Pysa, and the Checker Framework occupy different points on the analysis spectrum. The professional tier turns this understanding into an org capability.