Taint & Data-Flow Analysis — Middle Level¶

Roadmap: Static Analysis → Taint & Data-Flow Analysis

The control-flow graph, propagation rules, and the jump from one function to the whole program.

Table of Contents¶

Introduction
Prerequisites
Glossary
Core Concept 1 — The Control-Flow Graph (CFG)
Core Concept 2 — Data-Flow as Facts That Travel the CFG
Core Concept 3 — Taint Propagation Rules
Core Concept 4 — Intraprocedural vs Interprocedural
Core Concept 5 — Sanitizers, Implicit Flows, and Where Tools Give Up
Core Concept 6 — Semgrep Taint Mode in Practice
Real-World Examples
Mental Models
Common Mistakes
Test Yourself
Cheat Sheet
Summary
Further Reading
Related Topics

Introduction¶

Focus: modelling a program as a control-flow graph, watching taint travel along it, and understanding why crossing function boundaries (interprocedural analysis) is where both the power and the cost live.

At the junior tier, taint was an intuition: untrusted data flows to a dangerous place. At this tier we make "flows" mechanical. A program is a graph of basic blocks (the control-flow graph). A taint analysis is a procedure that pushes a set of facts — "variable x is tainted here" — along that graph until nothing changes. That's the engine. Once you see it, you understand why intraprocedural analysis is cheap and shallow, why interprocedural analysis is expensive and powerful, and why every real tool makes compromises.

Prerequisites¶

You can read the junior tier: source / sanitizer / sink, the worked SQLi trace.
You can read code in Python, JS, Go, or Java and recognize function calls, branches, and loops.
You have run a SAST tool (Semgrep, CodeQL, Bandit) at least once — see SAST & Security Scanners.
Helpful: you know what an AST is — see Custom Lint Rules & AST.

Glossary¶

Term	Meaning
CFG	Control-flow graph — basic blocks as nodes, possible execution transitions as edges.
Basic block	A straight-line run of statements with one entry and one exit (no branches in the middle).
Data-flow fact	A piece of information attached to a program point (e.g. "`x` is tainted").
Transfer function	The rule for how one statement transforms the incoming facts into outgoing facts.
Forward analysis	Facts flow along execution direction (taint, reaching definitions).
Backward analysis	Facts flow against execution direction (live variables).
Intraprocedural	Analysis within a single function, no call following.
Interprocedural	Analysis across function calls, via a call graph and/or summaries.
Call graph	Which functions can call which — the map interprocedural analysis walks.
Summary	A compact "if tainted in, tainted out" description of a function, reused at every call site.
Implicit flow	Taint that leaks through control (a branch on tainted data) rather than direct assignment.

Core Concept 1 — The Control-Flow Graph (CFG)¶

Every data-flow analysis starts by turning code into a control-flow graph: nodes are basic blocks (straight-line code), edges are the jumps between them. Consider:

def f(req):
    x = req.args["q"]      # B0
    if admin(req):
        x = sanitize(x)    # B1
    log(x)                 # B2
    db.execute(x)          # B2

Its CFG:

        ┌──────────────────────┐
        │ B0:  x = req.args[q]  │   (source: x tainted)
        └──────────┬───────────┘
            true   │   false
        ┌──────────┴───────────┐
        ▼                      │
 ┌──────────────────┐          │
 │ B1: x=sanitize(x)│          │
 └────────┬─────────┘          │
          │                    │
          ▼                    ▼
        ┌──────────────────────┐
        │ B2:  log(x)          │
        │      db.execute(x)   │   (sink)
        └──────────────────────┘

The CFG makes branching explicit. Block B2 has two incoming edges: one path went through the sanitizer (B1), one didn't (the false branch). What happens at the merge point is the whole game — and it's where soundness decisions are made (Core Concept 5).

Core Concept 2 — Data-Flow as Facts That Travel the CFG¶

A data-flow analysis attaches a set of facts to each program point and computes how each statement changes them. For taint, the fact is "which variables are tainted." Each statement has a transfer function — given the taint set coming in, it produces the taint set going out:

  in[B]  ── transfer(B) ──►  out[B]

Classic textbook analyses are the same machinery with different facts:

Analysis	Fact tracked	Direction	What it answers
Reaching definitions	which assignments can reach here	forward	"could this var have been set there?"
Live variables	which vars are used later	backward	"is this assignment dead?"
Available expressions	which expressions are already computed	forward	"can I reuse this?"
Taint	which vars carry untrusted data	forward	"did untrusted data get here?"

Taint is a forward analysis (facts move with execution) and — as you'll see — a may analysis: a variable is treated as tainted if it's tainted on any incoming path. The rigorous lattice/fixpoint formulation of all of this lives in senior.md; here, hold the picture: facts in, statement transforms them, facts out, repeat until stable.

Core Concept 3 — Taint Propagation Rules¶

The analysis needs rules for how taint moves through each kind of statement. A representative rule set:

ASSIGNMENT     y = x            if x tainted → y tainted; else y clean (kills old taint)
CONCATENATION  z = a + b        z tainted if a OR b tainted
METHOD/PROP    z = x.foo()      z tainted if x tainted (default: methods propagate)
              z = x.field       z tainted if x.field tainted (field-sensitivity, see senior)
COLLECTION     list.add(x)      list "carries" taint if x tainted
SOURCE         x = req.param    x tainted (rule says this call is a source)
SANITIZER      y = escape(x)    y CLEAN regardless of x (rule says this call cleans)
SINK           exec(x)          if x tainted at this point → FINDING

Two subtleties that trip people up:

Assignment kills taint. y = "constant" makes y clean even if it was tainted before — the old value is gone. This is why flow-sensitivity (tracking facts per program point) matters.
Propagation is the default; cleaning is the exception. Tools assume most operations preserve taint (over-approximate) and only clear it at functions explicitly modeled as sanitizers. That bias is deliberate: it favors catching bugs (fewer false negatives) over silence.

Core Concept 4 — Intraprocedural vs Interprocedural¶

This is the central distinction of the topic.

Intraprocedural analysis stays inside one function. It's fast and simple but blind at call boundaries — it must assume something about every call. Either it assumes calls don't propagate taint (misses real bugs) or that they always do (drowns in false positives). The junior worked-trace bug — source in get_user, sink three calls deep in build_query — is invisible to a purely intraprocedural analysis.

Interprocedural analysis follows calls. Two strategies:

Inlining / call-graph walking — virtually expand callees at each call site. Precise but explodes in size; recursion needs special handling.
Function summaries — analyze each function once to produce a compact summary like "argument 0's taint flows to the return value", then reuse that summary at every call site.

  build_query summary:   taint(arg0)  ⇒  taint(return)
  lookup summary:        taint(arg0)  ⇒  taint(return)  [because it calls build_query]
  get_user:              source → lookup(arg) → db.execute(...)  ← FINDING via summaries

Summaries are how real tools (CodeQL, Infer, Pysa) scale to millions of lines: analyze functions bottom-up, reuse results. Interprocedural analysis is where taint analysis earns its keep — and where almost all the cost and complexity live.

A related cost dimension is context-sensitivity: does the analysis distinguish two different call sites of the same function? A context-insensitive analysis merges them — if helper is ever called with tainted input anywhere, it treats helper's output as tainted everywhere, causing false positives. Context-sensitivity fixes that at real expense (covered in senior.md).

Core Concept 5 — Sanitizers, Implicit Flows, and Where Tools Give Up¶

Merge points and sanitizers. Recall the CFG in Concept 1: B2 is reached by a sanitized path and an unsanitized path. A may analysis takes the union at merges — tainted on any path means tainted — so B2 sees x as tainted and reports the sink. That's correct: the false branch really is exploitable. Sanitizing on one branch isn't enough; you must sanitize on all paths to the sink (or, better, right before the sink).

Implicit (control-flow) taint. Taint can leak without a direct assignment:

secret = read_secret()        # tainted (sensitive)
if secret[0] == 'a':
    print("starts with a")    # leaks 1 bit, no assignment of `secret`

This is an implicit flow — information flows through the branch decision, not through data. Most security taint tools deliberately ignore implicit flows: tracking them precisely is intractable and floods the report with noise. Knowing tools skip implicit flows tells you a real class of leak they will miss.

Where tools give up entirely (all detailed in senior.md and professional.md):

Reflection / dynamic dispatch — getattr(obj, name)(), Method.invoke() — the callee isn't known statically, so the call graph has holes.
eval / dynamic code — code constructed at runtime can't be analyzed.
Deserialization / ORMs / framework "magic" — data appears in variables the tool never saw assigned.
Native / FFI calls — opaque; the tool needs a hand-written model (stub) saying what taint they propagate.

The fix for all of these is the same: supply models — small declarations telling the tool "this framework function is a source," "this returns its argument's taint," "this is a sink." Modeling is the unglamorous core skill of running these tools well.

Core Concept 6 — Semgrep Taint Mode in Practice¶

Semgrep's taint mode is the most approachable real taint engine: you declare sources, sinks, and sanitizers as patterns and it does intraprocedural (and, in newer versions, limited interprocedural) propagation. A rule for the SQLi from the junior trace:

rules:
  - id: tainted-sql-execution
    mode: taint
    message: User input flows into a SQL query without parameterization.
    severity: ERROR
    languages: [python]
    pattern-sources:
      - pattern: request.args[...]
      - pattern: request.form[...]
    pattern-sanitizers:
      - pattern: db.execute("...", ...)      # parameterized call shape is safe
      - pattern: int($X)                       # casting to int removes SQLi risk
    pattern-sinks:
      - pattern: db.execute($SQL, ...)

Semgrep tracks taint from any source pattern to any sink pattern unless it passes a sanitizer pattern. Compared to a plain pattern: rule (which is pure AST matching, see Custom Lint Rules & AST), taint mode adds the flow — so it won't fire when the value was int()-cast first, and it will fire when the value travels through intermediate variables. The trade-off: Semgrep's interprocedural reach is far shallower than CodeQL's. Semgrep is fast and rule-author-friendly; CodeQL is deep and slow. Choosing between them is a senior/professional decision.

Real-World Examples¶

The branch that wasn't sanitized. A code path sanitized user input inside an if is_admin: block but used the raw value on the else path. Intraprocedural taint with proper merge-point handling flagged the unsanitized path; a careless reviewer missed it.
Summary reuse catching a deep bug. A utility render_template(name) deep in a shared library was a sink. The taint tool's summary for it propagated to dozens of call sites; one of them passed a request header. No human had connected those files.
The reflection blind spot. A plugin system dispatched handlers via getattr(module, action)(payload). The tool couldn't resolve action, so it never connected the request to the handler's SQL sink — a false negative fixed only by adding a model.
Semgrep noise from no sanitizer model. A team's custom safe_query() wrapper was safe, but Semgrep didn't know it, so every call lit up. Adding it as a pattern-sanitizers entry killed the noise.

Mental Models¶

The CFG is the road map; taint is a rumor spreading along it. At every fork the rumor takes both roads; at a merge, anyone who heard it on either road now knows it.
Summaries are pre-computed answers. Analyze a function once, write down "taint in arg → taint in return," reuse forever. This is dynamic programming for whole-program analysis.
Intraprocedural = reading one chapter; interprocedural = reading the whole book. A bug whose source and sink are in different chapters is invisible to per-chapter reading.
The tool only knows what you model. Frameworks, native calls, and reflection are black boxes until you describe them.

Common Mistakes¶

Sanitizing on one branch. A sanitizer on the if path doesn't protect the else path. Sanitize on every path to the sink, or immediately before it.
Expecting intraprocedural tools to find cross-function bugs. If source and sink are in different functions, you need interprocedural analysis or you'll get false negatives.
Forgetting to model custom sanitizers. Your safe_query() wrapper is invisible until declared a sanitizer — until then you drown in false positives.
Assuming taint tools catch implicit flows. Most ignore control-flow leaks by design. Don't rely on them for information-leak detection.
Treating Semgrep taint mode and CodeQL as interchangeable. Their depth and cost differ by an order of magnitude.

Test Yourself¶

Draw the CFG for a function with one if/else and identify the merge block.
What is a transfer function? Give the rule for z = a + b.
Why does assignment of a constant kill taint? What property of the analysis makes that possible?
Why is the junior-tier SQLi bug invisible to a purely intraprocedural analysis?
What is a function summary and why does it matter for scale?
At a CFG merge with one sanitized and one unsanitized path, is the variable tainted afterward? Why?
What is an implicit flow, and why do most security taint tools ignore it?
Write a Semgrep taint rule with one source, one sink, and one sanitizer.

Cheat Sheet¶

CFG               basic blocks (nodes) + control edges; merges are where soundness is decided
DATA-FLOW         in[B] --transfer(B)--> out[B], repeat until stable
TAINT             forward + may analysis (tainted on ANY incoming path = tainted)
PROPAGATION       y=x copies taint; a+b unions; assign-constant KILLS taint
                  default = propagate; sanitizer = the only thing that clears

INTRA vs INTER
  intraprocedural   one function; fast; blind at calls (false negatives across functions)
  interprocedural   follows calls via call graph + SUMMARIES; powerful; expensive
  context-sensitive distinguishes call sites; fewer FPs; costlier

TOOLS GIVE UP ON  reflection, dynamic dispatch, eval, deserialization, native/FFI → need MODELS
IMPLICIT FLOW     taint via branch decision, not assignment → most tools IGNORE it

SEMGREP TAINT MODE
  mode: taint + pattern-sources / pattern-sanitizers / pattern-sinks
  shallow interprocedural; fast; rule-author-friendly (vs CodeQL: deep + slow)

Summary¶

A program becomes a control-flow graph; a data-flow analysis pushes facts along it via transfer functions until they stabilize. Taint is the forward, may-flavored instance: untrusted values propagate by default and are cleared only at modeled sanitizers; a tainted value live at a sink is a finding. The decisive axis is intraprocedural vs interprocedural — staying in one function is cheap but misses cross-function bugs, while following calls (via a call graph and summaries) is where the real power and nearly all the cost reside. Tools go blind at reflection, eval, deserialization, and native calls, requiring hand-written models, and most deliberately ignore implicit flows. Semgrep's taint mode is the friendliest place to practice declaring sources, sinks, and sanitizers. The formal lattice/fixpoint underpinning and the precision/soundness trade-offs are the next tier.