Taint & Data-Flow Analysis — Junior Level
Following untrusted data through your program until it hits something dangerous.
Table of Contents
- Introduction
- Prerequisites
- Glossary
- Core Concept 1 — Source, Sanitizer, Sink
- Core Concept 2 — What "Tainted" Means
- Core Concept 3 — Following the Flow: A Worked Trace
- Core Concept 4 — The Three Big Findings: SQLi, XSS, Command Injection
- Core Concept 5 — Why a Tool Can Do This Better Than grep
- Real-World Examples
- Mental Models
- Common Mistakes
- Test Yourself
- Cheat Sheet
- Summary
- Further Reading
- Related Topics
Introduction
Focus: the one idea that makes data-flow security analysis click — dangerous data flows from a SOURCE, and the bug is when it reaches a SINK without first being cleaned by a SANITIZER.
Most security bugs are not exotic. They are the same shape over and over: some data the attacker controls travels through your program and ends up somewhere it can do damage. A search box value ends up inside a SQL query. A username ends up inside an HTML page. A filename from a URL ends up inside a shell command.
A SAST tool that does taint analysis automates the question a careful reviewer asks: "Where did this value come from, and where is it going?" It tracks tainted (untrusted) values as they move from variable to variable, function to function, and raises an alarm the moment a tainted value arrives at a dangerous operation without being made safe along the way.
This page teaches the intuition with no math. The lattices, fixpoints, and formal frameworks live in senior.md and professional.md. For now, learn to see the flow.
Prerequisites
- You can read code in at least one language (Python, JavaScript, Go, or Java).
- You know roughly what a SQL query and an HTTP request are.
- You have heard of SQL injection even if you couldn't define it precisely.
- Helpful: you have run a linter or SAST tool once (see SAST & Security Scanners).
Glossary
| Term | Meaning |
|---|---|
| Taint | A label meaning "this value came from an untrusted source and hasn't been cleaned." |
| Source | Where untrusted data enters the program (a request parameter, form field, file upload, header). |
| Sink | A dangerous operation a value flows into (a SQL query, exec(), writing to an HTML page). |
| Sanitizer | A function that removes the danger from a value (escaping, parameterizing, validating). |
| Propagation | How taint spreads — if b = a and a is tainted, b becomes tainted too. |
| Finding | One reported issue: a tainted value reached a sink without a sanitizer. |
| Data-flow | The path a value takes through the program, variable by variable. |
| False positive | A reported finding that isn't actually exploitable. |
| False negative | A real bug the tool missed. |
Core Concept 1 — Source, Sanitizer, Sink
Three words explain almost all of injection security. Memorize them.
SOURCE ───────────► (your code) ───────────► SINK
attacker-controlled                          dangerous operation
data enters here                             happens here

                    SANITIZER
                    makes the data safe;
                    if data passes through one,
                    it is no longer tainted
- A source is a doorway for untrusted data: request.args["q"], os.Args, req.body.name, an HTTP header, a message off a queue, a row from a database that another user wrote.
- A sink is a place where data, if attacker-controlled, becomes a weapon: a SQL query string, an os/exec command, eval(), innerHTML, a file path, a redirect URL.
- A sanitizer is the fix. It transforms the value so it can no longer break out of its intended meaning: a parameterized query placeholder, an HTML escaper, an allow-list validator.
The rule, in one sentence: Tainted data reaching a sink without passing a sanitizer is a vulnerability. Everything else in this topic is making that rule precise, automatic, and scalable.
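The rule is small enough to write as a few lines of code. The sketch below is only an illustration, not how any particular tool is implemented: it assumes a flow has already been reduced to an ordered list of step labels, and the function name is_finding and the label strings are invented for this example.

```python
def is_finding(path):
    """Return True if taint from a source reaches a sink with no sanitizer in between.

    `path` is an ordered list of step labels: "source", "sanitizer", "sink",
    or anything else for ordinary code that neither adds nor removes taint.
    """
    tainted = False
    for step in path:
        if step == "source":
            tainted = True        # untrusted data entered
        elif step == "sanitizer":
            tainted = False       # the danger was removed
        elif step == "sink" and tainted:
            return True           # tainted data reached a dangerous operation
    return False


print(is_finding(["source", "concat", "sink"]))               # True
print(is_finding(["source", "sanitizer", "concat", "sink"]))  # False
print(is_finding(["constant", "sink"]))                       # False
```

Real tools work on program graphs rather than flat lists, but the verdict they compute has this shape.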
Core Concept 2 — What "Tainted" Means
"Tainted" is just a sticky label. Think of untrusted data as wet paint. The moment it enters your program it is wet (tainted). Touch it, copy it, concatenate it — your hands (the new variables) get paint on them too. The paint only dries (becomes safe) when it passes through a sanitizer.
name = request.args["name"] # tainted (source)
greeting = "Hello " + name # tainted (paint spread by concatenation)
upper = greeting.upper() # tainted (still wet — upper() doesn't clean)
safe = html.escape(upper) # CLEAN (sanitizer dried the paint)
Notice that upper() does not clean the value — it just changes the letters. Only a function that addresses the specific danger counts as a sanitizer. For HTML, that's HTML-escaping. For SQL, it's a parameterized query. Using the wrong sanitizer (e.g. HTML-escaping a value that goes into SQL) leaves the bug wide open.
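To make the sticky label concrete, here is a minimal sketch that carries the label next to the string. The Value type and the helper functions from_request, concat, upper, and escape_html are hypothetical names for this illustration; real analyzers track the label statically, without running the program.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    """A string plus the sticky 'tainted' label (hypothetical helper type)."""
    text: str
    tainted: bool


def from_request(text):
    # Source: anything that arrives from a request starts out tainted.
    return Value(text, tainted=True)


def concat(prefix, value):
    # Propagation: the result is tainted if the input was.
    return Value(prefix + value.text, value.tainted)


def upper(value):
    # Plain transformation: the letters change, the label does not.
    return Value(value.text.upper(), value.tainted)


def escape_html(value):
    # Sanitizer for the HTML context: the label is cleared.
    return Value(html.escape(value.text), tainted=False)


name = from_request("<b>bob</b>")
greeting = concat("Hello ", name)
shout = upper(greeting)
safe = escape_html(shout)

for label, v in [("name", name), ("greeting", greeting), ("shout", shout), ("safe", safe)]:
    print(label, v.tainted, v.text)
```

Only escape_html clears the label, and only because the value is headed for an HTML page. The same value headed for a SQL query would need a different sanitizer.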
Core Concept 3 — Following the Flow: A Worked Trace
Here is the exact reasoning a taint tool performs. A request parameter travels through two functions and lands in a SQL query with no sanitizer.
# --- web layer ---
@app.route("/user")
def get_user():
    uid = request.args["id"]           # [1] SOURCE: tainted enters here
    return render(lookup(uid))         # [2] tainted passed as argument

# --- service layer ---
def lookup(user_id):
    query = build_query(user_id)       # [3] tainted passed deeper
    return db.execute(query)           # [6] SINK: tainted reaches db.execute ← BUG

# --- helper ---
def build_query(uid):
    return "SELECT * FROM users WHERE id = " + uid   # [4][5] tainted woven into SQL
The tool's trace, step by step:
[1] request.args["id"] tainted (source matched)
[2] uid → lookup(uid) tainted (flows into parameter user_id)
[3] user_id → build_query(user_id) tainted (flows into parameter uid)
[4] "SELECT ... = " + uid tainted (concatenation propagates taint)
[5] return value of build_query tainted (returned tainted string)
[6] db.execute(query) tainted reaches SINK, no sanitizer → FINDING
No sanitizer ever ran. An attacker passing ?id=1 OR 1=1-- reads every user. The fix is a sanitizer between source and sink — here, a parameterized query:
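def build_query(uid):
    # Return the SQL text and its bound parameters separately; the caller then runs db.execute(sql, params).
    return ("SELECT * FROM users WHERE id = ?", [uid])   # ? placeholder = sanitizer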
Now uid is passed as a bound parameter, not woven into the SQL text. The database treats it as data, never as code. The taint tool sees the value go through the parameterized-query API — a known sanitizer — and the finding disappears.
If you can do this trace by hand on the example above, you understand the core of every SAST data-flow engine.
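The injection and its fix from the worked trace can be reproduced with Python's built-in sqlite3 module. This is a sketch under assumptions: the in-memory table, its three rows, and the function names lookup_unsafe and lookup_safe are invented for the demonstration, and the page's db object may behave differently from a sqlite3 connection.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
               [(1, "alice"), (2, "bob"), (3, "carol")])


def lookup_unsafe(uid):
    # Tainted text is concatenated into the SQL string, as in the trace.
    query = "SELECT name FROM users WHERE id = " + uid
    return db.execute(query).fetchall()


def lookup_safe(uid):
    # The value travels as a bound parameter, never as SQL text.
    return db.execute("SELECT name FROM users WHERE id = ?", (uid,)).fetchall()


payload = "1 OR 1=1"
print(lookup_unsafe(payload))   # every row comes back
print(lookup_safe(payload))     # no row has that id
print(lookup_safe("1"))         # a legitimate id still works
```

The payload is identical in both calls. What changes is whether the database receives it as part of the query text or as a value.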
Core Concept 4 — The Three Big Findings: SQLi, XSS, Command Injection
The same source→sink shape produces the three most common injection bugs. Only the sink and the correct sanitizer change.
| Vulnerability | Source (example) | Sink | Correct sanitizer |
|---|---|---|---|
| SQL injection | request.args["id"] | db.execute(sql_string) | Parameterized query / bound params |
| Cross-site scripting (XSS) | request.form["bio"] | innerHTML = … / HTML template without escaping | HTML-context escaping / safe templating |
| Command injection | request.args["file"] | os.system(cmd) / subprocess.run(cmd, shell=True) | Argument arrays (no shell) / strict allow-list |
These map directly to the sql-injection-prevention, xss-prevention, and input-validation skills — read those for the defensive side. Taint analysis is the detective side: it finds the places where you forgot the defense.
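The XSS and command-injection rows can be sketched with the standard library alone. The bio and filename values are made-up attacker inputs, and the child process is just the Python interpreter echoing its first argument, standing in for a real tool such as an image converter.

```python
import html
import subprocess
import sys

# XSS: escape for the HTML context before the value is written into a page.
bio = "<script>steal()</script>"
page = "<p>" + html.escape(bio) + "</p>"
print(page)

# Command injection: pass an argument array instead of building a shell string.
# No shell parses the value, so "; rm -rf /" is never treated as a second command.
filename = "x.png; rm -rf /"
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", filename],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```

In both cases the attacker's characters survive as data: the browser displays the script tag as text, and the child process receives the whole filename as one argument.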
Core Concept 5 — Why a Tool Can Do This Better Than grep
You could grep for db.execute. But grep finds every call, including the safe parameterized ones, and tells you nothing about where the argument came from. A taint tool is smarter because it tracks the flow:
- It knows db.execute(parameterized_query) is safe and db.execute(tainted_string) is not — same function, different verdict.
- It follows the value across function calls, so a sink in build_query three calls deep from the source still gets connected.
- It knows a sanitizer in the middle clears the alarm, so it doesn't cry wolf on code you already fixed.
That is the whole value proposition: grep matches text; data-flow analysis matches journeys.
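The difference can be seen in a toy comparison built on Python's ast module. This is a deliberately tiny sketch, not a real analyzer: it handles only straight-line code, and it assumes that request.args[...] is the only source, that any .execute(...) call is a sink, and that only the first argument (the SQL text) matters. The names CODE, grep_hits, is_source, is_tainted, and taint_findings are invented for the example.

```python
import ast

CODE = '''
uid = request.args["id"]
unsafe_sql = "SELECT * FROM users WHERE id = " + uid
db.execute(unsafe_sql)
db.execute("SELECT * FROM users WHERE id = ?", [uid])
'''


def grep_hits(code):
    # Text matching: every line that mentions the sink, safe or not.
    return [n for n, line in enumerate(code.splitlines(), 1) if "db.execute(" in line]


def is_source(node):
    # Matches the expression request.args[...]
    return (isinstance(node, ast.Subscript)
            and isinstance(node.value, ast.Attribute)
            and node.value.attr == "args"
            and isinstance(node.value.value, ast.Name)
            and node.value.value.id == "request")


def is_tainted(node, tainted):
    if is_source(node):
        return True
    if isinstance(node, ast.Name):
        return node.id in tainted
    if isinstance(node, ast.BinOp):     # concatenation propagates taint
        return is_tainted(node.left, tainted) or is_tainted(node.right, tainted)
    return False


def taint_findings(code):
    tainted, findings = set(), []
    for stmt in ast.parse(code).body:
        if isinstance(stmt, ast.Assign) and is_tainted(stmt.value, tainted):
            tainted.update(t.id for t in stmt.targets if isinstance(t, ast.Name))
        elif isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call):
            call = stmt.value
            if (isinstance(call.func, ast.Attribute) and call.func.attr == "execute"
                    and call.args and is_tainted(call.args[0], tainted)):
                findings.append(stmt.lineno)   # tainted SQL text reached the sink
    return findings


print(grep_hits(CODE))        # [4, 5]
print(taint_findings(CODE))   # [4]
```

The text match reports both calls. The flow-based check reports only line 4, because on line 5 the SQL text is a constant and the tainted value is passed as a bound parameter.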
Real-World Examples
- The classic SQLi breach. A login form concatenates the username into a SQL string. Attacker submits admin'--, bypasses the password check, owns the account. A taint tool flags request.form["username"] → SQL sink on the first scan.
- Stored XSS in a profile bio. A user saves <script>steal()</script> as their bio; another user views the profile and the script runs in their browser. The source (form["bio"]) is stored, retrieved later, and rendered without escaping — taint analysis that models the database as a tainted source catches it.
- Command injection in a "convert file" feature. A web tool passes a user-supplied filename into os.system("convert " + name). Attacker sends name=x.png; rm -rf /. The taint tool connects the URL parameter to the shell sink.
- The false positive. A tool flags db.execute(query) but a reviewer sees the value was actually validated against a hard-coded allow-list two lines up. The tool didn't recognize that validation as a sanitizer — a model problem you'll learn to fix at higher tiers.
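The false-positive case is easier to picture with code. Below is a minimal sketch of allow-list validation acting as a sanitizer. The column names and the functions validated_sort_column and build_listing_query are hypothetical. A sort column is used as the example because placeholders bind values, not identifiers, so an allow-list is the usual defense there.

```python
ALLOWED_SORT_COLUMNS = {"name", "created_at"}   # hard-coded allow-list (hypothetical)


def validated_sort_column(raw):
    """Return raw only if it is one of the known-safe column names."""
    if raw not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unsupported sort column: {raw!r}")
    return raw


def build_listing_query(raw_sort):
    column = validated_sort_column(raw_sort)
    # Safe only because column can be nothing but an allow-listed constant.
    return "SELECT id, name FROM users ORDER BY " + column


print(build_listing_query("name"))
try:
    build_listing_query("name; DROP TABLE users")
except ValueError as err:
    print("rejected:", err)
```

A tool that has not been told validated_sort_column is a sanitizer still sees request data concatenated into SQL and reports it, even though a reviewer can see the flow is safe.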
Mental Models
- Wet paint. Untrusted data is wet paint; it smears onto everything it touches and only dries at a sanitizer.
- Airport security. Sources are the entrances, sinks are the boarding gates, sanitizers are the security checkpoint. The finding is a passenger who reached the gate without going through screening.
- Follow the money. Taint analysis is forensic accounting for data: where did this value come from, who handled it, and did it ever get laundered (sanitized) before doing something risky?
- The verdict depends on the journey, not the destination. The same sink is safe or dangerous depending on what flowed into it.
Common Mistakes
- Thinking the sink is the bug. db.execute is not dangerous — tainted data reaching it is. Don't ban the sink; sanitize the data before it gets there.
- Mistaking transformation for sanitization. .upper(), .strip(), .trim() change the text but don't remove the danger. Only a context-appropriate sanitizer counts.
- Using the wrong sanitizer. HTML-escaping does nothing against SQL injection. Match the sanitizer to the sink's context.
- Assuming "it's internal, so it's safe." Data from your own database can be tainted if another user wrote it (stored XSS). The trust boundary, not the source code boundary, is what matters.
- Trusting grep-style scanning. Text matching can't tell a safe sink from a dangerous one. Use a tool that follows data flow.
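The wrong-sanitizer mistake is easy to demonstrate. In this sketch, which uses an invented two-row table in an in-memory sqlite3 database, an injection payload is passed through html.escape and still works, because the payload contains nothing that HTML escaping changes.

```python
import html
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users (id, name) VALUES (?, ?)", [(1, "alice"), (2, "bob")])

payload = "1 OR 1=1"
escaped = html.escape(payload)      # right tool for HTML, wrong tool for SQL
print(escaped == payload)           # True: nothing in the payload is HTML-special

rows = db.execute("SELECT name FROM users WHERE id = " + escaped).fetchall()
print(len(rows))                    # 2: the injection still returns every row
```

A taint tool that models sanitizers per sink context would keep this value tainted for the SQL sink even though an escaping function ran.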
Test Yourself
- Define source, sink, and sanitizer in one sentence each.
- In the worked trace, which line is the source and which is the sink? Why is there no sanitizer?
- Is name.upper() a sanitizer? Why or why not?
- Why is the same db.execute call sometimes a finding and sometimes safe?
- A username is saved to a database, then later displayed on a page without escaping. Name the source, the sink, and the vulnerability.
- Why can a data-flow tool find bugs that grep cannot?
Cheat Sheet
THE ONE RULE
  tainted value → reaches sink → no sanitizer in between → VULNERABILITY

THE THREE WORDS
  SOURCE     untrusted data enters (request param, header, upload, other-user DB row)
  SINK       dangerous operation   (SQL exec, shell, innerHTML, file path, eval)
  SANITIZER  makes data safe       (parameterized query, HTML escape, allow-list)

THE BIG THREE BUGS (same shape, different sink/sanitizer)
  SQL injection      → sink: SQL query    → fix: parameterized query
  XSS                → sink: HTML output  → fix: context escaping
  Command injection  → sink: shell call   → fix: arg arrays / allow-list

PAINT METAPHOR
  untrusted = wet paint; it smears on copy/concat; dries only at a sanitizer
Summary
Taint analysis automates one question: did attacker-controlled data reach a dangerous operation without being cleaned? Untrusted values enter at sources, spread as they're copied and passed around (propagation), and become a finding if they reach a sink without crossing a sanitizer. The three classic injection bugs — SQLi, XSS, command injection — are all this same shape with different sinks and different correct fixes. A data-flow tool beats grep because it tracks the journey of a value, not just where text appears. The rigorous machinery behind this — control-flow graphs, lattices, fixpoints, interprocedural summaries — is what the higher tiers build out.
Further Reading
- OWASP — SQL Injection, Cross-Site Scripting (XSS), and Command Injection cheat sheets (the canonical source/sink/sanitizer catalogue).
- The sql-injection-prevention, xss-prevention, and input-validation skills in this repository — the defensive counterparts to detection.
- Semgrep — Taint mode tutorial (the gentlest hands-on introduction to source/sink/sanitizer in a real tool).
- SAST & Security Scanners — junior.md for the tool-running basics this topic sits underneath.
Related Topics
- SAST & Security Scanners — taint analysis is the engine underneath serious SAST.
- Custom Lint Rules & AST — how patterns over code are written, the layer below data-flow.
- Dynamic Analysis & Sanitizers — the runtime counterpart that observes real tainted flows during execution.
- SQL Injection Prevention, XSS Prevention, Input Validation — the defensive skills referenced above.