Dynamic Instrumentation & eBPF — Hands-On Exercises¶
Topic: Dynamic Instrumentation & eBPF Roadmap
Observe a running program, or the Linux kernel itself, without touching source, recompiling, or restarting it. These exercises take you from a BEGIN hello-world to a libbpf CO-RE tool and a PID-targeted incident script.
Prerequisites: A Linux box (bare metal, VM, or WSL2 with a real kernel), root / sudo (or CAP_BPF + CAP_PERFMON on 5.8+), and bpftrace + bcc-tools installed. A recent kernel helps a lot: 5.4+ for most of this, 5.5+ for fentry/fexit, and 5.8+ for BPF_MAP_TYPE_RINGBUF. CO-RE needs the kernel built with BTF (CONFIG_DEBUG_INFO_BTF=y); check with ls -l /sys/kernel/btf/vmlinux. If that file is missing, the Capstone CO-RE exercise won't work; you would have to fall back to kernel headers and the legacy BCC path.
Table of Contents¶
- Introduction
- Warm-Up
- Core
- Advanced
- Capstone
- If you can do all of these, you have the senior level
- Related Topics
Introduction¶
These exercises are meant to be typed and run, not read. Each one gives you a goal, a setup, the steps, a collapsible solution with the exact command and an explanation, and a one-line takeaway. Difficulty climbs section by section.
Set up an environment. On Debian/Ubuntu: sudo apt install bpftrace bpfcc-tools linux-headers-$(uname -r) (typical package names; adjust for your release). On Fedora/RHEL: sudo dnf install bpftrace bcc-tools kernel-devel.
No spare Linux box? Spin up a throwaway VM (multipass launch --name ebpf 22.04 or Lima/Vagrant). Containers usually don't give you the privileges these probes need, so use a VM or the host.
Sanity checks before you start:
uname -r # kernel version — 5.8+ unlocks fentry + ringbuf
ls -l /sys/kernel/btf/vmlinux # exists => BTF present => CO-RE works
sudo bpftrace -e 'BEGIN { printf("ok\n"); exit(); }' # toolchain works?
id # are you root, or do you have CAP_BPF?
BCC tools install with a -bpfcc suffix on Debian/Ubuntu (execsnoop-bpfcc). On other distros they may be plain execsnoop under /usr/share/bcc/tools/. Adjust names to match your system.
Safety. eBPF programs are checked by the verifier before they load, which is designed to keep them from crashing the kernel, but a chatty printf probe on a hot path (e.g. every sys_enter) can flood your terminal and add measurable overhead. Aggregate (count(), hist()) instead of printing per-event whenever you can. Don't run heavy or experimental probes on shared production hosts; use your own VM.
Warm-Up¶
W1 — Your first probe: BEGIN hello-world¶
Goal: Confirm the toolchain runs and learn the probe → action shape.
Setup: Toolchain installed, sudo available.
Do it: Print a line once when the program starts, then exit cleanly.
Solution / Hint
One command that does this: `sudo bpftrace -e 'BEGIN { printf("hello, eBPF\n"); exit(); }'`. `BEGIN` is a special probe that fires once when the script loads (mirror: `END` fires at teardown, where aggregated maps are auto-printed). `exit()` tells bpftrace to detach and quit. Every bpftrace statement is **`probe { action }`**; you'll repeat that shape forever.
What you learned: The probe { action } model and the BEGIN/END lifecycle hooks.
W2 — Discover what you can hook with -l¶
Goal: List available probes so you stop guessing names.
Setup: None.
Do it: List every syscall-enter tracepoint, then every probe whose name contains vfs_.
Solution / Hint
Commands that fit: `sudo bpftrace -l 'tracepoint:syscalls:sys_enter_*'` for the syscall-enter tracepoints and `sudo bpftrace -l '*vfs_*'` for anything containing `vfs_`. `-l` lists probes matching a glob. The probe taxonomy you'll use:
- **`tracepoint:`** (stable kernel-defined hooks; prefer these)
- **`kprobe:` / `kretprobe:`** (any kernel function entry/return; powerful but tied to internal names that change between versions)
- **`uprobe:` / `uretprobe:`** (user-space functions)
- **`usdt:`** (statically-defined user probes)
- **`profile:` / `interval:`** (timer-based)
- **`software:` / `hardware:`** (perf events)
What you learned: How to enumerate probes and the difference between stable tracepoints and version-fragile kprobes.
W3 — Count syscalls per process (one-liner)¶
Goal: Build your first aggregation map.
Setup: Have some activity running (a find /, a browser, anything).
Do it: Count syscalls grouped by process name. Let it run ~10s, then Ctrl-C.
Solution / Hint
A one-liner that fits: `sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'`. `@` is a **map**. `@[comm]` keys it by the current process name (`comm`); `count()` increments per hit. On Ctrl-C, bpftrace runs the implicit `END` and prints every map sorted by value. `raw_syscalls:sys_enter` catches **all** syscalls in one probe, which is cheaper and more complete than attaching to each `syscalls:sys_enter_*` individually.
What you learned: Map aggregation with count() keyed by comm, the workhorse pattern that avoids per-event printf overhead.
W4 — See processes and file opens live with BCC¶
Goal: Use prebuilt BCC tools instead of writing scripts.
Setup: bpfcc-tools installed. Open a second terminal to generate activity (run ls, cat /etc/hostname, launch an editor).
Do it: Watch every new process exec, then watch every file open, in real time.
Solution / Hint
Run `sudo execsnoop-bpfcc`, then `sudo opensnoop-bpfcc` (drop the `-bpfcc` suffix on distros that ship the plain names). These are Python wrappers around eBPF programs maintained in the BCC project. `execsnoop` hooks the exec path; `opensnoop` hooks the open path. They're the first reach-for tools in an incident: "what just ran?" and "what file is it failing to open?" Add `-x` to `opensnoop-bpfcc` to show only failed opens.
What you learned: The BCC tool catalog gives you battle-tested observability for free; know execsnoop and opensnoop cold.
W5 — Trace openat with process name and filename¶
Goal: Read syscall arguments and turn a user-space pointer into a string.
Setup: Open a second terminal; run cat /etc/hostname while your trace is live.
Do it: Print the process name and the path for every openat.
Solution / Hint
A one-liner that fits: `sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'`. `args->filename` is a **pointer in user memory**; `str()` safely copies it into the BPF program and bounds it. Find available fields with `sudo bpftrace -lv tracepoint:syscalls:sys_enter_openat`. This is per-event `printf`, so it's noisy by design: fine for a focused look, but switch to a map the moment you want totals.
What you learned: Reading args->, dereferencing user pointers with str(), and discovering argument names with -lv.
Core¶
C1 — Function-latency histogram (the entry/exit tid-join pattern)¶
Goal: Measure how long a kernel function takes and plot the distribution.
Setup: Generate disk/VFS activity (dd if=/dev/zero of=/tmp/x bs=1M count=200; sync).
Do it: Histogram the latency of vfs_read: timestamp on entry, subtract on return.
Solution / Hint
A script that fits: `sudo bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; } kretprobe:vfs_read /@start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'`. The pattern: stash the entry timestamp keyed by **`tid`** (thread id), then on return compute the delta and feed it to `hist()` (log2 buckets). Three things matter:
- **Key by `tid`, not `pid`.** `pid` in bpftrace is the thread-group id (the userspace "PID"); multiple threads share it, so concurrent calls would clobber each other's start time. `tid` is the unique kernel thread id.
- **The `/@start[tid]/` filter** drops returns that have no matching entry (the probe started mid-call).
- **`delete(@start[tid])`** frees the slot so the map doesn't grow unbounded.
What you learned: The canonical entry→exit latency measurement, why tid keying is mandatory, and hist() for distributions.
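To see why the stash has to be keyed by `tid`, here is a small Python model of the same bookkeeping. It is only a simulation under assumed inputs: the event tuples, the thread ids, and the helper names `join_latencies` and `log2_bucket` are invented for illustration and are not part of bpftrace.

```python
def join_latencies(events, key):
    """Pair enter/exit events the way the @start map does; key is "tid" or "pid"."""
    start = {}                      # plays the role of @start[...]
    latencies = []
    for kind, pid, tid, ts in events:
        k = tid if key == "tid" else pid
        if kind == "enter":
            start[k] = ts           # stash the entry timestamp
        elif k in start:            # same job as the /@start[tid]/ filter
            latencies.append(ts - start.pop(k))   # pop stands in for delete()
    return latencies


def log2_bucket(value):
    """Return the [low, high) power-of-two bucket that holds an integer value."""
    if value < 1:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)
    return (low, low * 2)


# Two threads (tid 101, 102) of one process (pid 100) call the function
# concurrently; tid 103 returns from a call that began before tracing started.
EVENTS = [
    ("enter", 100, 101, 0),
    ("enter", 100, 102, 50),
    ("exit", 100, 101, 1000),
    ("exit", 100, 102, 1100),
    ("exit", 100, 103, 1200),
]

if __name__ == "__main__":
    print("keyed by tid:", join_latencies(EVENTS, "tid"))
    print("keyed by pid:", join_latencies(EVENTS, "pid"))
    print("bucket for 1000:", log2_bucket(1000))
```

Keyed by `tid`, both concurrent calls are measured and the unmatched return is dropped. Keyed by `pid`, the second entry overwrites the first, so one latency comes out wrong and the other call is lost.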
C2 — Capture syscall args AND return code, joined by tid¶
Goal: Correlate what a call asked for with how it ended.
Setup: Run cat /nonexistent and cat /etc/hostname while tracing to get both a failure and a success.
Do it: Record the filename on sys_enter_openat, then on sys_exit_openat print the filename together with the return value (fd or -errno).
Solution / Hint
The enter and exit tracepoints are **separate probes** firing at different times; `tid` is the join key that ties the request to its result. `args->ret` on the exit tracepoint is the syscall return: `>= 0` is the fd, negative is `-errno` (e.g. `-2` = `ENOENT`). Same stash-on-enter / use-and-delete-on-exit discipline as C1.
What you learned: Joining two probes by tid, and reading negative-errno return conventions.
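Reading the negative-errno convention is easier with a decoder at hand. The sketch below uses Python's standard `errno` table; the function name `describe_ret` is made up for this example, and the message text comes from the host's C library, so it can differ between systems.

```python
import errno
import os


def describe_ret(ret):
    """Turn a raw syscall return value into a readable string."""
    if ret >= 0:
        return f"fd={ret}"                       # success: the new file descriptor
    code = -ret                                  # failure: the kernel returned -errno
    name = errno.errorcode.get(code, "UNKNOWN")  # e.g. 2 -> "ENOENT"
    return f"-{name} ({os.strerror(code)})"


if __name__ == "__main__":
    for value in (3, -2, -13):
        print(value, "=>", describe_ret(value))
```

Pasting the values printed by the trace into a helper like this saves a trip to `errno.h` during an incident.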
C3 — Scope a trace to a single PID¶
Goal: Cut noise by tracing exactly one process.
Setup: Pick a target: pgrep -n firefox, or start sleep 1000 & and note its PID.
Do it: Count syscalls by name only for that PID.
Solution / Hint
A one-liner that fits (replace `N` with the target PID): `sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* /pid == N/ { @[probe] = count(); }'`. The `/pid == N/` **predicate** runs first; the action only executes when it's true, so you filter in-kernel before any aggregation. Note `pid` here is the userspace PID (thread-group id), which is exactly what `pgrep` returns, so the match is correct. (Many BCC tools accept `-p PID` for the same effect, e.g. `opensnoop-bpfcc -p $TARGET`.)
What you learned: In-kernel filtering with predicates, and the pid vs tid distinction in a filtering context.
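The pid/tid distinction can be checked from user space without any probes. This is a minimal sketch, assuming Python 3.8+ on Linux; `split_pid_tgid` and `ids_in_worker` are hypothetical helper names. The bit layout shown is the one `bpf_get_current_pid_tgid()` uses in the capstone's kernel-side code: thread-group id in the upper 32 bits, thread id in the lower 32.

```python
import os
import threading


def split_pid_tgid(value):
    """Split the packed 64-bit value into (tgid, tid): high half, low half."""
    return value >> 32, value & 0xFFFFFFFF


def ids_in_worker():
    """Return (process id, native thread id) as observed inside a new thread."""
    seen = {}

    def worker():
        seen["pid"] = os.getpid()                 # thread-group id: bpftrace `pid`
        seen["tid"] = threading.get_native_id()   # kernel thread id: bpftrace `tid`

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return seen["pid"], seen["tid"]


if __name__ == "__main__":
    print("main  :", os.getpid(), threading.get_native_id())
    worker_pid, worker_tid = ids_in_worker()
    print("worker:", worker_pid, worker_tid)      # same pid, different tid
    packed = (worker_pid << 32) | worker_tid      # how the BPF helper packs the pair
    print("split :", split_pid_tgid(packed))
```

Both threads report the same process id, which is why a `/pid == N/` predicate matches every thread of the target while a map keyed by `tid` still keeps them apart.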
C4 — Watch TCP connects and retransmits with BCC¶
Goal: Diagnose network behavior without tcpdump.
Setup: Generate traffic: curl -s https://example.com >/dev/null in another terminal; for retransmits, hit a flaky/remote endpoint or use tc to inject loss.
Do it: Trace outbound TCP connections, then trace retransmissions.
Solution / Hint
Run `sudo tcpconnect-bpfcc`, then `sudo tcpretrans-bpfcc`. `tcpconnect` hooks the kernel's `tcp_v4_connect`/`tcp_v6_connect` path, giving you the connection endpoints and the responsible process. It is far lighter than packet capture because it only fires on connection setup. `tcpretrans` fires on the retransmit path; a steady stream of these points at packet loss or an overloaded peer. Both hook kernel internals (kprobes, or a tracepoint where the tool and kernel provide one), so behavior can vary slightly across kernels.
What you learned: Network-event tracing at the socket layer with tcpconnect/tcpretrans, and why it beats packet capture for "who is connecting / what is dropping."
C5 — uprobe your OWN binary's function (no source change, no recompile)¶
Goal: Instrument a user-space function in a running binary you didn't modify.
Setup: Write and compile a tiny program once, then leave it untouched.
// add.c
#include <stdio.h>
#include <unistd.h>

long add(long a, long b) { return a + b; }

int main(void) {
    for (long i = 0; ; i++) {
        volatile long r = add(i, i + 1);   // volatile keeps the call from being optimized out
        (void)r;
        sleep(1);
    }
}

gcc -O0 -o add add.c    # -O0 so 'add' isn't inlined away; symbol must survive
./add &                 # leave it RUNNING
Do it: Without recompiling or restarting ./add, trace each call to add and print its two arguments.
Solution / Hint
A one-liner that fits: `sudo bpftrace -e 'uprobe:./add:add { printf("add(%d, %d)\n", arg0, arg1); }'`. `uprobe:BINARY:SYMBOL` attaches to the function entry **in the live process's text**: no source edit, no recompile, no restart. `arg0`, `arg1` are the function arguments by calling convention (on x86-64 SysV: `rdi`, `rsi`). Use `uretprobe:./add:add { printf("=> %d\n", retval); }` for the return value. Caveats that bite people:
- The symbol must exist in the binary (don't `strip` it; non-static, exported functions are easiest).
- At `-O2` the compiler may inline `add`, leaving no entry point to hook. That's why we used `-O0`.
- For a function in a shared library, point the path at the `.so` (e.g. `uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc`).
What you learned: The core promise of dynamic instrumentation (hooking a running, unmodified binary) plus inlining and symbol-stripping gotchas.
C6 — Disk I/O latency histogram with BCC¶
Goal: See block-device latency as a distribution.
Setup: Generate I/O: dd if=/dev/zero of=/tmp/io.bin bs=1M count=500 oflag=direct (or a fio job) while tracing.
Do it: Plot block I/O latency, then break it down by disk.
Solution / Hint
Run `sudo biolatency-bpfcc` for the overall distribution, then `sudo biolatency-bpfcc -D` for one histogram per disk. `biolatency` is the BCC version of C1's pattern applied to the block layer: it timestamps requests at issue and computes latency at completion, bucketing into a log2 histogram. Reading the histogram matters more than the mean: a fat tail, or a second mode at high latency (a bimodal distribution), often reveals a struggling device or queue contention that an average would hide.
What you learned: Reading latency distributions (not averages) and applying the entry/exit pattern at the block layer via a prebuilt tool.
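A quick numeric illustration of how an average hides a slow mode. The latency values below are invented for the example (they are not measurements), and `summarize` is a hypothetical helper.

```python
import statistics

# Hypothetical sample: 95 fast I/Os at 100 us and 5 slow ones at 20,000 us.
LATENCIES_US = [100] * 95 + [20_000] * 5


def summarize(samples):
    """Return mean, median, and an index-based p99 for a list of latencies."""
    ordered = sorted(samples)
    p99 = ordered[int(0.99 * (len(ordered) - 1))]
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p99": p99,
    }


if __name__ == "__main__":
    print(summarize(LATENCIES_US))
```

The mean lands at a latency that no request actually experienced, while the median and p99 expose the two modes. A log2 histogram shows the same thing as two separate clusters of bars.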
Advanced¶
A1 — Off-CPU time analysis via sched:sched_switch¶
Goal: Find where a thread spends time blocked (waiting), not running.
Setup: Run something that blocks on I/O or locks (dd, a database, sleep-heavy app).
Do it: Measure how long each thread stays off-CPU between being switched out and switched back in, as a per-comm histogram.
Solution / Hint
sudo bpftrace -e '
tracepoint:sched:sched_switch {
    // record when the OUTGOING thread leaves the CPU
    @off[args->prev_pid] = nsecs;

    // measure how long the INCOMING thread was away
    $t = @off[args->next_pid];
    if ($t) {
        @us[args->next_comm] = hist((nsecs - $t) / 1000);
        delete(@off[args->next_pid]);
    }
}'
What you learned: Off-CPU vs on-CPU analysis, and harvesting both sides of sched_switch to measure blocked time.
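The bookkeeping in that script can be replayed offline to check one's understanding. This is a toy model only: the `(timestamp, prev_tid, next_tid)` tuples are invented stand-ins for `sched_switch` events, and `off_cpu_durations` is a hypothetical helper.

```python
def off_cpu_durations(switches):
    """Replay context switches and return (tid, time spent off-CPU) pairs."""
    off_since = {}                  # plays the role of the @off map
    durations = []
    for ts, prev_tid, next_tid in switches:
        off_since[prev_tid] = ts    # the outgoing thread leaves the CPU now
        started = off_since.pop(next_tid, None)   # read, then delete the slot
        if started is not None:     # skip threads we never saw switch out
            durations.append((next_tid, ts - started))
    return durations


# Assumed timeline on one CPU, timestamps in microseconds.
SWITCHES = [
    (0, 1, 2),       # thread 1 goes off-CPU, thread 2 starts running
    (300, 2, 1),     # thread 2 goes off, thread 1 returns after 300 us away
    (1000, 1, 2),    # thread 1 goes off, thread 2 returns after 700 us away
]

if __name__ == "__main__":
    print(off_cpu_durations(SWITCHES))
```

Each event does double duty: it starts the clock for the outgoing thread and stops it for the incoming one, which is why a single `sched_switch` probe is enough.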
A2 — Aggregate user+kernel stacks to find the culprit¶
Goal: Build a flame-graph-style "where is the time going" list.
Setup: A CPU-busy target (yes > /dev/null &, or your app under load); note its PID.
Do it: Sample stacks and rank the hottest user+kernel call paths.
Solution / Hint
A one-liner that fits (replace `N` with the target PID): `sudo bpftrace -e 'profile:hz:99 /pid == N/ { @[kstack, ustack] = count(); }'`. Each unique `[kstack, ustack]` pair becomes a map key; `count()` ranks them, so the bottom of the printed output (highest count) is your hot path. To render an actual flame graph, collapse the bpftrace output with `stackcollapse-bpftrace.pl` from Brendan Gregg's FlameGraph repository and pipe the result into `flamegraph.pl`, or use BCC's `profile-bpfcc -f` for folded format directly. Readable userspace frames need two things: symbols (un-stripped binaries) so addresses resolve to names, and a way to walk the stack (`-fno-omit-frame-pointer` builds, or DWARF unwinding where the tool supports it) so the frames are found in the first place.
What you learned: Stack aggregation as map keys for flame graphs, and why symbol/frame-pointer availability makes or breaks readability.
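To make "stacks as map keys" and the folded format concrete, here is a small sketch. The sample stacks are invented, `fold` is a hypothetical helper, and the output format (frames joined by `;` from root to leaf, then a space and the count) is the one `flamegraph.pl` consumes.

```python
from collections import Counter


def fold(samples):
    """Aggregate leaf-first stacks and emit folded lines, least frequent first."""
    counts = Counter(tuple(stack) for stack in samples)   # each stack is a key
    ranked = sorted(counts.items(), key=lambda kv: kv[1])
    # Folded format lists frames root-first, so reverse each leaf-first stack.
    return [f"{';'.join(reversed(stack))} {n}" for stack, n in ranked]


# Assumed samples, leaf frame first (the order stack traces are usually printed in).
SAMPLES = [["spin", "main"]] * 3 + [["read", "main"]]

if __name__ == "__main__":
    for line in fold(SAMPLES):
        print(line)
```

The hottest path ends up on the last line, matching how the map is printed with the highest count at the bottom.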
A3 — CPU profiling with profile:hz:99¶
Goal: Sampled CPU profiling — understand why 99 Hz.
Setup: Run a CPU-bound workload.
Do it: Sample the on-CPU process name 99 times per second across all CPUs for 10s.
Solution / Hint
A one-liner that fits: `sudo bpftrace -e 'profile:hz:99 { @[comm] = count(); } interval:s:10 { exit(); }'`. `profile:hz:99` fires a timer on **every CPU** 99 times a second. Why 99 and not 100? To avoid **lockstep**: if you sample at exactly 100 Hz you can phase-align with periodic kernel work (timers, the 100/250/1000 Hz scheduler tick) and systematically over- or under-count it. An off-by-one rate like 99 dodges that aliasing. The whole approach is *statistical*: more samples give a more accurate profile, but it can miss rare short events entirely. `interval:s:10 { exit(); }` bounds the run.
What you learned: Sampled (statistical) profiling, the 99 Hz anti-aliasing trick, and timer-bounded runs with interval.
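The lockstep effect is easy to reproduce in a toy simulation. Everything here is assumed for illustration: a periodic task that is busy for the first 1 ms of every 10 ms tick (so truly busy 10% of the time), a sampler that starts exactly on a tick boundary, and the helper name `busy_fraction`.

```python
TICK_US = 10_000     # assumed: periodic kernel work repeats every 10 ms (100 Hz)
BUSY_US = 1_000      # assumed: it runs for the first 1 ms of each period (10%)


def busy_fraction(sample_hz, seconds=10):
    """Fraction of timer samples that land while the periodic work is running."""
    samples = sample_hz * seconds
    hits = 0
    for k in range(samples):
        t_us = k * 1_000_000 // sample_hz    # time of the k-th sample, in us
        if t_us % TICK_US < BUSY_US:         # does the sample hit the busy window?
            hits += 1
    return hits / samples


if __name__ == "__main__":
    print("true busy share :", BUSY_US / TICK_US)
    print("sampled @ 100 Hz:", busy_fraction(100))
    print("sampled @  99 Hz:", busy_fraction(99))
```

At 100 Hz every sample hits the same phase of the tick, so the estimate depends entirely on where the sampler happened to start (here it sees the task every time; a different starting offset would miss it every time). At 99 Hz the sample point drifts across the tick and the estimate settles near the true share.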
A4 — fentry/fexit vs kprobe¶
Goal: Use the modern BPF trampoline hooks and know when they win.
Setup: Kernel 5.5+ with BTF (fexit needs 5.5+; both need BTF).
Do it: Histogram vfs_read latency with fentry/fexit and compare to the C1 kprobe version.
Solution / Hint
A script that fits: `sudo bpftrace -e 'fentry:vfs_read { @start[tid] = nsecs; } fexit:vfs_read /@start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'` (older bpftrace releases spell these probes `kfunc`/`kretfunc`). `fentry`/`fexit` attach via a BPF **trampoline** instead of a breakpoint trap, so they're lower-overhead than `kprobe`/`kretprobe`. Two more wins: `fexit` can read the **function arguments** (a `kretprobe` only sees the return value, which is why kretprobe-based scripts stash anything they need on the entry probe), and they take typed BTF arguments (`args->...`) instead of raw `arg0`. The cost: they require kernel 5.5+ and BTF, so they're less portable than kprobes, which work almost anywhere. Rule of thumb: **prefer `fentry`/`fexit` on modern kernels, fall back to kprobes for portability.**
What you learned: BPF trampolines vs kprobe traps, that fexit sees args (kretprobe doesn't), and the portability trade-off.
A5 — Trigger a verifier rejection in BCC, then fix it¶
Goal: Understand the verifier by deliberately failing it.
Setup: BCC installed (python3-bpfcc).
Do it: Write a BCC program that dereferences a kernel-returned pointer without a NULL check, watch the verifier reject it, then add the check.
Solution / Hint
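**The rejected version.** A map lookup (`counts.lookup()`, i.e. `bpf_map_lookup_elem`) returns NULL when the key is not in the map, and we dereference the result blindly:
# verifier_fail.py
from bcc import BPF

prog = r'''
BPF_HASH(counts, u32, u64);

int hook(void *ctx) {
    u32 key = bpf_get_current_pid_tgid() >> 32;
    u64 *val = counts.lookup(&key);   // may be NULL: key not inserted yet
    *val += 1;                        // verifier: dereferencing possibly-NULL pointer
    return 0;
}
'''
# The program is verified when it is loaded for attachment, so this line fails.
BPF(text=prog).attach_kprobe(event="vfs_read", fn_name="hook")
**The fix.** Test the pointer before using it, e.g. put `if (!val) return 0;` ahead of the `*val += 1;` line. With the check in place the verifier can prove the dereference is safe and the program loads.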
What you learned: What the verifier checks and why, how to read a rejection, and that NULL-checks aren't optional — they're how you prove safety to the kernel.
Capstone¶
CAP1 — A real libbpf CO-RE tool (ringbuf openat tracer)¶
Goal: Write a portable, compile-once-run-everywhere BPF tool with a kernel-side program and a user-space loader — the production shape of eBPF tooling.
Setup: clang, llvm, libbpf-dev, bpftool, BTF kernel. Generate the kernel type header once: bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
Do it: Build a tool that pushes one event per openat (pid, comm, filename) through a ringbuf to userspace, which prints them. CO-RE (Compile Once – Run Everywhere) means the single compiled object relocates field offsets against the running kernel's BTF, so it runs on kernels you never compiled against.
Solution / Hint
**Kernel side, `opensnoop.bpf.c`:**
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct event {
    __u32 pid;
    char comm[16];
    char filename[128];
};

/* ring buffer: 256 KB */
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} rb SEC(".maps");

SEC("tp/syscalls/sys_enter_openat")
int handle_openat(struct trace_event_raw_sys_enter *ctx)
{
    struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e)                      // verifier-required NULL check
        return 0;

    e->pid = bpf_get_current_pid_tgid() >> 32;   // upper 32 bits = tgid (userspace PID)
    bpf_get_current_comm(&e->comm, sizeof(e->comm));

    const char *fn = (const char *)ctx->args[1]; // openat: filename is arg #1
    bpf_probe_read_user_str(e->filename, sizeof(e->filename), fn);

    bpf_ringbuf_submit(e, 0);
    return 0;
}
**Build the BPF object and generate the skeleton:**
clang -g -O2 -target bpf -D__TARGET_ARCH_x86 \
    -c opensnoop.bpf.c -o opensnoop.bpf.o
bpftool gen skeleton opensnoop.bpf.o > opensnoop.skel.h
**User side (loader):**
#include <stdio.h>
#include <signal.h>
#include <bpf/libbpf.h>
#include "opensnoop.skel.h"

/* must match the kernel-side struct event field for field */
struct event { unsigned int pid; char comm[16]; char filename[128]; };

static volatile sig_atomic_t stop = 0;
static void on_sig(int s) { (void)s; stop = 1; }

/* called once per record drained from the ring buffer */
static int on_event(void *ctx, void *data, size_t len) {
    const struct event *e = data;
    printf("%-8u %-16s %s\n", e->pid, e->comm, e->filename);
    return 0;
}

int main(void) {
    struct ring_buffer *rb = NULL;
    int err = 1;                                   // non-zero exit unless we finish cleanly

    struct opensnoop_bpf *skel = opensnoop_bpf__open_and_load(); // open + verify + load
    if (!skel) { fprintf(stderr, "open_and_load failed\n"); return 1; }
    if (opensnoop_bpf__attach(skel)) { fprintf(stderr, "attach failed\n"); goto out; }

    rb = ring_buffer__new(bpf_map__fd(skel->maps.rb), on_event, NULL, NULL);
    if (!rb) { fprintf(stderr, "ringbuf new failed\n"); goto out; }

    signal(SIGINT, on_sig);
    printf("%-8s %-16s %s\n", "PID", "COMM", "FILENAME");
    while (!stop)
        ring_buffer__poll(rb, 100 /*ms*/);         // drain events, 100ms timeout

    ring_buffer__free(rb);
    err = 0;
out:
    opensnoop_bpf__destroy(skel);
    return err;
}
What you learned: The full production eBPF shape — vmlinux.h + CO-RE relocations, a ringbuf map, a SEC() tracepoint handler, and a skeleton-driven libbpf loader.
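The ring buffer carries raw bytes, so the user-space reader has to agree with the kernel side on the exact layout of `struct event`. The sketch below mirrors that layout with Python's `ctypes` and decodes a hand-built record; the sample bytes are fabricated for the example, `decode_event` is a hypothetical helper, and this is not how the C loader above works internally.

```python
import ctypes
import struct


class Event(ctypes.Structure):
    """Mirror of the C `struct event` shared by the BPF program and the loader."""
    _fields_ = [
        ("pid", ctypes.c_uint32),
        ("comm", ctypes.c_char * 16),
        ("filename", ctypes.c_char * 128),
    ]


def decode_event(raw):
    """Decode one ring-buffer record into (pid, comm, filename)."""
    e = Event.from_buffer_copy(raw)
    # char-array fields come back as bytes truncated at the first NUL
    return (e.pid,
            e.comm.decode(errors="replace"),
            e.filename.decode(errors="replace"))


# Fabricated record in native byte order; struct.pack NUL-pads the strings.
SAMPLE = struct.pack("=I16s128s", 1234, b"cat", b"/etc/hostname")

if __name__ == "__main__":
    print("record size:", ctypes.sizeof(Event))
    print(decode_event(SAMPLE))
```

If the two definitions drift apart (a field reordered or resized on one side only), the reader silently decodes garbage, which is why many tools keep the struct in one shared header.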
CAP2 — Trace USDT probes in a JVM or CPython¶
Goal: Hook statically-defined user probes baked into a runtime — zero source change to your app.
Setup: A runtime built with USDT/DTrace probes. The JVM has them in most distributions (try --enable-dtrace builds / OpenJDK); CPython needs to be configured --with-dtrace (stock distro Python usually isn't, so you may need to build it or use a tracing-enabled build).
Do it: First list the probes a binary exposes, then trace one.
Solution / Hint
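**List probes** in a target (point the path at the actual binary/lib):
sudo bpftrace -l 'usdt:/usr/lib/jvm/.../lib/server/libjvm.so:*'   # JVM
sudo bpftrace -l 'usdt:/usr/local/bin/python3.12:*'               # CPython --with-dtrace
**Trace one** by attaching to a name taken from that listing. The probe spec has the form `usdt:PATH:PROVIDER:NAME`, followed by the usual `{ action }` block.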
What you learned: USDT as stable, author-blessed probe points; how to list them; and that they only exist when the runtime was built with them enabled.
CAP3 — A PID-targeted incident toolkit .bt script¶
Goal: Package everything into one reusable tool an on-call engineer can point at a misbehaving PID.
Setup: Save the script, chmod +x, pick a victim PID.
Do it: Write incident.bt that takes a PID as $1 and, for that PID, reports (a) syscall latency as a histogram and (b) the top syscalls by count — printing a summary every 5 seconds.
Solution / Hint
#!/usr/bin/env bpftrace
// incident.bt, usage: sudo ./incident.bt <PID>

BEGIN {
    printf("Tracing PID %d ... Ctrl-C to stop. 5s summaries.\n", $1);
}

// stash entry time per thread, only for our target PID
tracepoint:raw_syscalls:sys_enter /pid == $1/ {
    @start[tid] = nsecs;
    // key by syscall number: `probe` is the same string for every raw_syscalls hit
    @count[args->id] = count();
}

// on exit, compute latency for matched entries
tracepoint:raw_syscalls:sys_exit /pid == $1 && @start[tid]/ {
    @latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}

// periodic snapshot
interval:s:5 {
    time("\n=== %H:%M:%S ===\n");
    print(@latency_us);
    print(@count, 5);    // top 5 syscall numbers only
    clear(@latency_us);
    clear(@count);
}

END { clear(@start); }
What you learned: Composing probes, predicates, args, aggregation, and intervals into a single deployable triage tool — the practical payoff of everything above.
If you can do all of these, you have the senior level¶
You can attach to any layer of a running system — syscalls, kernel functions (kprobe and fentry/fexit), user functions (uprobe), and runtime USDT markers — without editing, recompiling, or restarting a thing. You instinctively reach for aggregation over per-event printf, key latency by tid with delete-on-exit, scope to a PID to kill noise, and read distributions instead of averages. You understand the verifier well enough to predict and fix a rejection, you know when fentry/fexit beats a kprobe (and when portability sends you back to kprobes), and you can build a real libbpf CO-RE tool — vmlinux.h, a ringbuf, a SEC() handler, a skeleton loader — that ships as one binary and runs across kernel versions. If you can hand an on-call engineer a .bt script that takes a PID and surfaces the culprit in five seconds, you're operating at the senior level.
Related Topics¶
- Continuous Profiling — always-on, fleet-wide stack sampling (the productionized cousin of A2/A3).
- Tracing — distributed request tracing; eBPF can feed and correlate with it.
- Metrics — aggregate counters/histograms; eBPF maps are a metrics source.
- Logging — event records; contrast with eBPF's structured, low-overhead events.
- Debugging — interactive and post-mortem techniques alongside live instrumentation.
- Observability Engineering — how instrumentation, traces, metrics, and logs combine into a coherent practice.