Dynamic Ftrace Filtering

Recently, while conducting research, I wanted to follow a specific kernel code flow: I was about to modify the behavior of an existing kernel mechanism, and I needed to understand the flow of data between different functions.

As with any open source project you’re trying to analyze, your go-to assistant is the source code. Linux, besides being open source, provides tons of dynamic tools for debugging, and they are a great supplement to static reading of the code.

In this post, I’ll focus on one of these tools, ftrace, and present a modification I made to make its filtering capabilities more versatile.

ftrace

At its base, ftrace (Function Tracer) is a dynamic function instrumentation infrastructure. It can be used to set dynamic traces on virtually all kernel functions, and also supports a large set of static tracepoints, used to record core kernel events. It is available in most modern Linux distributions.

It’s a powerful tool for tracing flows and events in the kernel. Providing both in-kernel API and usermode control interface via tracefs, it’s arguably one of the most comprehensive tracing tools in Linux.

ftrace has many modes of operation, in this post I’ll focus on the function_graph mode: It allows you to record the call graph originating from a given function, recursively.

In case you’ve never used it (or just as a slight reminder) — here’s how it looks when tracing the kernel path for a vfs_read call on /proc/version:

  8)               |  vfs_read() {
  8)               |    rw_verify_area() {
  8)               |      security_file_permission() {
  8)               |        apparmor_file_permission() {
  8)               |          common_file_perm() {
  8)   0.110 us    |            aa_file_perm();
  8)   0.322 us    |          }
  8)   0.483 us    |        }
  8)   0.079 us    |        __fsnotify_parent();
  8)   0.082 us    |        fsnotify();
  8)   1.004 us    |      }
  8)   1.180 us    |    }
  8)               |    __vfs_read() {
  8)               |      proc_reg_read() {
  8)               |        seq_read() {
  8)   0.106 us    |          rcu_all_qs();
  8)               |          kvmalloc_node() {
  8)               |            __kmalloc_node() {
  8)   0.099 us    |              kmalloc_slab();
  8)   0.086 us    |              rcu_all_qs();
  8)   0.098 us    |              should_failslab();
  8)   0.128 us    |              memcg_kmem_get_cache();
  8)   0.092 us    |              memcg_kmem_put_cache();
  8)   1.562 us    |            }
  8)   1.743 us    |          }
  8)   0.089 us    |          single_start();
  8)               |          version_proc_show() {
  8)               |            seq_printf() {
  8)   0.836 us    |              seq_vprintf();
  8)   1.004 us    |            }
  8)   1.158 us    |          }
....

That call graph can provide great aid in understanding flow, alongside reading the code.

This isn’t a starters guide for ftrace, so I will assume you know its basics from this point onward. If you want a longer introduction on ftrace and its additional modes, I recommend reading the articles in [1], [2] and [3].

Back here, I was trying to understand the receive flow of a specific TCP packet (which functions does it go through).

The TCP receive entry-point function is tcp_v4_rcv. We can function_graph it, but we’ll soon encounter a problem — there are heaps of data generated by graph-tracing this function: On my desktop, under a not-so-heavy network load, I was getting a 5MB/s trace file! That’s a lot of data.

There are 2 limitations pointed out here:

The calls didn’t undergo any filtering — all calls were written to the trace file. I wanted to see the graph of tcp_v4_rcv on a specific packet, not on all TCP packets processed by the system.
There’s no information attached to the calls (for example, function arguments). Take a look on the sample graph above — there are only function names and durations. How can we differentiate the “interesting” calls?

Existing filters?

Let’s seek help from existing ftrace features. For this purpose, we can start with the very long documentation.

ftrace can record the current PID/TGID and the command (/proc/pid/comm), and report them in the trace file. As far as I can tell, no more specific details are available to be recorded on each entry.

ftrace also supports filtering by PID: Using the set_ftrace_pid file, we can limit the traced calls to specific processes. For example, this will limit traces to the init process:

# cd /sys/kernel/tracing
# echo 1 > set_ftrace_pid

Sounds nice, eh? For many filtering use-cases, this is just enough: A simple program that writes its PID to set_ftrace_pid and then invokes some kernel code you wish to trace.

In the network RX case, it’s impossible⁴: Network packets are processed in software interrupts⁵ running possibly on any CPU, in the context of virtually any thread of the system. There goes filtering by PID…

I’m not aware of any other sophisticated filters allowed by ftrace. We can try to filter out the events externally (e.g in the case of network RX — filter away unwanted packets with iptables, this way they won’t end up in tcp_v4_rcv)…

… or, we can tweak ftrace to provide the filtering we need!

Tweaking ftrace Filtering

All following Linux references are based on kernel 5.0+.

What I had in mind: ftrace has its PID-based filtering. I can provide external filtering, and after a call, I wish to trace passes my filter, the PID of the currently running process can be “added” to set_ftrace_filter, effectively making the process traced (and I’ll get the call graph I desired)

What happens when you write to set_ftrace_pid? A quick search of this string yields ftrace_init_tracefs which registers the tracefs file; And ftrace_pid_write actually implements the write.

ftrace_pid_write is rather long for its seemingly simple purpose. Summing it up, this function does 3 things:

Creates a trace_pid_list struct from the input PIDs (you can write multiple PIDs for tracing), and stores it for later use.
Registers a sched_switch event handler called ftrace_filter_pid_sched_switch_probe.
Updates tracers that use PID filtering to have ftrace_pid_func as their tracing handler.

Effectively, the 3rd step enables filtering. ftrace_pid_func is very simple: It is called instead of the original tracing handler and checks whether the current task should be traced or not, by reading tr->trace_buffer.data->ftrace_ignore_pid. Iff tracing should be performed, it calls the original handler.

Let’s see where this field is updated. ftrace_pid_write has previously registered ftrace_filter_pid_sched_switch_probe for the sched_switch event. This handler will be called for each task switch in the system. It knows which is the next task to run.

static void
ftrace_filter_pid_sched_switch_probe(void *data, bool preempt,
                                     struct task_struct *prev, struct task_struct *next)
{
    struct trace_array *tr = data;
    struct trace_pid_list *pid_list;
    pid_list = rcu_dereference_sched(tr->function_pids);
    this_cpu_write(tr->trace_buffer.data->ftrace_ignore_pid,
            trace_ignore_this_task(pid_list, next));
}

It calls trace_ignore_this_task to see whether the new task’s PID was marked for tracing:

bool
trace_ignore_this_task(struct trace_pid_list *filtered_pids, struct task_struct *task)
{
   /*
    * Return false, because if filtered_pids does not exist,
    * all pids are good to trace.
    */
    if (!filtered_pids)
        return false;

    return !trace_find_filtered_pid(filtered_pids, task->pid);
}

The result is written into the ftrace_ignore_pid field, itself a percpu⁶ variable. This is a caching optimization: instead of filtering PIDs through the list whenever a task hits a traced function, the check of a PID is done once on task switch, and later the tracing decision is a mere cache read of this single boolean. Neat, and a very nice use of percpu as well! The result will be automatically invalidated upon the next call to ftrace_filter_pid_sched_switch_probe, which will write a new value for the next task.

I think we know enough. If we can update the percpu ftrace_ignore_pid, we can control whether tracing is enabled for the current CPU. If we can set it once we want to start tracing and unset it once we want to stop, we can use it to limit filtering at will.

So, I started writing a kernel module to work this out. Shortly enough, I recalled I worked on a MicroPython port for the Linux kernel⁷ just for such occasions! Using it, we can quickly define a few hooks that’ll do the job.

Our goal: given an input TCP packet that matches a precondition (say, destination port is 9000) we’d like to get its ftrace call-graph of tcp_v4_rcv.

We’ll start with a hook that’ll do the actual filtering on the packets. It has to be called before ftrace-graph, otherwise, we won’t be able to control the filtering! If I hook tcp_v4_rcv itself with ftrace, it runs after the ftrace-graph filtering decision, which is too late for us. I’m sure there’s a way around it, but I didn’t bother checking — instead, I hooked the tcp_protocol->handler pointer, which is used to call tcp_v4_rcv. This way, we’re surely called before it.

def my_pre_tcp_v4_rcv(skb):
    # TODO: do the filtering stuff...
    return tcp_v4_rcv(skb)

cb = callback(my_pre_tcp_v4_rcv)
net_protocol_s = partial_struct("net_protocol")
net_protocol_s(tcp_protocol).handler = cb.ptr()

At this point, we have the sk_buff and we can make the tracing decision. If we wish to trace, we can write to tr->trace_buffer.data->ftrace_ignore_pid… Which is not in the exported headers, sadly.

I wanted this to be fully compatible with standard kernel headers, so accessing unexported structs is a bad idea. A workaround is to hook trace_ignore_this_task with a simple handler that returns 1 only for our traced task. Then we can call ftrace_filter_pid_sched_switch_probe, as if a scheduler switch has happened. It will update the tracing status for the current task — and since we forced trace_ignore_this_task() to return 1, the current task will be marked for tracing.

traced_task = No

def my_trace_ignore_this_task(orig, filtered_pids, task):
    return 0 if task == trace_task else 1

kp = kprobe("trace_ignore_this_task", KP_ARGS_MODIFY, my_trace_ignore_this_task)

def my_pre_tcp_v4_rcv(skb):
    global trace_task
    # insert filtering here ...
    # to enable tracing:
    trace_task = current()
    ftrace_filter_pid_sched_switch_probe(global_trace, False, None, current())
    return tcp_v4_rcv(skb)

Funny anecdote: ftrace won’t allow placing hooks on ftrace own code (such as trace_ignore_this_task). But you can place kprobe hooks, which are sometimes backed by ftrace, but in this case, it’s useful they are not 🙂

Using these 2 hooks, I can enable filtering based on the input sk_buff. The final snippet is available here.

It can be pasted into the Python shell in paste mode to enable the hooks. Finally, to start tracing we’ll do the following:

# cd /sys/kernel/tracing
# echo 2 > set_ftrace_pid
# echo tcp_v4_rcv > set_graph_function
# echo function_graph > current_tracer

2 is the PID of kthreadd — we need to write something into set_ftrace_pid, so PID-based filtering is enabled. This kernel thread is mostly asleep (it’s the thread from which new kernel threads are forked. As it isn’t scheduled in often, the chances it’ll handle any software interrupts are rare).

That’s it! After enabling ftrace and sending the to-be-filtered packet, I received exactly one trace in the trace buffer! It took some time, but this trace sure was helpful in helping me understand the relevant flow.

What we can take from here, besides the new ability of dynamic filtering, is this: As long as you know exactly what you seek, you can do it in Linux, even if it’s not supported in the code. All it takes is a bit of digging in the code, and perhaps adding a tiny patch here and there…

Note: The idea presented here – and the kernel Python itself – are both experimental. Do NOT try them on a production system. They are given as research & debugging tools for use on test machines only!

[1] Debugging the kernel using Ftrace — https://lwn.net/Articles/365835/

[2] Ftrace: The hidden light switch — https://lwn.net/Articles/608497/

[3] ftrace: trace your kernel functions! — https://jvns.ca/blog/2017/03/19/getting-started-with-ftrace/

[4] Strictly speaking, this can be implemented by pinning a RX flow to a specific CPU (and its ksoftirqd/cpu thread). But I wanted a more generic solution.

[5] In Linux, software interrupts are a set of common, high priority jobs executed regularly in response to certain kernel/hardware events. Sometimes they complete work deferred from a hardware interrupt, but not necessarily. You can read more about them here: https://lwn.net/Articles/520076/.

[6] Per-CPU variables are variables that have multiple copies, one for each CPU of the system. Kernel code can easily access the percpu instance for currently executing CPU. If an object can be logically bound to a CPU, then defining it as a percpu variable will remove all locking needs, because each CPU has its own copy. This has many other uses, and you can read about its kernel implementation here: https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html.

[7] https://github.com/micropython/micropython/pull/5482

Dynamic ftrace Filtering

Yonatan Goldschmidt

Principal Engineer, Intel Granulate

ftrace

Existing filters?

Tweaking ftrace Filtering

Save on cloud costs.