Back here, I was trying to understand the receive flow of a specific TCP packet (which functions does it go through).
The TCP receive entry-point function is tcp_v4_rcv
. We can function_graph
it, but we’ll soon encounter a problem — there are heaps of data generated by graph-tracing this function: On my desktop, under a not-so-heavy network load, I was getting a 5MB/s trace file! That’s a lot of data.
There are 2 limitations pointed out here:
- The calls didn’t undergo any filtering — all calls were written to the trace file. I wanted to see the graph of
tcp_v4_rcv
on a specific packet, not on all TCP packets processed by the system.
- There’s no information attached to the calls (for example, function arguments). Take a look on the sample graph above — there are only function names and durations. How can we differentiate the “interesting” calls?
Existing filters?
Let’s seek help from existing ftrace features. For this purpose, we can start with the very long documentation.
ftrace can record the current PID/TGID and the command (/proc/pid/comm
), and report them in the trace file. As far as I can tell, no more specific details are available to be recorded on each entry.
ftrace also supports filtering by PID: Using the set_ftrace_pid
file, we can limit the traced calls to specific processes. For example, this will limit traces to the init
process:
# cd /sys/kernel/tracing
# echo 1 > set_ftrace_pid
Sounds nice, eh? For many filtering use-cases, this is just enough: A simple program that writes its PID to set_ftrace_pid
and then invokes some kernel code you wish to trace.
In the network RX case, it’s impossible⁴: Network packets are processed in software interrupts⁵ running possibly on any CPU, in the context of virtually any thread of the system. There goes filtering by PID…
I’m not aware of any other sophisticated filters allowed by ftrace. We can try to filter out the events externally (e.g in the case of network RX — filter away unwanted packets with iptables
, this way they won’t end up in tcp_v4_rcv
)…
… or, we can tweak ftrace to provide the filtering we need!
Tweaking ftrace Filtering
All following Linux references are based on kernel 5.0+.
What I had in mind: ftrace has its PID-based filtering. I can provide external filtering, and after a call, I wish to trace passes my filter, the PID of the currently running process can be “added” to set_ftrace_filter
, effectively making the process traced (and I’ll get the call graph I desired)
What happens when you write to set_ftrace_pid
? A quick search of this string yields ftrace_init_tracefs
which registers the tracefs
file; And ftrace_pid_write
actually implements the write.
ftrace_pid_write
is rather long for its seemingly simple purpose. Summing it up, this function does 3 things:
- Creates a
trace_pid_list
struct from the input PIDs (you can write multiple PIDs for tracing), and stores it for later use.
- Registers a
sched_switch
event handler called ftrace_filter_pid_sched_switch_probe
.
- Updates tracers that use PID filtering to have
ftrace_pid_func
as their tracing handler.
Effectively, the 3rd step enables filtering. ftrace_pid_func
is very simple: It is called instead of the original tracing handler and checks whether the current task should be traced or not, by reading tr->trace_buffer.data->ftrace_ignore_pid
. Iff tracing should be performed, it calls the original handler.
Let’s see where this field is updated. ftrace_pid_write
has previously registered ftrace_filter_pid_sched_switch_probe
for the sched_switch
event. This handler will be called for each task switch in the system. It knows which is the next task to run.
static void
ftrace_filter_pid_sched_switch_probe(void *data, bool preempt,
struct task_struct *prev, struct task_struct *next)
{
struct trace_array *tr = data;
struct trace_pid_list *pid_list;
pid_list = rcu_dereference_sched(tr->function_pids);
this_cpu_write(tr->trace_buffer.data->ftrace_ignore_pid,
trace_ignore_this_task(pid_list, next));
}
It calls trace_ignore_this_task
to see whether the new task’s PID was marked for tracing:
bool
trace_ignore_this_task(struct trace_pid_list *filtered_pids, struct task_struct *task)
{
/*
* Return false, because if filtered_pids does not exist,
* all pids are good to trace.
*/
if (!filtered_pids)
return false;
return !trace_find_filtered_pid(filtered_pids, task->pid);
}
The result is written into the ftrace_ignore_pid
field, itself a percpu⁶ variable. This is a caching optimization: instead of filtering PIDs through the list whenever a task hits a traced function, the check of a PID is done once on task switch, and later the tracing decision is a mere cache read of this single boolean. Neat, and a very nice use of percpu as well! The result will be automatically invalidated upon the next call to ftrace_filter_pid_sched_switch_probe
, which will write a new value for the next task.
I think we know enough. If we can update the percpu ftrace_ignore_pid
, we can control whether tracing is enabled for the current CPU. If we can set it once we want to start tracing and unset it once we want to stop, we can use it to limit filtering at will.
So, I started writing a kernel module to work this out. Shortly enough, I recalled I worked on a MicroPython port for the Linux kernel⁷ just for such occasions! Using it, we can quickly define a few hooks that’ll do the job.
Our goal: given an input TCP packet that matches a precondition (say, destination port is 9000) we’d like to get its ftrace call-graph of tcp_v4_rcv
.
We’ll start with a hook that’ll do the actual filtering on the packets. It has to be called before ftrace-graph, otherwise, we won’t be able to control the filtering! If I hook tcp_v4_rcv
itself with ftrace, it runs after the ftrace-graph filtering decision, which is too late for us. I’m sure there’s a way around it, but I didn’t bother checking — instead, I hooked the tcp_protocol->handler
pointer, which is used to call tcp_v4_rcv
. This way, we’re surely called before it.
def my_pre_tcp_v4_rcv(skb):
# TODO: do the filtering stuff...
return tcp_v4_rcv(skb)
cb = callback(my_pre_tcp_v4_rcv)
net_protocol_s = partial_struct("net_protocol")
net_protocol_s(tcp_protocol).handler = cb.ptr()
At this point, we have the sk_buff
and we can make the tracing decision. If we wish to trace, we can write to tr->trace_buffer.data->ftrace_ignore_pid
… Which is not in the exported headers, sadly.
I wanted this to be fully compatible with standard kernel headers, so accessing unexported structs is a bad idea. A workaround is to hook trace_ignore_this_task
with a simple handler that returns 1
only for our traced task. Then we can call ftrace_filter_pid_sched_switch_probe
, as if a scheduler switch has happened. It will update the tracing status for the current task — and since we forced trace_ignore_this_task()
to return 1
, the current task will be marked for tracing.
traced_task = No
def my_trace_ignore_this_task(orig, filtered_pids, task):
return 0 if task == trace_task else 1
kp = kprobe("trace_ignore_this_task", KP_ARGS_MODIFY, my_trace_ignore_this_task)
def my_pre_tcp_v4_rcv(skb):
global trace_task
# insert filtering here ...
# to enable tracing:
trace_task = current()
ftrace_filter_pid_sched_switch_probe(global_trace, False, None, current())
return tcp_v4_rcv(skb)
Funny anecdote: ftrace won’t allow placing hooks on ftrace own code (such as trace_ignore_this_task
). But you can place kprobe
hooks, which are sometimes backed by ftrace, but in this case, it’s useful they are not 🙂
Using these 2 hooks, I can enable filtering based on the input sk_buff
. The final snippet is available here.
It can be pasted into the Python shell in paste mode to enable the hooks. Finally, to start tracing we’ll do the following:
# cd /sys/kernel/tracing
# echo 2 > set_ftrace_pid
# echo tcp_v4_rcv > set_graph_function
# echo function_graph > current_tracer
2
is the PID of kthreadd
— we need to write something into set_ftrace_pid
, so PID-based filtering is enabled. This kernel thread is mostly asleep (it’s the thread from which new kernel threads are forked. As it isn’t scheduled in often, the chances it’ll handle any software interrupts are rare).
That’s it! After enabling ftrace and sending the to-be-filtered packet, I received exactly one trace in the trace buffer! It took some time, but this trace sure was helpful in helping me understand the relevant flow.
What we can take from here, besides the new ability of dynamic filtering, is this: As long as you know exactly what you seek, you can do it in Linux, even if it’s not supported in the code. All it takes is a bit of digging in the code, and perhaps adding a tiny patch here and there…
Note: The idea presented here – and the kernel Python itself – are both experimental. Do NOT try them on a production system. They are given as research & debugging tools for use on test machines only!