Why profiling is amazing
So why is profiling amazing? It’s not a new concept, but it has definitely been gaining momentum lately, so it’s getting easier and easier to apply, even in production environments. Profiling gives you visibility into which parts of your code consume the most resources, and that visibility exposes interesting opportunities for performance improvements.
Deterministic Python profilers
Let’s talk about profiler types, focusing on Python profilers. We’ll start with deterministic profilers, also called tracing profilers. They track your program’s execution deterministically, for example by instrumenting every code path, so the results account for every call rather than a sample.
They are very common and many implementations exist. Probably the most well-known is cProfile, which is included in the Python standard library. To the right of this slide, we can see example output from it.
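The cProfile output itself isn’t reproduced in this transcript, but a minimal run looks something like this (work() is just a made-up example workload):

```python
import cProfile

def work():
    # Arbitrary CPU-bound example workload.
    return sum(i * i for i in range(100_000))

# Deterministically trace every function call made by work() and print a
# table of call counts and timings, sorted by cumulative time.
cProfile.run("work()", sort="cumulative")
```

You can get a similar table without touching the code by running the whole script under python -m cProfile, which is exactly the kind of deployment change discussed in the next section.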
Deterministic profiler limitations
Now, deterministic profilers are very useful during development, as they are versatile and give accurate metrics at the function and line-of-code level.
However, their intrusive design (the need to insert instrumentation into the code or the interpreter) means they can introduce significant overhead to the code execution.
They also might require code changes, for example to enable or disable the profiler, or deployment changes, for example starting your application under the profiler script.
These reasons make them less suitable for production use: you do not want to introduce overhead there, and you preferably don’t want to make any changes just for the sake of profiling.
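For instance, enabling and disabling the profiler around just the code you care about means edits like this (a minimal sketch; handle_request() is a hypothetical function standing in for your hot path):

```python
import cProfile
import pstats

def handle_request():
    ...  # the code you actually want to measure

profiler = cProfile.Profile()
profiler.enable()    # code change: turn profiling on around the hot path
handle_request()
profiler.disable()   # code change: turn it off again
pstats.Stats(profiler).sort_stats("tottime").print_stats(10)
```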
Statistical Python profilers
Another profiler type is statistical profilers. These work by taking snapshots (or “samples”) of your application at a set interval, for example every millisecond or even every microsecond, instead of continuously tracking everything that happens. Over enough time, the accumulated samples paint an accurate picture of your application.
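To make the sampling idea concrete, here is a toy, in-process sampler built on CPython’s sys._current_frames() (an implementation detail, and unlike py-spy it runs inside the process), just to show the principle: periodically grab every thread’s stack and count what you see.

```python
import collections
import sys
import time
import traceback

def sample_stacks(interval_s: float = 0.001, duration_s: float = 5.0):
    """Toy statistical profiler: snapshot all threads' stacks every
    `interval_s` seconds and count how often each frame shows up."""
    counts = collections.Counter()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for frame in sys._current_frames().values():  # one frame per thread
            for entry in traceback.extract_stack(frame):
                counts[(entry.filename, entry.lineno, entry.name)] += 1
        time.sleep(interval_s)
    # Frames that appear in many samples are where the time is being spent.
    return counts

# Usage sketch: run sample_stacks() in a background thread while the
# application does its work, then inspect counts.most_common(10).
```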
One common example is py-spy, a sampling profiler written in Rust. The image also shows one way to visualize py-spy’s output, called a “flame graph,” which shows the relative execution time of the different functions and flows in your application.
Statistical profiler limitations
Since the samples can be taken from outside the process, these profilers can run externally to the application, meaning they are not intrusive and (to a large extent) add no overhead to the application itself. The profiler is still a program running on the system, so it does add some overhead to the system as a whole. We’ll come back to that overhead when we reach eBPF.
Since they’re not intrusive, we don’t need to make any changes to the code or the deployment. For example, py-spy can start profiling any running Python process just by being given its process ID, which is very convenient. For these reasons, statistical profilers are much more suitable, and safer, to use in production.
Deterministic profilers are generally more versatile, so in development environments, when you want to accurately measure a specific function or module, you might still want to use them.
About eBPF
A primer on eBPF: it’s a technology that evolved from the original Berkeley Packet Filter (BPF), released in 1992, a kernel mechanism that lets the user define “filter programs” for sockets, like the one displayed on the screen. It was used mostly by sniffing programs such as tcpdump. A filter program essentially runs on a small “virtual machine” with a set of opcodes and operations it can perform on packet data.
For example, the program displayed here checks whether the packet’s source or destination IP address is the local host and whether the source or destination port is 80. And on the screen you can see the eBPF assembly instructions for that program.
Now, jumping forward to 2022, this simple interpreter for user programs has been extended with many more APIs, which are no longer limited to packet inspection.
Also, programs can now be attached to virtually any logical point in the kernel, not just to the arrival of packets. Together, these changes make eBPF the most capable tracing and observability infrastructure in Linux.
eBPF – low overhead, system-wide tracing
Here’s a short example: To the right, we have the code of an eBPF program called “opensnoop.” It’s written in a language called bpftrace, which is later compiled to the same eBPF assembly we saw. You can read about bpftrace online. The program “hooks” onto the open() system call and thus intercepts all open() calls throughout the system.
To the left, you can see sample output from running it on my box: all sorts of PIDs and programs opening different files. Notice how relatively easy it is to write this simple code that attaches to a syscall and traces every call throughout the system. And all of this comes with a negligible performance effect, which is something we simply didn’t have before eBPF.
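The bpftrace source from the slide isn’t reproduced here, so as a rough stand-in, here is a minimal sketch in the same spirit using BCC’s Python bindings instead of bpftrace. It hooks the entry of the openat() syscall and needs root; the real opensnoop tools also report the file name and flags.

```python
from bcc import BPF

# eBPF program (C source compiled by BCC): runs on every openat() entry
# and logs the calling process's PID to the kernel trace pipe.
prog = r"""
int trace_openat(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    bpf_trace_printk("openat() called by pid %d\n", pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("openat"), fn_name="trace_openat")
print("Tracing openat() system-wide... hit Ctrl-C to stop")
b.trace_print()  # stream the bpf_trace_printk output
```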
eBPF: A new type of software
This table describes the differences between standard user code, kernel code, and eBPF.
The core thing to take away is that eBPF is safe by design: a “verifier” mechanism ensures that only safe programs execute. This also means eBPF programs cannot do anything they please; for example, they cannot make arbitrary system calls or write arbitrarily to memory.
On the other hand, eBPF programs have fast access to resources such as memory. For example, they can access the memory of the currently running Python application much faster than py-spy can, because py-spy runs as an external application and has to issue system calls to read the Python application’s memory.
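To illustrate the difference: a userspace profiler has to go through the kernel for every remote read, for example via /proc/&lt;pid&gt;/mem (py-spy typically uses the process_vm_readv() syscall instead, but the cost profile is similar), whereas an eBPF program reads that memory directly from kernel context. A minimal sketch of the userspace side:

```python
def read_remote_memory(pid: int, address: int, size: int) -> bytes:
    """Read another process's memory the way an external profiler must:
    through system calls (open/lseek/read on /proc/<pid>/mem).
    Requires ptrace permission on the target process."""
    with open(f"/proc/{pid}/mem", "rb", buffering=0) as mem:
        mem.seek(address)      # lseek() syscall to the remote address
        return mem.read(size)  # read() syscall copies the bytes out
```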
Python profiling in eBPF
Now let’s get back to Python. We needed a low-overhead sampling profiler that could sample at high frequency and could easily profile all Python applications running on the system. Plus, we wanted it to be able to extract native stacks and kernel stacks.
Py-spy, while not introducing overhead on the application itself, does have some overhead on the system: as I said, it needs to access the Python process’s memory to extract stack traces, and it issues a lot of syscalls to do that, which takes time. Py-spy simply wasn’t fast enough when we needed to profile a large Python application with hundreds of threads at high frequency.
The eBPF approach
So we started looking into the eBPF approach and quickly found PyPerf (https://github.com/iovisor/bcc/pull/2239), which was posted to BCC as a proof of concept of an eBPF-based Python profiler. By the way, we also found a project called rbperf, which is like PyPerf for Ruby, but that’s a different story. We then spent a while adding many new features to PyPerf, trying to make it the best Python sampling profiler.
First of all, we made sure it supports all currently available Python versions. We made it a system-wide profiler; that is, it profiles all running Python applications on the system, unlike py-spy, which works per process. If I want to profile 50 Python processes, I need to invoke 50 separate py-spy instances, which introduces more overhead. With PyPerf, I invoke it once and it profiles the entire system.
Additionally, we added logic to extract native stacks, covering Python extensions (for example json, pickle, or numpy), interpreter code, and native libraries. We also extract kernel stacks, which show, for example, the system calls your application is making.
These new features were relatively easy to add because PyPerf is eBPF-based; they would have been much harder, if not impossible, had it not been built on eBPF.
So here’s an example of how it looks. This is a simple gunicorn application. The yellow rectangles are the Python frames from the Python application, the purple frames are native code, and the orange frames are kernel code. Together, the three paint a very accurate picture of the application’s execution.
Native code matters!
Now, I’ve been talking a lot about native code, which many profilers intentionally overlook. Their argument is that developers should only care about the Python code, because they have no control over the native code anyway, so the native frames and stacks are just unwanted noise.
However, from our experience, taking the native profile into account is very important when you want to truly understand what’s going on and which Python-level operations are taking the most CPU time. That’s why we invested in making this feature work perfectly in PyPerf.
So now we’ll do a small exercise. I have a function written here; can you read its code and guess which of its operations take the most time? I’ll give you a minute to think, and then we’ll check the results. Actually, since this is recorded, you can just pause and resume when you’re ready. I’ll continue now.
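The function from the slide isn’t reproduced in this transcript, so here is a hypothetical stand-in with the same ingredients the discussion below refers to: string concatenation in a loop, chr(), the modulo operator, and a compression call at the end.

```python
import zlib

def func(n: int = 1_000_000) -> bytes:
    out = ""
    for i in range(n):
        # One string concatenation, one chr() call, and one modulo per iteration.
        out += chr(ord("a") + i % 26)
    # Compress the result once at the end.
    return zlib.compress(out.encode())
```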
Here I’ve cut out the relevant native profile of this function. The bottom-most frame is the Python function itself, and all the frames above it are the native functions that our Python function (I’ve named it func) is calling. I’ve added some arrows to show what comes from where, and we can see something I originally did not expect to see in the profile.
For example, I didn’t expect the string concatenation, which you can see to the right, to take such a relatively large share of the profile, yet it actually ranked first. I also didn’t expect the chr() calls to take up as much as they do, and the modulo operator takes a surprisingly large share as well. I only realized all of this once I looked at the native profile.
How to improve Python code
What I’m trying to say is that once we observe the native profile of even a simple Python function, we can quickly come up with ideas for improving the Python code. For example, after viewing this profile, I know the most important optimization is to switch from string concatenation to StringIO. After that, the next thing I would do is probably cache the results of chr(), and then I would try to avoid the modulo operator.
Meanwhile, the compression, which I thought would take a large part of the profile, takes almost nothing; you can see it in the middle, and it’s rather small. So you need to look at the native profile to truly understand how even a simple Python function divides its execution time.
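Putting those ideas together on the hypothetical function above, an optimized sketch could look like this: StringIO instead of concatenation, the 26 possible characters precomputed so chr() leaves the loop, and itertools.cycle standing in for the modulo arithmetic.

```python
import io
import itertools
import zlib

# Cache the 26 possible characters once, outside the hot loop.
ALPHABET = [chr(ord("a") + i) for i in range(26)]

def func_optimized(n: int = 1_000_000) -> bytes:
    buf = io.StringIO()  # append into a buffer instead of rebuilding a str
    write = buf.write
    # Cycling over the cached characters avoids both chr() and the modulo
    # operator inside the loop.
    for ch in itertools.islice(itertools.cycle(ALPHABET), n):
        write(ch)
    return zlib.compress(buf.getvalue().encode())
```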
PyPerf is a part of Intel Tiber App-Level Optimization’s Profiler
So that’s it on PyPerf; I hope the last part was interesting. PyPerf is part of Intel Tiber App-Level Optimization’s profiler, our system-wide, continuous profiler for production environments. It supports numerous runtimes, not only Python but also Java, Go, Rust, and Ruby. Check it out; it’s open source.
Thank you. Feel free to DM or connect on LinkedIn or GitHub (/jongy), and please try our code profiler; it’s fun.
To start free profiling with the open-source profiler, visit profiler.granulate.io
This is a transcript of Low Overhead Python Application Profiling using eBPF. You can find the full video here.