Back to blog

The Scale Trap: How We Reduced CPU Utilization by 80%

Omer Mesika

Director of Solution Engineering, Intel Granulate

This is a transcript of The Scale Trap: How We Reduced Our CPU Utilization from 80% to 15% from DeveloperWeek Cloud. You can find the full video here.

Hey everyone, and thanks for joining my session. My name is Ezequiel Nulman, and I’m the R&D Team Lead here at Granulate. Happy to connect with you beyond this session on LinkedIn. We’ll have time for Q&A afterward. Feel free to ask anything in the chat throughout this session, and we’ll address it then.

Why is it important to optimize at scale?

Let’s start with why it’s important to optimize at scale. Optimization can reduce CPU usage, leading to cost reduction. SLA improvement provides better product quality, and at scale, optimization can have even more significant effects.

About profiling

Profiling is the dynamic inspection of running programs. You use it to aid in optimization or debugging. There are many types of profiling, for example, CPU profiling, which profiles the CPU execution time of code; memory profiling, which profiles memory usage, allocations, etc.; or networking, I/O, and more.

Reading profiling data: flame graphs

How do we read profiling data? One common way is flame graphs. Flame graphs are a way to visualize stack traces. We have the y-axis, which is the stack trace of called functions, and the x-axis, showing the relative samples that include this stack. Note that it’s not the number of calls to the function.

We’re sampling, not tracing. You can read more about flame graphs at the rbspy page (, and there’s very detailed information about them at Brendan Gregg’s website (–he’s the creator of the flame graph.

Our metrics stack

So let’s talk a bit about why we need our metrics stack.

Performance optimization requires us to get time-series metrics to our stack. In the middle of the metrics streams, we use Fluent to ingest data into TimescaleDB. The metrics come over TCP, so our input rate is very high. And we use TimescaleDB to store our metrics.

TimescaleDB is a time-series SQL database providing fast analytics and scalability, with automated data management as an extension of PostgresDB (for example, table partitioning for scale). In this session, we won’t dive deeper into TimescaleDB, but I would recommend you go ahead and look into it yourself.

What is the issue?

The issue is that, while the code worked and was very efficient at low scale, after a while, and when handling more and more instances or traffic, this part of the code proved to be a bottleneck. That’s why it’s very important to continuously monitor your environment and make sure it’s working as planned. Ideally, we’d also get an alert when something changes drastically or when something is a clear issue—all future features of Granulate’s continuous profiler.

How do we stream data?

We want to understand what is running on every server and how we stream data to TimescaleDB. Every server (of type c5.large) in our workload runs on a Go TCP server. Once data comes over the network load balancer (NLB), our Go servers write to a file. Fluent then collects the data, creates a batch, and immediately flushes it to TimescaleDB.

What is Fluentd

So let’s talk a little bit about Fluentd.

Fluentd is a technology that streams events with the ability to buffer data with a retry mechanism. You can even configure secondary output in the event that the write fails. To do that, you just need to configure input and output types and do a bit of tuning. Don’t forget to configure buffering for persistent data.

We developed our Fluentd output plugin to TimescaleDB, similar to TimescaleDB open-source outputs, but with our specific needs in mind. We use in_tail input to read events, which is very similar to tail -f.

Important considerations about Fluentd

To understand more about Fluentd, there are a few important things to consider: Tail input does not support multi-process workers because in_tail cannot be implemented with multi-processes, meaning Fluentd will run over one core. Output plugins, in addition, implement the write function. The input is a chunk, and chunk is a batch of records.

So how did we do it?

First of all, let’s see how our data looks. We write our data into a file in JSON format (as you can see here in this slide, which is an example with demo data), where we define table name in every record and table_name is the table to which we want to ingest our record. Fluentd creates and buffers chunk data by table_name.

We ran our continuous profiler in our workload production servers and immediately see a flame graph of our processes running at a high scale. You can see that the number of samples here, out of total samples, is 78.55% of CPU usage. So 78.55% of CPU usage over this period of time was Ruby frames, and most of this was on output of TimescaleDB, as you can see here. And here, in red, we can see the input tail stack at 22%.

Let’s try to understand what’s wrong in our TimescaleDB output. First of all, let’s filter all frames that contain the word “write,” as you can see here, indicated with the color blue. 

  1. The first frame we find points to Line 88; that’s the end of the write function definition.
  2. The second frame (called block in write), Line 87, points to the end of the connection “pool block.”
  3. The next frame, called block (2 levels) in write function, Line 81, is the end of the block of messages from the chunk iteration.
  4. And this next frame, called block (3 levels) in write function, Line 78, is the end of the block of values for every record iteration.
  5. And finally, the last frame, called block (4 levels) in write, Line 77, is just the parsing of every value in the chunk. This last frame, in addition, indicates that most of the CPU for Fluent is parsing every field in the dictionary.

Data format after

We decided to avoid any manipulation of the metrics in the Fluentd output. So we created a JSON format with two keys, as you can see in this example on the slide.

  1. table_name
  2. Data, where the data is a batch of events

You can also see our write function after a refactor. Just note that here we’ve appended every chunk of messages to a list, and then finally we join all the chunks with a comma.

After deploying our changes in production, we see amazing CPU improvements in our servers, reducing the CPU usage of Fluentd. You can see it in the profiler; there’s been a decrease from 78.55% to 13.61% of CPU usage. Just look at that… Go is now following our updates.

Use Case Follow Up

So what’s happening now? We’re getting data over some 100,000 servers, and we’re processing 1M metrics per second with a workload of 20 servers of type c5.large.

The next step is that once we have a workload running at this scale, we want to achieve better orchestration by Kubernetes cluster and stream data to kafka to improve data streaming.

Thank you very much to all for listening. Now over to the Q&A. If I don’t get to answer all your questions now, we’ll make sure to reach out via DM.

This is a transcript of The Scale Trap: How We Reduced Our CPU Utilization from 80% to 15% from DeveloperWeek Cloud. To start free profiling, visit

Optimize application performance.

Save on cloud costs.

Start Now
Back to blog