Mastering Performance Optimization

Performance optimization is the process of modifying a system so that it does its work more efficiently and effectively. Almost every layer of a modern system can be optimized, from the frontend application that runs on the client side to the invisible infrastructure that powers distributed systems.

There are two major goals for optimizing the performance of any system: speed and efficiency. We optimize either to make a system do more work per unit of time or to make it use available resources more efficiently.

Metrics Deep-Dive

There is one key element to proper observability: metrics. You can think of metrics as data that provide insights and visibility into the health and behavior of your application. Metrics are the basic values you need to correlate different factors, understand historic trends, and measure changes in consumption, performance, and error rates.

Without metrics, there is no visibility, and without visibility, performance optimization is impossible. When combined with alerting, metrics can be very powerful: You can configure rules that take action when a metric falls outside a given range, for example by triggering a notification or automatically adding more resources (autoscaling).
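To make the idea of a rule concrete, here is a minimal sketch of a threshold check. The class name, the metric name, the bounds, and the action names are all hypothetical; a real alerting system would send a notification or call an autoscaling API instead of returning a string.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """Hypothetical alert rule: act when a metric leaves the range [low, high]."""
    metric: str
    low: float
    high: float

    def evaluate(self, value: float) -> str:
        # Return the name of the action to take for this sample.
        if value > self.high:
            return "scale_up"
        if value < self.low:
            return "scale_down"
        return "ok"


# Assumed metric and bounds, chosen only for illustration.
rule = ThresholdRule(metric="requests_in_queue", low=5, high=100)
for sample in (50, 250, 1):
    print(rule.metric, sample, rule.evaluate(sample))
```

The rule itself stays simple on purpose. The hard part in practice is choosing bounds that match your objectives and SLAs.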

But knowing what metrics to collect can be challenging. The “right” metrics largely depend on your objectives and SLAs. There is no one-size-fits-all metric for all application types and workloads, but metrics can be grouped into the following three main categories: work, resource, and event.

Work Metrics

Work metrics indicate the health of a system based on its output. They show how well the system is doing its job in terms of throughput, success and error rates, and latency.
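As a rough illustration, the sketch below derives three work metrics from a handful of request records. The record layout, the sample values, and the function name are assumptions made for this example, not part of any particular monitoring tool.

```python
from statistics import median

# Hypothetical request records: (succeeded, latency in milliseconds).
requests = [(True, 120.0), (True, 95.0), (False, 480.0), (True, 110.0)]


def summarize_work(records, window_seconds):
    """Derive basic work metrics from one window of request records."""
    total = len(records)
    successes = sum(1 for ok, _ in records if ok)
    return {
        "throughput_per_second": total / window_seconds,
        "success_rate": successes / total if total else 0.0,
        "median_latency_ms": median(lat for _, lat in records) if total else 0.0,
    }


summary = summarize_work(requests, window_seconds=2)
print(summary)
```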

Resource Metrics

Applications rarely function with a single component. Instead, multiple components work together to make a system work. In a production system, these include low-level (CPU, memory, disk) and high-level components (database, third-party services).

Resource metrics target resources a system needs to do its job. They give a detailed picture of the system’s state in its entirety and are important for understanding saturation, errors, and availability.

Event Metrics

A live system generates events—actions or activities happening within the system. Event metrics map changes in the system’s behavior over time and are mostly used to create alarms, alerts, and notifications. For instance, you could fire an event whenever your system fails to process a customer payment.
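The payment example could be sketched as follows. The `EventLog` class, the event names, and the alert threshold of two failures are hypothetical; a production system would ship events to a monitoring backend rather than keep them in memory.

```python
import time
from collections import Counter


class EventLog:
    """Hypothetical in-memory event recorder."""

    def __init__(self):
        self.events = []

    def record(self, name, **details):
        # Store a timestamp so the event can be correlated with other metrics.
        self.events.append({"name": name, "timestamp": time.time(), **details})

    def counts(self):
        # Number of recorded events per event name.
        return Counter(event["name"] for event in self.events)


log = EventLog()
log.record("payment_failed", order_id="A-1001", reason="card_declined")
log.record("payment_failed", order_id="A-1002", reason="timeout")
log.record("deploy_finished", version="1.4.2")

# Assumed alert condition: two or more failed payments.
if log.counts()["payment_failed"] >= 2:
    print("alert: repeated payment failures")
```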

Common Performance Metrics

Common top-level performance metrics include uptime, memory, CPU utilization, response time, throughput, load averages, lead time, and error rates. In the following paragraphs, we’ll take a more detailed look at some of these.

Uptime

Uptime measures how long a system has been running and available without interruption. Expressed as a percentage of total time, it gives you the availability of the system. Although it’s unrealistic to attain 100% availability of an application in a production system, the goal is to maximize uptime and reduce downtime.
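Availability is usually derived from the length of an observation window and the downtime recorded within it. The function below is a minimal sketch of that arithmetic; the function name and the sample figures (a 30-day window with 43 minutes of downtime) are made up for illustration.

```python
def availability_percent(total_seconds: float, downtime_seconds: float) -> float:
    """Share of the observation window during which the system was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= downtime_seconds <= total_seconds:
        raise ValueError("downtime_seconds must be between 0 and total_seconds")
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds


# Hypothetical window: 30 days with 43 minutes of downtime.
thirty_days = 30 * 24 * 60 * 60
print(round(availability_percent(thirty_days, 43 * 60), 3))
```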

CPU Utilization

This estimates the amount of work carried out by a CPU, allowing you to determine how much of the resource your application uses and to gauge the system’s performance. CPU utilization varies with the type and number of computing tasks being managed and with the type of application you’re running. Some tasks and applications require heavy CPU usage, while others need very little CPU. For instance, API gateways require higher CPU usage.

Note: This is probably one of the most misunderstood metrics and can be misleading. We’ll discuss why in a later section.

Throughput

Throughput represents the amount of work a system handles per unit of time and is often tracked as requests per minute (RPM). Monitoring throughput gives DevOps teams an idea of how work flows through the system and how newly installed features are affecting its ability to handle requests. A drop in throughput can indicate a bottleneck that is preventing consistent delivery of results.
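One simple way to compute RPM is to group request timestamps into one-minute buckets. The sketch below assumes the timestamps are epoch seconds taken from an access log; the function name and the sample data are hypothetical.

```python
from collections import Counter


def requests_per_minute(timestamps):
    """Count requests in each one-minute bucket (timestamps in epoch seconds)."""
    return Counter(int(ts // 60) for ts in timestamps)


# Hypothetical request timestamps, in seconds since the start of the window.
sample = [0, 10, 59, 60, 61, 125]
rpm = requests_per_minute(sample)
for minute in sorted(rpm):
    print(f"minute {minute}: {rpm[minute]} requests")
```

Comparing these buckets before and after a release is one way to see whether a new feature has changed the system’s ability to handle requests.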

R&D teams track the above metrics to understand what’s going on in their systems and to identify which components are consuming which resources. Errors can then be traced to their root cause and resolved more easily.

Large systems generate tons of metrics every day, so you need to know which metrics are relevant to performance. In the subsequent sections, you’ll learn about metrics for different workloads that provide better visibility and clarity on how your system is performing.

Collecting the Right Performance Metrics

Because each application and workload is unique, it’s impractical and ineffective to collect the same metrics for every system. That’s why you need a framework that defines what to track when monitoring, say, messaging and streaming applications, application server components, or a database server.

There are two famous frameworks used for monitoring: the four golden signals of monitoring explained in the highly influential Google Site Reliability Engineering book and the USE Method.

While we will not delve into the golden signal approach in this post, we will discuss the USE Method briefly and show how it can be applied to database workloads.

The USE framework was originally developed by Brendan Gregg to track utilization, saturation, and errors for every resource, including all the functional components of a physical server (busses, disks, CPUs, etc.). This provides an understanding of:

Utilization: the average time a resource is busy doing work
Saturation: the extent to which the resource has more work than it can service or handle, often queued
Errors: the frequency of errors the resource generated

Let’s assume you have been tasked with optimizing the performance of a database server. Instead of following common anti-patterns, like changing things randomly until the problem goes away, you can use the USE Method to collect the right metrics, which will in turn help you gain visibility into the server.

The performance of a database server deteriorates when it has more work than it can process at a given time. Incoming queries are then queued until the database has capacity to process them. To detect this, you collect saturation metrics, such as the number of queued queries, alongside throughput. A database server also depends on low-level resources, like disks, that can fill up or become corrupted. For these, you can measure resource utilization.

Of course, no system is immune to errors. Database operations generate error events when they fail, and tracking the number of errors generated is a good way to know when a database server operation is failing.
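The database walk-through above can be sketched as a small data structure that holds one USE reading per resource. Everything here is an assumption for illustration: the class and field names, the 80% utilization threshold, and the sample values. The USE Method itself does not prescribe specific thresholds.

```python
from dataclasses import dataclass


@dataclass
class UseSnapshot:
    """One USE reading for a single resource (hypothetical layout)."""
    resource: str
    utilization: float  # fraction of time the resource was busy, 0.0 to 1.0
    saturation: int     # queued work, for example waiting queries
    errors: int         # error events in the sampling window

    def findings(self, max_utilization: float = 0.8) -> list[str]:
        # Check the three USE dimensions in order and collect any concerns.
        issues = []
        if self.utilization > max_utilization:
            issues.append("high utilization")
        if self.saturation > 0:
            issues.append("work is queuing")
        if self.errors > 0:
            issues.append("errors reported")
        return issues


# Made-up readings for a database server's disk and query executor.
snapshots = [
    UseSnapshot("db-disk", utilization=0.92, saturation=14, errors=0),
    UseSnapshot("db-query-executor", utilization=0.40, saturation=0, errors=3),
]
for snap in snapshots:
    print(snap.resource, snap.findings() or ["no issues found"])
```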

Why CPU Utilization Is Misleading

One of the most commonly seen performance metrics is CPU utilization, which is considered one of the most essential measurements for evaluating system performance. This is why scaling strategies are often based on CPU usage, where the decision to scale up or down depends on whether CPU utilization crosses certain thresholds.

The problem with CPU utilization is that it does not measure how busy a processor actually is. Instead, it reports busy time combined with stalled time, when the processor is waiting rather than executing instructions. In a production system, relying on CPU utilization alone to track performance can therefore be misleading, because much of that time may be spent waiting on memory I/O.

For instance, when a CPU is running at 90% utilization, it’s assumed to be busy. In reality, the CPU could be stalled for 50% of the time, waiting for memory reads and writes to complete, and doing real processing work for only 40%.

PMCs & IPC: A Better Bet

Understanding how much of your CPU is stalled (not making progress) can help you make better performance-tuning decisions. Rather than collecting CPU utilization metrics, tracking Performance Monitoring Counters (PMCs) and Instructions Per Cycle (IPC) can help you better understand your system’s performance.

PMCs are hardware counters built into the processor that count events occurring at the CPU level, such as the number of accesses to off-chip memory, the instructions and cycles a program executed, and the associated cache misses. Tracking these events gives you insight into how your code behaves on the hardware. This is one reason PMCs are considered a valuable tool for debugging and optimizing application performance.

Instructions Per Cycle (IPC) is a metric derived from PMCs. It gives you the average number of instructions executed per clock cycle, a measurement that helps you understand how much work a CPU completes in a single cycle. An IPC of less than 1 likely indicates that the workload is memory stalled, meaning you should reduce memory I/O and improve memory locality and CPU caching. You would also benefit from tuning your hardware, such as using faster memory, busses, and interconnects.

Conversely, an IPC greater than 1 likely means your workload is heavy on CPU, so optimizing your code to reduce its execution time will be the better performance-tuning decision.
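The rule of thumb above is easy to express in code. In this sketch the instruction and cycle counts are made-up numbers; on Linux, real values can be read from PMCs with a tool such as `perf stat`, which reports both counters. The function names and the wording of the verdicts are assumptions for illustration.

```python
def instructions_per_cycle(instructions: int, cycles: int) -> float:
    """Average number of instructions retired per clock cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def interpret_ipc(ipc: float) -> str:
    # Apply the rule of thumb from the text: below 1 suggests memory stalls,
    # above 1 suggests a CPU-heavy workload.
    if ipc < 1.0:
        return "likely memory stalled: improve memory locality and caching"
    if ipc > 1.0:
        return "likely CPU heavy: reduce code execution time"
    return "inconclusive: look at other counters"


# Hypothetical counter readings for two workloads.
for label, instructions, cycles in (
    ("batch-job", 4_200_000_000, 9_000_000_000),
    ("encoder", 12_000_000_000, 6_000_000_000),
):
    ipc = instructions_per_cycle(instructions, cycles)
    print(f"{label}: IPC={ipc:.2f} -> {interpret_ipc(ipc)}")
```

Treat the threshold of 1 as a starting point, not a hard limit. The text above only says that each side of it is a likely sign.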

Both tools go a step further in helping you understand what’s happening in your CPU and allow you to effectively optimize your system or application.

Wrapping Up

In this article, we explained different metrics that help provide visibility into a system’s performance. Monitoring these key metrics can help you optimize a system’s health and functionality. For different workloads and applications, it is important to track the right performance metrics, which we introduced in this article.

While CPU utilization is one of the top metrics in most performance monitoring tools, it does not measure how busy a processor is. Better alternatives like PMCs and IPC give you a clearer picture of CPU utilization and guide you in making better performance-tuning decisions.

Organizations need to constantly manage and optimize the performance of their systems, and the right metrics are necessary to gain the proper insight to make these decisions.