How CI/CD is Sidetracking Optimization, and What You Can Do About It
High-velocity code changes are making it impossible to optimize infrastructure. But not all is lost in the battle for improved performance.
Tom Amsterdam, Mar 8, 2021
“Can we add more people to get the job done?” was a question I heard frequently when I worked in construction. Now, in the world of tech, I hear about needing to “add more people” or “add more resources” all the time. For growing businesses rooted in technology, there will always be a time when your infrastructure will experience more load than it can handle, more users than it can serve, or more processing jobs than it can process. At this point, you’ll notice symptoms like application crawls, connection timeouts, jobs that take forever to complete, and, in some cases, being thrown offline. All such forms of performance degradation will be shown by your monitoring systems. The question is: Can you add more resources to get the job done? In other words, can you scale?
Scaling is the ability of IT resources to handle more or fewer workloads by either increasing or decreasing resources. Thus, scalability is a system’s ability to handle dynamic shifts in customers, requests, or workloads, but the path to achieving scalability is not simple. It involves a deep understanding of many variables (metrics) and responding appropriately when these variables change. In this post, you will learn why scaling is challenging, different strategies for scaling applications, and how to use the right techniques and metrics to scale your workloads.
If your application is running on a Virtual Machine (VM) or Kubernetes pods, there are two main methods you can use to create a multi-level scaling architecture: vertical scaling and horizontal scaling.
Vertical scaling is achieved by adding more resources to a system (usually more CPU, memory, or disk) or by upgrading to a more powerful server. I like to think of vertical scaling as similar to adding stories onto a building: If you need more rooms, you simply add floors to the existing building. This is known as scaling up. Vertical scaling is commonly used for applications that are difficult to distribute across multiple compute servers. Note that it is not limited to the instance hosting your workload, as you can also scale a pod vertically by adding more resources to the pod as needed.
Horizontal scaling, on the other hand, means exactly what it sounds like. Unlike vertical scaling, where you keep adding resources to an existing system, here you simply add more virtual machines or pods to your pool of resources. So, rather than a single powerful machine serving users, you have multiple smaller, connected machines serving requests or processing jobs. In horizontal scaling, you can scale in (remove nodes or pods) or out (add nodes or pods).
Interestingly, scaling applications is not just a matter of scaling up or out at will; there are many variables to consider. In the next paragraph, I’ll share why scaling decisions are hard to make.
Capacity planning is an old problem. It’s been around since before the era of cloud technology and won’t be going away anytime soon. The fact that businesses can’t predict the exact capacity they may need in the future makes scaling an interesting challenge. Over-provisioning could mean you’re burning more cash than you actually need, while under-provisioning could mean you’re under-serving your users and delivering a lower quality of service. Demand is hard to predict, and so is the capacity needed to meet it.
Applications in the 21st century are multi-tiered and exceedingly complex. A modern system relies on multiple services, technologies, and components to function. From the storage layer to the event broker that makes sure your jobs are processed asynchronously, each element may require a different scaling strategy. For example, some workloads are queue-based or I/O-bound, while others may be compute-bound. The metrics for scaling technology A are often different from those needed for scaling technology B. There are times when the most commonly used metrics won’t help you track everything you need to monitor or provide service at the level of quality your customers demand. You may have to turn to custom metrics, but how would you know which metrics to use and how to combine them?
Furthermore, scaling is not a one-time thing. It’s continuous, taking into account different behaviors and events over time. Depending on the situation, you may scale in or out, or you may need to do some warm-ups. Seasonal peaks and surges will also affect your scaling decisions, with each decision having a direct infrastructure cost and impact. The wrong scaling strategy could cost your business more than not scaling at all.
All of these factors combined make it hard to find one scaling metric that captures everything. There are some common strategies for scaling workloads, but these come with some drawbacks as well.
Businesses can approach scaling using a few common options. Here, I’ll cover two approaches: schedule-based and CPU-based scaling.
In schedule-based scaling, you define a schedule for when to scale up or down. This strategy is based on times or periods; that is, the metric is time. Depending on your traffic pattern, this strategy allows you to configure scaling rules so that more capacity is added during peak periods. For example, you may add more resources on business days and remove resources on weekends. Schedule-based scaling is fixed and a function of time and date.
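As a rough illustration of the business-day example, a schedule-based rule can be reduced to a function of the date alone. This is only a sketch: the function name `scheduled_capacity` and the two instance counts are assumptions made for the example, not values from any particular autoscaler.

```python
from datetime import datetime

# Hypothetical capacity levels; pick values that match your own traffic pattern.
BUSINESS_DAY_INSTANCES = 6
WEEKEND_INSTANCES = 2


def scheduled_capacity(now: datetime) -> int:
    """Return the desired instance count for the given moment.

    The only input is the calendar: Monday (0) through Friday (4)
    get the higher capacity, Saturday and Sunday the lower one.
    """
    if now.weekday() < 5:
        return BUSINESS_DAY_INSTANCES
    return WEEKEND_INSTANCES


if __name__ == "__main__":
    print(scheduled_capacity(datetime.now()))
```

Note that nothing in this function looks at actual load, which is exactly why the approach struggles with unexpected traffic.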
The problem with this approach is that real-world events are hard to predict. Something as simple as a social media post in the middle of the night could drive traffic to your app. Due to such unpredictability of traffic patterns, most people over-provision resources to be prepared for unexpected events, which usually leads to higher costs.
You can also automatically scale based on CPU usage, adding more instances to handle an increased workload whenever utilization surpasses a given threshold. Likewise, during a cool-down period, the autoscaler will remove instances to save costs after CPU usage drops back below the threshold.
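A threshold rule of this kind might be sketched as follows. The names and default values are assumptions for illustration. The sketch also uses two thresholds rather than one, a common way to keep the instance count from flapping when utilization hovers around a single value, and it treats the cool-down as a waiting period after the last scaling action.

```python
def cpu_scaling_decision(
    current_instances: int,
    cpu_utilization: float,          # fleet average, 0.0 to 1.0
    seconds_since_last_scale: float,
    scale_out_threshold: float = 0.70,  # assumed value
    scale_in_threshold: float = 0.30,   # assumed value
    cooldown_seconds: float = 300.0,    # assumed value
    min_instances: int = 1,
    max_instances: int = 10,
) -> int:
    """Return the new instance count under a simple CPU threshold policy."""
    # Inside the cool-down window, hold steady so the previous change
    # has time to take effect before we react again.
    if seconds_since_last_scale < cooldown_seconds:
        return current_instances
    if cpu_utilization > scale_out_threshold:
        return min(current_instances + 1, max_instances)
    if cpu_utilization < scale_in_threshold:
        return max(current_instances - 1, min_instances)
    return current_instances
```

For example, `cpu_scaling_decision(3, 0.85, 600)` asks for a fourth instance, while the same reading only 60 seconds after the last change leaves the count at three.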
CPU-based scaling is reactive in the sense that you scale only once you notice a lack of capacity. But there’s a slight delay before newly added resources are available and can serve traffic. This delay can degrade performance or service for end users and impact your SLA. An SLA of 99.9% uptime leaves only about 43 minutes of downtime in a 30-day month, so this impact can be considerable. Also, CPU utilization is not a yardstick for measuring your performance against your SLA and cannot replace it. Your goal should be to scale out based on the SLA you have with your customers.
When you have a scaling problem, you have a resource contention problem. It’s not only about the number of users or jobs but the demands being made on limited resources, which end up having to be shared. Considering the problem from this perspective lets you tackle it more effectively. To overcome a scaling bottleneck, you can either reduce the demands being made on your resources or increase their capacity. But since the goal of every organization is to grow, you are in reality left with only one option: increase capacity.
When it comes to increasing capacity, there is no one-size-fits-all metric. Therefore, you should scale dynamically using metrics that are relevant to the characteristics and performance patterns of each workload. When you scale dynamically based on workload type, you will not only be able to optimize costs but also utilize your infrastructure more effectively.
Finally, the ultimate reason for scaling is to provide reliable service and improve the customer experience. Customers don’t usually care about how you run your services, nor do they care about your CPU utilization or memory usage. What matters to them is being able to use your service quickly and your ability to stay true to your SLA. While external metrics are great, it’s important to know they don’t drive customer satisfaction. You need to understand your customers and their needs and create metrics that ensure those needs are met. Examples of such metrics include average response time, load averages, queue lag, processing time, and SLA thresholds.
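One way to act on a customer-facing metric is to scale in proportion to how far the observed value is from its target. The sketch below assumes a metric where higher means worse (response time in milliseconds, queue lag in seconds) and where adding replicas reduces it roughly linearly, which is a simplification. The function and parameter names are hypothetical. The proportional rule itself has the same shape as the one documented for the Kubernetes Horizontal Pod Autoscaler.

```python
import math


def desired_replicas(
    current_replicas: int,
    observed: float,   # e.g. current average response time in ms
    target: float,     # e.g. the response time your SLA implies
    max_replicas: int = 20,  # assumed upper bound
) -> int:
    """Scale the replica count by the ratio of observed to target metric."""
    if target <= 0:
        raise ValueError("target must be positive")
    desired = math.ceil(current_replicas * observed / target)
    # Never drop below one replica or exceed the configured ceiling.
    return max(1, min(desired, max_replicas))
```

With four replicas, a 300 ms target, and an observed 450 ms, the rule asks for six replicas; at 150 ms it would shrink the pool to two.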
It’s essential to note certain best practices that will put you on the right path when scaling.
Different components of a system may scale differently, and the more tightly coupled your architecture is, the harder it will be to scale. When you separate concerns, you’re able to scale each piece of your infrastructure independently.
Adding capacity takes time, as new instances don’t boot up and become available immediately. The more responsive your scaling configuration is, the better. If you need N capacity, it’s a best practice to provision N+1 capacity, with the additional node providing a buffer in case of a sudden spike.
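The N+1 rule is simple arithmetic once you have an estimate of per-instance capacity. In this sketch, requests per second stands in for whatever unit of load you measure, and all names are illustrative assumptions.

```python
import math


def instances_to_provision(
    expected_load_rps: float,
    capacity_per_instance_rps: float,
    spare_instances: int = 1,
) -> int:
    """Return N + spare, where N is the count needed for the expected load."""
    if capacity_per_instance_rps <= 0:
        raise ValueError("capacity_per_instance_rps must be positive")
    # N: round up, since a fraction of an instance cannot be provisioned.
    needed = math.ceil(expected_load_rps / capacity_per_instance_rps)
    return needed + spare_instances
```

If you expect 900 requests per second and each instance handles about 200, N is 5 and you would provision 6.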
One of the most essential factors in scaling is understanding a system’s behavior. Having access to relevant data and metrics that correlate with the health and performance of a system is therefore crucial, as they can help control autoscaling behaviors. Average, Minimum, Maximum, and Total are some of the statistics you can compute over a metric and base your scaling decisions on.
In this post, we’ve learned about two ways to scale applications: vertically and horizontally. We’ve also seen why scaling based on time or CPU alone is not effective and why you should scale based on workload types and your SLA. Finally, it’s important to note that scaling apps is not all fun and games. As Martin Kleppmann put it:
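Which of these you pick matters, because the same window of samples can tell different stories. The sketch below summarizes a window of response times. The sample values and the 500 ms threshold are made up for the example.

```python
from statistics import mean


def summarize(samples: list[float]) -> dict[str, float]:
    """Compute the four basic statistics over one window of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return {
        "average": mean(samples),
        "minimum": min(samples),
        "maximum": max(samples),
        "total": sum(samples),
    }


if __name__ == "__main__":
    # Hypothetical response times in milliseconds for one window.
    window = [100, 110, 105, 900]
    stats = summarize(window)
    threshold_ms = 500  # assumed threshold for illustration
    print("average breaches threshold:", stats["average"] > threshold_ms)
    print("maximum breaches threshold:", stats["maximum"] > threshold_ms)
```

In this window the average stays under the threshold while the maximum does not, so a rule keyed to the average would miss the one slow request that a customer actually experienced.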
“Building scalable systems is not all sexy roflscale fun. It’s a lot of plumbing and yak shaving.”
In the end, your goal should be to provide a reliable customer experience, and you should approach scaling exactly from that angle.