We’ve reached an age of compute abundance, and abundance leads to waste. This isn’t a personal reproach: most organizations using cloud infrastructure run at subpar utilization, wasting large portions of their IT budget on idle, oversized or non-productive cloud resources.
To put things in perspective, Gartner estimates global cloud spend for 2019 at around $206 billion, of which $39.5 billion will be spent on IaaS. Typically, two-thirds of IaaS resources are dedicated to compute, which is particularly vulnerable to underutilization.
That’s one of the reasons a recent RightScale State of the Cloud survey found that cloud cost optimization is the top initiative for companies in 2019 for the third year in a row, rising to 64% from 58% in 2018.
It’s worth noting that this is not a top IT initiative but a top company initiative, as cloud cost management is one of the few challenges that deeply affect not only engineering and IT teams but also the company’s bottom line.
Underutilization is on the rise due to the convergence of three parallel trends:
- a lack of governance,
- the imperative for developers to move fast, and
- overall, the increasing complexity of systems and functions.
As a result, cloud compute services often run at utilization as low as 10% to 40%. In addition, cloud users often underestimate how much of their cloud spend is wasted, placing it at 27% in 2019, whereas a Flexera survey measured actual waste at 35% in the absence of effective management.
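To get a feel for what the gap between perceived and measured waste means in money terms, here is a rough back-of-the-envelope sketch. It applies the two survey percentages to the Gartner IaaS figure quoted earlier; combining those two sources is an assumption made purely for illustration, and the function name `waste_estimates` is hypothetical.

```python
# Illustrative only: applies the survey waste rates quoted above to the
# Gartner 2019 IaaS figure. Combining the two sources is an assumption.
IAAS_SPEND_BN = 39.5      # Gartner 2019 IaaS estimate, in billions of USD
PERCEIVED_WASTE = 0.27    # waste rate as estimated by cloud users
MEASURED_WASTE = 0.35     # waste rate as measured in the Flexera survey


def waste_estimates(spend_bn, perceived_rate, measured_rate):
    """Return (perceived, measured, gap) wasted spend in billions of USD."""
    perceived = spend_bn * perceived_rate
    measured = spend_bn * measured_rate
    return perceived, measured, measured - perceived


if __name__ == "__main__":
    perceived, measured, gap = waste_estimates(
        IAAS_SPEND_BN, PERCEIVED_WASTE, MEASURED_WASTE
    )
    print(f"perceived waste: ${perceived:.1f}B")
    print(f"measured waste:  ${measured:.1f}B")
    print(f"underestimated:  ${gap:.1f}B")
```

Under these assumptions, the eight-point difference between 27% and 35% corresponds to roughly $3 billion of IaaS spend that users do not realize they are wasting.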
Manual, AIOps, CSEM
There are many ways to skin this cat.
Manual optimization by cloud consultants or engineers cannot keep up with frequent code changes. Performance engineers manually conduct point-in-time performance and cost analyses, but code is updated so frequently today that their recommendations quickly become outdated. With CI/CD processes in place, many organizations have at least a few deployments every day across their services. Also, manual optimization cannot take into account the impact of daily, monthly or annual seasonality on workloads.
Some optimization can be achieved through the use of advanced AIOps solutions. AIOps tools monitor IT performance both to identify anomalies and to consolidate incidents around a root cause. Taking a bird’s-eye view of the development of AIOps technology, Gartner predicts that within the next five years this category will mature to reach the “Act” stage, enabling it not only to alert but also to autonomously remediate and streamline operations.
Currently, however, the benefit of AIOps tools comes from watching cloud performance and then providing feedback for admins. This goes a long way towards reducing noise and providing causality, but still leaves the lion’s share of optimization to manual operations. The volume of parameters, changing IT environments, and frequency of code pushes make it impossible for a human to optimize a system in a useful timeframe.
Cloud Service Expense Management (CSEM) approaches the problem through cost control. Since the breadth of cloud services and frequent price changes make cloud usage challenging to track, CSEM normalizes results across platforms and maps cloud resources to specific owners and teams. This enables finance departments to allocate spend to particular products or business units. However, CSEM software doesn’t cover full cloud functionality and operations, often failing to support reserved and spot instances, chargebacks, ongoing rightsizing, etc.
Untapped Cloud Cost Optimization Potential
Implementing one or all of these approaches still leaves vast untapped optimization potential that goes beyond cloud and workload management solutions. Tapping this potential requires a completely new approach: real-time kernel and OS-level adaptations.
Low-level server optimization uses AI for continuous real-time adaptations of all the kernel and OS processes. Using AI to identify the optimal combination of relevant OS and kernel resources and parameter settings based on actual application demands is key to boosting an application’s performance. Research shows that low-level server changes can lead to an improvement in request-response time and increased throughput, which can translate in many cases to drastically reduced compute costs.
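The idea of searching for the best combination of parameter settings can be illustrated with a toy sketch. Everything here is an assumption: the parameter names in `SEARCH_SPACE` are hypothetical placeholders, not real kernel tunables, `measure_throughput` is a made-up response surface standing in for a real benchmark, and a plain exhaustive grid search stands in for the AI-driven, continuous search described above.

```python
import itertools
import random

# Hypothetical tunables; real kernel/OS parameters and valid ranges differ.
SEARCH_SPACE = {
    "scheduler_slice_ms": [1, 3, 6, 12],
    "net_backlog": [128, 512, 2048],
    "readahead_kb": [128, 512, 1024],
}


def measure_throughput(config, rng):
    """Stand-in for a real benchmark: synthetic surface plus noise."""
    score = 1000.0
    score -= 15.0 * abs(config["scheduler_slice_ms"] - 3)
    score += 0.05 * min(config["net_backlog"], 512)
    score -= 0.02 * abs(config["readahead_kb"] - 512)
    return score + rng.gauss(0, 1.0)


def search_best_config(samples_per_config=5, seed=0):
    """Try every combination and keep the one with the best mean score."""
    rng = random.Random(seed)
    keys = list(SEARCH_SPACE)
    best_config, best_score = None, float("-inf")
    for values in itertools.product(*(SEARCH_SPACE[k] for k in keys)):
        config = dict(zip(keys, values))
        # Average several noisy measurements to reduce measurement noise.
        total = sum(
            measure_throughput(config, rng) for _ in range(samples_per_config)
        )
        mean_score = total / samples_per_config
        if mean_score > best_score:
            best_config, best_score = config, mean_score
    return best_config, best_score


if __name__ == "__main__":
    config, score = search_best_config()
    print(config, round(score, 1))
```

A grid this small can be searched exhaustively. A real system has far more parameters and a workload that changes over time, which is why the article argues for an automated approach that adapts continuously instead of a one-off search.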
Optimizing the resources allocated in the kernel and OS in real time, along the efficient frontier between performance and cost, achieves a dramatic increase in utilization and throughput and a significant decrease in response time. The results are immediate and palpable: by enhancing server performance (increasing utilization) of existing cloud resources, compute costs can be driven down by more than half.
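The link between utilization and compute cost can be made concrete with a simplified model. This sketch assumes a fixed workload, identical instances, and cost proportional to instance count; the function names and the example numbers are hypothetical and are not taken from any measurement.

```python
import math


def instances_needed(demand_units, capacity_per_instance, utilization):
    """Instances required to serve a fixed demand at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    effective_capacity = capacity_per_instance * utilization
    return math.ceil(demand_units / effective_capacity)


def cost_reduction(demand_units, capacity_per_instance, util_before, util_after):
    """Fractional drop in instance count (a proxy for compute cost)."""
    before = instances_needed(demand_units, capacity_per_instance, util_before)
    after = instances_needed(demand_units, capacity_per_instance, util_after)
    return 1 - after / before


if __name__ == "__main__":
    # Hypothetical workload: 1000 units of demand, 100 units per instance.
    saving = cost_reduction(1000, 100, util_before=0.25, util_after=0.625)
    print(f"instance count reduced by {saving:.0%}")
```

In this toy model, raising utilization from 25% to 62.5% cuts the fleet from 40 instances to 16. Real savings depend on pricing, headroom requirements and how evenly the load can be spread.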
Cloud cost management is a challenge that isn’t going away anytime soon. In fact, Gartner foresees that “through 2020, 80% of organizations will overshoot their cloud IaaS budgets.” AI-driven real-time low-level server optimization takes the bull by the horns by maximizing performance as close as possible to the machine and creating the ultimate optimization effect that ripples across the full IT stack.