Cloud Optimization Best Practices for AWS, Azure, and GCP

What is Cloud Optimization?

Cloud optimization is the process of eliminating wasted cloud resources by selecting, configuring, and tuning resources for specific workloads.

For DevOps teams, cloud optimization means determining the most efficient way to allocate cloud resources, with the goal of reducing waste while improving performance in the cloud.

Depending on the application and specific business goals, organizations might define cloud optimization in different ways. But the common denominator of most cloud optimization strategies is the need to understand how the cloud architecture works today and what needs to improve to get the most out of cloud investments.

In this article:

Why is Cloud Optimization Important?
4 Types of Cloud Optimization
Cloud Optimization with Well-Architected Frameworks
Cloud Optimization with Auto Scaling
First-party Cloud Optimization Tools

Why is Cloud Optimization Important?

Cloud optimization was traditionally the responsibility of IT operations teams. They are often caught between the requirements of finance departments who want to control cloud spending, and the needs of application owners who want their applications to have sufficient resources.

FinOps is an evolving cloud financial management discipline and cultural practice that enables organizations to get maximum business value by helping engineering, finance, technology, and business teams to collaborate on data-driven spending decisions. FinOps teams, which often work closely with cloud operations (CloudOps) teams, can help reconcile these conflicting requirements by maximizing the cost-effectiveness of the cloud while delivering optimal application performance for end users.

4 Types of Cloud Optimization

1. Cloud Cost Optimization

Cost reduction is typically the most important aspect of cloud optimization. A common problem that results in over-spending is allocating more resources to workloads than are actually needed. This problem is made worse by the fact that cloud pricing models are very complex—service charges vary by region, time, instance type, and deployment model.
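
To make the over-allocation problem concrete, here is a minimal sketch that estimates how much of a monthly bill goes to capacity a workload never uses. The workload names, the flat per-vCPU price, and the 20% headroom are all hypothetical assumptions chosen for illustration; real cloud pricing varies by region, instance type, and purchase model.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation: 8,760 hours per year / 12


@dataclass
class Workload:
    name: str
    provisioned_vcpus: int      # vCPUs currently allocated
    peak_vcpus_used: float      # highest observed vCPU demand
    price_per_vcpu_hour: float  # hypothetical flat rate in USD


def monthly_waste(w: Workload, headroom: float = 0.2) -> float:
    """Estimate monthly spend on vCPUs beyond peak demand plus headroom."""
    needed = w.peak_vcpus_used * (1 + headroom)
    excess = max(0.0, w.provisioned_vcpus - needed)
    return excess * w.price_per_vcpu_hour * HOURS_PER_MONTH


# Hypothetical workloads, not real measurements.
workloads = [
    Workload("checkout-api", provisioned_vcpus=16, peak_vcpus_used=5.0,
             price_per_vcpu_hour=0.04),
    Workload("batch-reports", provisioned_vcpus=8, peak_vcpus_used=7.5,
             price_per_vcpu_hour=0.04),
]

for w in workloads:
    print(f"{w.name}: ${monthly_waste(w):.2f} estimated monthly waste")
```

In this sketch the first workload is heavily over-provisioned, while the second has no excess once headroom is accounted for. The same calculation could be extended to memory or storage.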

One way to address this problem is to use native cost monitoring tools provided by cloud providers, such as AWS Cost Optimization Monitor and Azure cost alerts. These tools can help an organization understand where it spends the most, and warn operational staff about overspending.

However, cloud providers have a vested interest in users spending as much as possible on their platform by consuming more services. They also do not provide tools that work across hybrid or multi-cloud environments. A successful cloud optimization strategy should not rely only on these native tools, but should combine different approaches and technologies, including third-party offerings that operate across different cloud architectures.

2. Cloud Performance Optimization

Performance optimization is the process of configuring and tuning applications and services so that they run as fast as possible.

Performance is a complex requirement that depends on many factors. An important aspect to consider is the design of the cloud architecture. For example, architectures that require frequent transfer of data between cloud regions or cloud providers can suffer from performance degradation. This is typically due to network bottlenecks and latency.

The type of cloud service you choose may also affect performance. Depending on the type of workload, a virtual machine (VM) with a fixed resource allocation may constrain the application. In this case, a serverless deployment, which allocates resources on demand, could provide better performance than a standard VM.

Even if your code does not directly interact with cloud components, the underlying efficiency of the code can have a significant impact on performance. The more efficiently applications run in a cloud environment, the fewer resources they will consume and, by extension, the lower the cost. Regularly test the performance of new application code before deployment.

3. Cloud Reliability Optimization

Cloud-based workloads can become unusable due to failures in the cloud hosting them or problems with the workload itself, most commonly application issues. Mitigate these risks to maximize application reliability.

Redundancy is an important strategy for optimizing reliability. Deploying multiple instances of the same workload in different clouds, or in different regions of the same cloud, improves resilience. However, this type of protection can significantly increase costs. For best overall results, you need to balance your redundancy goals with your cost optimization goals.
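
The trade-off between redundancy and cost can be illustrated with a short calculation. The sketch below assumes that each replica is available 99% of the time and that replicas fail independently; both are simplifying assumptions, since real regions and clouds share some failure modes. It also assumes cost grows linearly with the number of replicas.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(single: float, replicas: int) -> float:
    """Availability of N replicas when any one of them can serve traffic.

    Assumes independent failures, which real deployments only approximate.
    """
    return 1 - (1 - single) ** replicas


# 0.99 is a hypothetical per-replica availability, not a provider SLA.
for replicas in (1, 2, 3):
    availability = combined_availability(0.99, replicas)
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{replicas} replica(s): availability {availability:.6f}, "
          f"about {downtime_min:.0f} min/year of downtime, "
          f"{replicas}x base cost")
```

Each added replica cuts expected downtime sharply but adds a full unit of cost, so the gain per dollar shrinks as you add more. That is why redundancy targets should be set alongside cost targets.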

4. Cloud Sustainability Optimization

As environmental sustainability concerns grow and vendors aim to be “green”, organizations are starting to be concerned about the carbon footprint of cloud workloads. To minimize the carbon footprint, an organization must understand the impact of architecture and configuration on the amount of energy consumed by IT workloads.

For organizations looking to build a greener cloud, improving sustainability goes hand in hand with other types of cloud optimization. For example, by optimizing performance, a cloud architecture uses fewer resources and therefore consumes less energy and produces fewer carbon emissions.
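
As a rough illustration of how resource usage translates into carbon footprint, the sketch below multiplies the energy a workload draws by the carbon intensity of the electricity grid. Every input is a hypothetical assumption: the power draw, the data center overhead factor (PUE), and the grid intensity would need to come from your provider's reporting or your own measurements.

```python
def estimated_emissions_kg(avg_power_watts: float, hours: float,
                           pue: float, grid_intensity_g_per_kwh: float) -> float:
    """Rough operational emissions in kg of CO2-equivalent.

    energy (kWh) = power (kW) * hours * PUE
    emissions    = energy * grid carbon intensity
    """
    energy_kwh = avg_power_watts / 1000 * hours * pue
    return energy_kwh * grid_intensity_g_per_kwh / 1000


# Hypothetical workload before and after performance tuning.
before = estimated_emissions_kg(200, 730, pue=1.2, grid_intensity_g_per_kwh=400)
after = estimated_emissions_kg(120, 730, pue=1.2, grid_intensity_g_per_kwh=400)
print(f"before: {before:.2f} kg CO2e/month, after: {after:.2f} kg CO2e/month")
```

Under these assumptions, cutting average power draw reduces estimated emissions in direct proportion, which is the link between performance optimization and sustainability described above.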

Cloud Optimization with Well-Architected Frameworks

Understanding the need for cloud optimization, and knowing that it can be a complex undertaking, the major cloud providers have prepared what is known as “Well-Architected Frameworks” for their respective cloud ecosystems.

A Well-Architected Framework is a blueprint that provides design principles and best practices to help build cloud optimization into your cloud deployment from the start. These frameworks are different for each cloud provider, because they address each provider's specific service structure, pricing models, and automation capabilities.

Let’s review the Well-Architected Frameworks provided by the three biggest cloud providers: AWS, Microsoft Azure, and Google Cloud.

AWS Well-Architected Framework

The AWS Well-Architected Framework helps you understand architectural best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the AWS cloud. It provides a way to consistently measure your architecture against best practices and identify areas for improvement.

Amazon’s Well-Architected Framework is built on six pillars:

Operational excellence—the ability to run workloads effectively, gain visibility over operations, and continuously improve them to provide business value, while supporting high velocity development.
Security—the ability to leverage cloud technology to protect sensitive data, resources, and assets running in the cloud and create a strong security posture.
Reliability—the ability of workloads to perform their function correctly when needed, even in the face of unexpected events. This includes the ability to operate and test each workload effectively throughout its lifecycle.
Performance efficiency—the ability to use cloud-based resources efficiently to meet each workload’s requirements, and ensure workloads provide the optimal performance within resource constraints, even as technology and business requirements evolve.
Cost optimization—the ability to run systems at the lowest possible cost while still delivering the required business value.
Sustainability—the ability to reduce energy consumption and improve efficiency of a cloud deployment to minimize the impact on the environment.

Microsoft Azure Well-Architected Framework

The Azure Well-Architected Framework is a set of guiding principles you can use to improve the quality and efficiency of workloads running in Azure. It consists of five pillars:

Reliability—the ability for cloud workloads to recover from failure and continue operating even under adverse conditions.
Security—the ability to protect sensitive applications and data from threats, leveraging the security capabilities of the Azure cloud and following its security best practices.
Cost optimization—the ability to manage costs to maximize the value delivered from an organization’s investment in Azure.
Operational excellence—the ability to develop efficient and effective processes that keep cloud systems running in production.
Performance efficiency—the ability of cloud workloads to adapt to changes in load, leveraging Azure’s cloud automation capabilities.

By integrating these pillars into your Azure deployment, you can achieve a reliable, efficient, and cost effective cloud architecture.

Google Cloud Architecture Framework

The Google Cloud Architecture Framework provides advice, recommendations, and practical best practices. The objective is to help architects, developers, administrators, and other cloud professionals design and operate secure, efficient, resilient, high-performance, and cost-effective cloud topologies in the Google Cloud Platform (GCP).

The Google Cloud Architecture Framework is organized into six pillars:

System design—describes Google Cloud products and features that support effective system design, defining the architecture, components, modules, interfaces, and data needed to meet cloud system requirements.
Operational excellence—describes how to efficiently deploy, operate, monitor, and manage workloads in Google Cloud.
Security, privacy, and compliance—describes how to maximize security of data and workloads, design a cloud architecture with privacy in mind, and comply with regulatory requirements and standards.
Reliability—describes how to design and operate workloads with elastic scalability and high resiliency to adverse conditions.
Cost optimization—describes how to maximize the business value of Google Cloud investment.
Performance optimization—describes how to design and tune cloud resources for optimal performance.

Cloud Optimization with Auto Scaling

Auto scaling is a critical component of any cloud optimization strategy. Let’s see how each of the major cloud providers supports auto scaling to help you automatically match resources to workload requirements.

AWS and EC2 Auto Scaling

AWS Auto Scaling is an AWS service that lets you optimize performance and reduce costs across multiple cloud resources. It lets you scale collections of resources to support applications, while supporting application reliability.

AWS Auto Scaling features include:

Amazon EC2 scaling—launching or terminating Amazon EC2 instances that belong to an EC2 Auto Scaling group.
Spot Fleets—a Spot Fleet consists of multiple Spot Instances running the same workload. Spot Instances are offered at discounts of up to 90%, but they can be terminated at short notice. If this happens, the fleet launches replacement instances to maintain its target capacity.
ECS scaling—adjusting Elastic Container Service (ECS) services to the required number of containers in response to application load.
DynamoDB tables and global secondary indexes—automatically adjusting the provisioned read and write capacity of DynamoDB tables and their global secondary indexes, increasing throughput as traffic grows to avoid throttling.
Amazon Aurora—automatically adjusting the number of Aurora read replicas in an Aurora database cluster, scaling them up and down to meet application loads.
Resource discovery—automatically discovering scalable cloud resources that make up an application and smartly grouping them into collections.
Scaling strategies—there are three built-in strategies: optimize performance, optimize costs, and balance performance/costs. Alternatively, you can define your own custom scaling policies.
Predictive scaling—using machine learning to analyze historic workload behavior and perform scaling events in anticipation of future application load.
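
To show what a scaling policy does in practice, here is a minimal sketch of proportional, target-based scaling: capacity is adjusted so that a metric such as average CPU utilization returns to a target value. This illustrates the general idea only. It is not AWS's actual algorithm, and the function and parameter names are hypothetical.

```python
import math


def target_tracking_capacity(current_capacity: int, metric_value: float,
                             target_value: float,
                             min_size: int, max_size: int) -> int:
    """Estimate the capacity that brings the metric back to its target.

    Scales capacity in proportion to how far the metric is from the target,
    rounds up to avoid under-provisioning, and clamps to the group limits.
    """
    if current_capacity <= 0 or target_value <= 0:
        raise ValueError("current_capacity and target_value must be positive")
    desired = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, desired))


# Hypothetical group: 4 instances, 50% CPU target, limits of 2 to 12.
print(target_tracking_capacity(4, 75.0, 50.0, min_size=2, max_size=12))   # scale out
print(target_tracking_capacity(4, 20.0, 50.0, min_size=2, max_size=12))   # scale in
print(target_tracking_capacity(10, 90.0, 50.0, min_size=2, max_size=12))  # capped
```

A production policy would also need cooldown periods and smoothing of the metric, so that short spikes do not trigger repeated scaling events.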

Azure Auto Scaling

Azure lets you automatically scale resources to maintain the desired performance levels and service-level agreements (SLAs) of your workloads, adding cloud resources to increase capacity or deallocating resources to save costs.

Azure auto scaling capabilities include:

Scaling for Azure Virtual Machines (VMs)—defining virtual machine scale sets (VMSS), which manage and scale Azure VMs as a group.
Service Fabric—a distributed systems platform for managing microservices and containers, which supports auto scaling through VMSS. Every node type in a Service Fabric cluster is defined as its own VMSS. This allows each node type to be scaled independently.
Scaling Azure App Service—automated scaling for web applications hosted in Azure, with consistent auto scale settings that can be applied to all of the apps within App Service.
Scaling for individual Azure services—Azure Cloud Services provides built-in auto scaling, which can be controlled at the role level.
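
Rule-based auto scaling of this kind can be sketched as follows. The example follows the behavior Azure documents for autoscale settings with multiple rules: scale out when any scale-out rule is met, and scale in only when all scale-in rules are met. The class, function, and metric names are hypothetical, and the sketch ignores cooldown periods and always changes capacity by one instance.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str       # hypothetical metric name, e.g. "cpu"
    threshold: float
    direction: str    # "out" to add instances, "in" to remove them


def evaluate(metrics: dict, rules: list, current: int,
             minimum: int, maximum: int) -> int:
    """Return the new instance count after evaluating all rules once."""
    out_rules = [r for r in rules if r.direction == "out"]
    in_rules = [r for r in rules if r.direction == "in"]

    # Scale out if ANY scale-out rule is triggered.
    if any(metrics[r.metric] > r.threshold for r in out_rules):
        return min(maximum, current + 1)
    # Scale in only if ALL scale-in rules are triggered.
    if in_rules and all(metrics[r.metric] < r.threshold for r in in_rules):
        return max(minimum, current - 1)
    return current


rules = [
    Rule("cpu", 70, "out"), Rule("queue", 100, "out"),
    Rule("cpu", 30, "in"), Rule("queue", 20, "in"),
]
print(evaluate({"cpu": 85, "queue": 10}, rules, current=3, minimum=2, maximum=5))
print(evaluate({"cpu": 20, "queue": 50}, rules, current=3, minimum=2, maximum=5))
print(evaluate({"cpu": 20, "queue": 5}, rules, current=3, minimum=2, maximum=5))
```

The asymmetry is deliberate: adding capacity too eagerly costs a little money, while removing it too eagerly can degrade performance.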

Google Cloud Auto Scaling

Google Compute Engine (GCE) provides its own auto scaling mechanism. You can define a managed instance group (MIG), similar to an EC2 Auto Scaling Group or Azure VMSS, and automatically add or remove VMs from the group based on application load. You can auto scale a MIG based on:

CPU utilization
Any metric from Google Cloud Monitoring
Predefined schedules
Load balancing capacity—this lets you define the serving capacity of an instance in the load balancer, and use this information to scale instances according to utilization or requests per second.
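
When a MIG uses more than one of these signals, the autoscaler computes a recommended size for each signal and scales to the largest one. The sketch below illustrates that selection with a CPU signal and a schedule signal. The helper names, the schedule format, and the numbers are hypothetical, and the CPU formula is a simplified proportional estimate rather than Google's exact calculation.

```python
import math


def cpu_signal(current_size: int, avg_cpu_utilization: float,
               target_utilization: float) -> int:
    """VMs needed so that average CPU utilization falls to the target."""
    return math.ceil(current_size * avg_cpu_utilization / target_utilization)


def schedule_signal(hour: int, schedules: list) -> int:
    """Minimum VMs required by any schedule active at the given hour."""
    return max((s["min_instances"] for s in schedules
                if s["start"] <= hour < s["end"]), default=0)


def recommended_size(signals: list, min_size: int, max_size: int) -> int:
    """Pick the largest recommendation, clamped to the group limits."""
    return max(min_size, min(max_size, max(signals)))


# Hypothetical schedule: keep at least 6 VMs during business hours.
schedules = [{"start": 8, "end": 18, "min_instances": 6}]

for hour in (9, 22):
    signals = [cpu_signal(5, 0.45, 0.60), schedule_signal(hour, schedules)]
    print(f"hour {hour}: signals={signals}, "
          f"size={recommended_size(signals, min_size=2, max_size=10)}")
```

Taking the maximum means the group is always large enough to satisfy every signal at once.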

First-party Cloud Optimization Tools

All major cloud providers offer first-party tools that support cloud optimization, most of them free of charge. While these tools are useful and an important part of any cloud optimization strategy, they are typically not enough.

You can use these first-party tools to create an initial baseline of visibility and workload optimization. To achieve holistic, fully automated optimization across a hybrid cloud or multi-cloud environment, however, organizations typically turn to dedicated cloud optimization solutions.

AWS Trusted Advisor

AWS Trusted Advisor is an automated tool that provides guidance on best practices for Amazon services. Trusted Advisor examines your AWS environment and makes recommendations to help you reduce costs, improve system availability and performance, or close security gaps.

An important feature of AWS Trusted Advisor is its support for right sizing. This is the process of matching an instance type and size to the performance and capacity requirements of a workload at the lowest possible cost. It also involves looking at deployed instances and finding opportunities to eliminate or scale down instances without compromising business requirements.

Trusted Advisor can automatically create a sizing plan, help you identify opportunities for optimization and tag instances to carry out effective resizing. However, it does not automatically resize instances, and still requires manual work.
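
The right sizing logic described above can be sketched in a few lines: given a workload's peak requirements, choose the cheapest instance type that still covers them with some headroom. The catalog below is entirely hypothetical (the names, sizes, and prices are not real AWS instance types), and this is not how Trusted Advisor itself computes recommendations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: float
    hourly_usd: float


# Hypothetical catalog for illustration only.
CATALOG = [
    InstanceType("small", 2, 4, 0.05),
    InstanceType("medium", 4, 8, 0.10),
    InstanceType("large", 8, 16, 0.20),
    InstanceType("xlarge", 16, 32, 0.40),
]


def right_size(peak_vcpus: float, peak_memory_gib: float,
               headroom: float = 0.2) -> Optional[InstanceType]:
    """Return the cheapest type covering peak demand plus headroom."""
    need_cpu = peak_vcpus * (1 + headroom)
    need_mem = peak_memory_gib * (1 + headroom)
    candidates = [t for t in CATALOG
                  if t.vcpus >= need_cpu and t.memory_gib >= need_mem]
    # None means no type in the catalog is large enough.
    return min(candidates, key=lambda t: t.hourly_usd, default=None)


# A workload on "xlarge" that peaks at 2.5 vCPUs and 6 GiB of memory.
choice = right_size(peak_vcpus=2.5, peak_memory_gib=6.0)
print(choice.name if choice else "no suitable type")
```

In practice, the inputs would come from several weeks of monitoring data, and the final resize would still be a manual or separately automated step.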

AWS Graviton Processor

AWS Graviton processors are central processing units (CPUs) designed by AWS to provide the best price/performance for cloud workloads running on Amazon EC2.

AWS Graviton2 provides significant performance and feature improvements over the first-generation Graviton processor. Amazon provides Graviton2-based instances supporting general-purpose, burstable, compute-optimized, memory-optimized, storage-optimized, and accelerated compute workloads.

AWS Graviton3, a later generation in the series, provides up to 25% better compute performance, up to 2x higher floating-point performance, and up to 2x faster performance for cryptographic workloads compared to the Graviton2 processor.

Azure Advisor

Azure Advisor is a free Azure service that helps you optimize your Azure resources for high availability, security, performance, and cost. Advisor scans resource usage and configuration and provides over 100 personalized recommendations. Each recommendation includes inline actions that let you quickly apply the suggested fix.

Like AWS Trusted Advisor, Azure Advisor can also support automated identification and planning of cloud resource resizing, but requires manual work to perform the actual resizing tasks.

Azure Architecture Center

Azure Architecture Center is a collection of free guides written by Azure experts to help you understand organizational and architectural best practices and optimize your workloads. These guides are especially useful when designing new workloads for the cloud or migrating existing workloads from on-premises environments to the cloud.

The Azure Architecture Center guides cloud adoption and strategy by providing recommended architectures and best practices for common scenarios. These include AI, IoT, microservices, SAP business applications, and web applications.

GCP Resource Usage Optimization

Google provides the Resource Usage Optimization Recommender, which is similar to the Advisor tools in AWS and Azure. It helps you balance cost and performance in your environment.

For example, the Recommender helps identify unused resources, points to better services for deploying existing applications, and can suggest using custom VMs. Custom VMs are a Google Cloud feature that lets you build your own instance type to suit your specific workload requirements.

The Recommender also helps detect under- or over-provisioned VM instances and idle resources, providing recommendations for right sizing. It supports decentralized optimization, helping individual application owners optimize their own workloads, because they can usually make the best decisions about terminating or resizing resources.
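
The kind of classification described above can be illustrated with a simple threshold check on utilization data. The thresholds and VM names below are hypothetical assumptions chosen for illustration; they are not the criteria Google's Recommender actually uses.

```python
def classify(avg_cpu: float, peak_cpu: float) -> str:
    """Label a VM from its CPU utilization (fractions between 0 and 1)."""
    if peak_cpu < 0.05:
        return "idle"               # candidate for termination
    if peak_cpu < 0.40:
        return "over-provisioned"   # candidate for a smaller size
    if avg_cpu > 0.80:
        return "under-provisioned"  # candidate for a larger size
    return "ok"


# Hypothetical utilization data: (average CPU, peak CPU) per VM.
fleet = {
    "vm-a": (0.01, 0.03),
    "vm-b": (0.10, 0.25),
    "vm-c": (0.90, 0.99),
    "vm-d": (0.50, 0.70),
}
for name, (avg_cpu, peak_cpu) in fleet.items():
    print(f"{name}: {classify(avg_cpu, peak_cpu)}")
```

A report like this gives each application owner a short list to review, which fits the decentralized approach: the owner decides whether to terminate or resize.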
