Software engineering organizations in the cloud have prioritized customer experience and security, often at the expense of cost efficiency. As cloud infrastructure becomes increasingly complex, cloud costs can quickly spiral out of control if not managed properly.
Companies need to provide secure and resilient services while also ensuring that their cloud resources are optimized for cost savings. In this article, we’ll explore how companies can strike that balance by elevating cost efficiency to equal status alongside security and resiliency. This creates the three-legged stool needed for a truly modern, secure, resilient, and efficient environment for your workloads.
Defining Success When the Stakes are High
Before an enterprise begins to tackle their cloud costs, they need to invest in defining meaningful goals and key metrics to help establish what a successful initiative will look like. Cost efficiency doesn’t have to mean sacrificing security or reliability, so it’s important to get this right.
Any software engineering initiative, cost efficiency included, needs a clearly defined end state or goal in order to succeed. Without an achievable, well-defined outcome, organizations may struggle to measure progress, leading to costly missteps and confusion among stakeholders. A definition of done establishes a shared understanding of goals from the outset and, critically, a benchmark for testing and evaluating future initiatives. When discussing cost efficiency, the key objectives should always remain security, resiliency, and customer experience.
“Making security, resiliency and efficiency peers to one another for engineering decisions enables organizations to balance risk, customer experience and cost. Our clients constantly review their environments, tech stack and initiatives to ensure they are taking advantage of the latest technologies to meet their goals across all three of these key pillars.”
Walter Somsel, CEO/Founder of Cloud Advisors
Security is non-negotiable when it comes to cloud infrastructure; customer data is valuable to attackers, and if compromised, it can lead to significant financial, reputational, and legal consequences. A well-publicized attack or compromise can permanently destroy customer trust and leave a lasting negative perception of the brand. It represents nothing short of an existential threat to the continued operation of any organization. Security should be considered a must-have aspect of customer experience; users should know an organization has done its utmost to secure their critical data.
Keeping an application stack resilient and able to deliver on performance promises is another key component of customer experience, although throwing more resources (capital) at the problem is rarely the right solution. It’s important to recognize that cost optimization is an ongoing process; it requires a long-term commitment to monitoring and observability initiatives within the application infrastructure, as well as the ability to convert that data into actionable insight.
The Critical Objective: Security
Security always has to be a top priority. While that doesn’t mean it should ever come at the expense of customer experience or developer productivity, the end result of any initiative, feature delivery or change should never be a worse security posture. Software organizations shouldn’t look at cost and security as a compromise; it’s not necessary to take from one to feed the other.
A great example of how cost and security objectives align is infrastructure sprawl. Improperly configured settings and security policies may allow users to provision unnecessary, excessively large, or misconfigured resources. Most readers will be familiar with cloud environments where unaccounted-for resources and shadow projects are sprinkled across dozens of unmonitored cloud provider accounts. Not only does this lead to spiraling costs, but every one of these resources expands the attack surface. Even a fully secured compute node provisioned in predefined network zones with policy and configuration controls still represents some risk; now imagine what a node provisioned on a public network with mostly default settings represents!
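To make this concrete, here is a minimal sketch of how a team might hunt for unaccounted-for resources, assuming an AWS environment where every legitimate instance is expected to carry ownership tags. The tag names, and the idea that untagged instances indicate sprawl, are illustrative assumptions rather than a universal rule.

```python
# Minimal sketch: flag EC2 instances missing ownership tags across all regions.
# Assumes AWS credentials and a default region are configured; the "owner" and
# "cost-center" tag names are illustrative -- substitute your tagging policy.
import boto3

REQUIRED_TAGS = {"owner", "cost-center"}

def find_untagged_instances():
    regions = [r["RegionName"] for r in boto3.client("ec2").describe_regions()["Regions"]]
    findings = []
    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        paginator = ec2.get_paginator("describe_instances")
        for page in paginator.paginate():
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    tags = {t["Key"] for t in instance.get("Tags", [])}
                    if not REQUIRED_TAGS.issubset(tags):
                        findings.append((region, instance["InstanceId"]))
    return findings

if __name__ == "__main__":
    for region, instance_id in find_untagged_instances():
        print(f"Untagged instance {instance_id} in {region}")
```

A report like this is only a starting point; the findings still need an owner, a budget line, or a decommissioning plan.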
Another example where cost efficiency and security are in lockstep is autoscaling. Autoscaling is a general term for services provided by cloud platforms that enable customers to dynamically adjust the amount of resources allocated to a given workload based on different factors. Autoscaling automation is typically driven by performance-based factors, such as CPU or memory utilization: once a predefined utilization threshold is crossed, new resources are automatically provisioned and allocated to the workload. Autoscaling can also be schedule-based, with resources scaled up or down at fixed times of day. Either way, keeping the resource footprint small during low-demand periods reduces both attack surface and cost.
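As a rough illustration, the sketch below pairs a utilization-based policy with a scheduled overnight scale-down using AWS Auto Scaling. The group name, target value, and cron schedule are hypothetical placeholders, to be replaced with values derived from real demand data.

```python
# Sketch: combine a CPU target-tracking policy with a scheduled scale-down
# for off-peak hours. "web-tier-asg", the 50% target, and the 01:00 UTC
# schedule are illustrative assumptions, not recommendations.
import boto3

autoscaling = boto3.client("autoscaling")

# Utilization-based: add or remove instances to keep average CPU near 50%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 50.0,
    },
)

# Schedule-based: shrink the fleet overnight, when demand (and attack surface)
# should be at its lowest.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-tier-asg",
    ScheduledActionName="overnight-scale-down",
    Recurrence="0 1 * * *",  # 01:00 UTC daily
    MinSize=1,
    MaxSize=2,
    DesiredCapacity=1,
)
```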
The exercise of optimizing for cost efficiency can even have a knock-on effect of improving security posture. As mentioned in the previous section, cost optimization requires an ongoing effort to continuously improve monitoring and visibility. As a result of this continued focus on operational visibility, an organization will have a much greater sensitivity to changes or outlier metrics. A sudden increase in cost could represent a compromised compute node or a denial-of-service attack.
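A simple version of that cost-based alarm could look like the following sketch, which compares yesterday’s spend against the trailing-week average via the AWS Cost Explorer API. The 1.5x threshold is an arbitrary illustrative value, not a recommendation.

```python
# Sketch: flag a daily cost spike relative to the trailing-week average.
# Assumes Cost Explorer is enabled; the 1.5x multiplier is illustrative.
import boto3
from datetime import date, timedelta

ce = boto3.client("ce", region_name="us-east-1")

end = date.today()
start = end - timedelta(days=8)
response = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
)

daily = [float(day["Total"]["UnblendedCost"]["Amount"]) for day in response["ResultsByTime"]]
baseline = sum(daily[:-1]) / len(daily[:-1])
latest = daily[-1]

if latest > 1.5 * baseline:
    # A sudden jump may be legitimate growth -- or a compromised node or DoS traffic.
    print(f"Cost spike: ${latest:.2f} vs ~${baseline:.2f} daily average; investigate.")
```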
Focus on Customer Experience: Staying Resilient
Maintaining a good customer experience should be another top priority when delivering software applications. Modern web applications are complex, distributed systems that are subject to a wide variety of adverse conditions that typically aren’t found in legacy computing environments. Well-designed software infrastructure should be able to weather network congestion, data-center or cloud outages, and fluid user demand, all while delivering on performance and experience SLAs.
Unfortunately, software organizations are often ill-prepared to address these conditions in a way that’s good for the bottom line. A typical approach is to throw more resources at the problem until the infrastructure can handle peak times with some headroom. The stack is never scaled down, and over time costs balloon. The problem becomes particularly insidious because cloud providers will often raise the price of older, less-efficient resources as they migrate to more energy- and space-efficient hardware. Eventually the organization’s hand is forced, and engineering teams must undergo costly, time-consuming migration exercises.
Correctly implementing resiliency and cost efficiency requires careful planning and design choices up front. Improving latency and performance means teams are less likely to fall back on throwing resources at the problem, an approach that inflates both cost and attack surface. For instance, spending the time to correctly decouple frontend components means that customer-facing processing or rendering workloads can be moved much closer to the user. Using a content delivery network (CDN) with edge computing improves app responsiveness and reduces network bandwidth consumption. Bandwidth in particular is typically one of the largest cost drivers in cloud computing, so any reduction here is a win.
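For example, a frontend build pipeline might publish static assets with long cache lifetimes so that a CDN can serve them from the edge rather than the origin. The sketch below assumes an S3 bucket sitting behind a CDN such as CloudFront; the bucket name, file list, and cache policy are illustrative.

```python
# Sketch: upload versioned static assets with long cache lifetimes so the CDN
# can serve them from edge locations, cutting latency and origin bandwidth.
# Bucket name and asset paths are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

ASSETS = {
    "app.js": "application/javascript",
    "styles.css": "text/css",
}

for filename, content_type in ASSETS.items():
    with open(filename, "rb") as f:
        s3.put_object(
            Bucket="example-frontend-assets",
            Key=f"static/{filename}",
            Body=f,
            ContentType=content_type,
            # Immutable, versioned assets can be cached at the edge for a year.
            CacheControl="public, max-age=31536000, immutable",
        )
```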
Returning to autoscaling, it can be used to great effect to handle fluid usage demands. However, it requires an understanding of the actual load metrics, the shape and timing of demand periods, and best-practice configurations for the cloud provider in question. For example, AWS autoscaling is not instantaneous: there is typically a multi-minute window between the identification of a need for more resources and those resources being available to serve requests. In a context where milliseconds matter, a multi-minute lag is an eternity. This can be mitigated through careful system design, but it’s not simply a matter of flipping a switch.
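One mitigation, sketched below, is to keep a warm pool of pre-initialized instances that can join the group faster than freshly launched ones. The group name and pool size are hypothetical and should be tuned against measured demand.

```python
# Sketch: attach a warm pool of stopped, pre-initialized instances to an
# Auto Scaling group to shorten scale-out time. "web-tier-asg" and the pool
# size are illustrative assumptions.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_warm_pool(
    AutoScalingGroupName="web-tier-asg",
    MinSize=2,            # keep at least two instances pre-initialized
    PoolState="Stopped",  # stopped instances incur storage cost, not compute cost
)
```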
When looking to increase resiliency, software organizations will often implement geographic distribution of their compute and data storage. This can provide resiliency against regional or availability zone (AZ) failure, as well as any localized failure of network or internet infrastructure. Geographic distribution also factors heavily in most disaster recovery playbooks for cloud-hosted companies. Serving users effectively from multi-region application infrastructure can be a complex problem to tackle; as scale and geographic distance grow, careful consideration must be given to concerns such as data consistency. If an application serves traffic around the world, a single database cluster hosted in the United States will lead to poor performance for a large number of users.
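One common pattern, sketched below, is to add a cross-region read replica so that read-heavy traffic is served locally while writes continue to flow to the primary. The identifiers, regions, account number, and instance class are illustrative, and the application must be designed to tolerate replication lag.

```python
# Sketch: create an RDS read replica in Europe from a US-hosted primary so
# European read traffic stays local. All identifiers are illustrative.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="orders-db-replica-eu",
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:orders-db",
    SourceRegion="us-east-1",  # lets boto3 handle the cross-region request signing
    DBInstanceClass="db.r6g.large",
)
```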
Cost Efficiency Doesn’t Have to be a Compromise
“The key is to factor cost efficiency into the decision process in the architecture and development cycles. This includes starting early during initial planning and continues through all stages of the software life cycle. Too often we see efficiency not have the same consideration leading to excess costs impacting the cloud bill and a company’s margins.”
Walter Somsel, CEO/Founder of Cloud Advisors
These concerns create an environment where cost efficiency is often pushed aside. Security and resiliency are still the right objectives to focus on, but they don’t have to be treated as being at odds with cost-efficient architecture. It doesn’t need to be a competition in which security, resiliency, and efficiency fight for resources to the detriment of one another. Organizations need to balance providing secure and resilient services with ensuring that their cloud resources are optimized for cost savings. Before attempting to tackle cloud costs, organizations should set goals and key metrics based on carefully collected utilization and performance data. Prioritizing security and resiliency can still lead to cloud cost wins.
For enterprises that want their cloud spend to be as efficient as possible, Intel Tiber App-Level Optimization helps strike the balance between providing secure and resilient services and keeping cloud resources optimized for maximum cost efficiency. Our cloud infrastructure expertise allows us to identify and implement areas of improvement that maximize cloud performance, while strategically scaling cloud usage to ensure it remains in line with customer demand.