Asaf Ezra · Jan 14, 2020
Alert fatigue is the disorder that desensitizes people to security warnings by overwhelming them with more information than they can manage. With the onslaught of monitoring solutions for everything from server utilization to ad-revenue anomalies, alert fatigue takes its toll. The syndrome has become pervasive in DevOps and IT departments, where individuals are glued to never-ending streams of metrics, statuses, and conditions. There is only so long that people can stare at a screen, and only so many alerts they can handle effectively. Overwhelmed by red flags and blinking “danger” windows – most of which are false positives, irrelevant, or inconsequential – burdened workers often ignore them or fail to respond appropriately and in time.
Alert fatigue comes with a very heavy price tag: increased risk. Amid a surfeit of false positives, a real alert can be overlooked and go on to become the incident that appears in tomorrow’s newspaper. Incident response may come too late – after the damage is already done – causing major disruption to revenue, costs, and brand reputation.
Manually set thresholds are to blame for the majority of false-positive alerts. If you monitor your current users and activity against static, manually set thresholds, you have no way to factor in cycles such as seasonality, special sales, or holidays, so the system generates thousands of false alarms that overwhelm staff.
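To see why static thresholds break down under seasonality, consider this minimal sketch (the traffic numbers and the weekend-sale scenario are illustrative assumptions, not data from the article):

```python
def static_alerts(values, threshold):
    """Return the indices where a fixed threshold fires."""
    return [i for i, v in enumerate(values) if v > threshold]

# Simulated daily request rate, Mon..Sun: weekday baseline ~100,
# with an expected weekend sale driving traffic to ~180.
traffic = [100, 105, 98, 102, 110, 180, 175]

# A threshold tuned for weekdays fires on every weekend peak.
alerts = static_alerts(traffic, threshold=150)
print(alerts)  # [5, 6] - both are expected peaks, i.e., false positives
```

Raising the threshold to silence the weekend would instead blind the system to genuine weekday anomalies, which is exactly the trade-off that manual thresholding cannot escape.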
Some advanced monitoring solutions offer machine-learning-based thresholding, which autonomously discovers the actual metric baseline while taking seasonality and other dynamic behavior into account. By concurrently monitoring hundreds – even thousands – of metrics and millions of data points, ML significantly reduces the volume of false-positive alerts.
Redundant alerts are another major cause of alert fatigue. With studies placing the number of redundant alerts as high as 60% of the total, it’s clear why consolidating these alerts and reducing the flow of reminders can help keep the alert load manageable. Some advanced solutions automatically consolidate redundant alerts by employing root-cause analysis that can bundle together similar alerts so as to reduce their volume without adding risk.
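A minimal sketch of such consolidation is grouping alerts by a shared fingerprint (here, host plus check type) and emitting one bundled alert per group with a count. Real root-cause engines correlate far more signals; the field names and grouping key below are illustrative assumptions:

```python
from collections import defaultdict

def consolidate(alerts):
    """Bundle alerts that share a fingerprint (host, check) into one summary each."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["host"], alert["check"])].append(alert)
    return [
        {"host": host, "check": check, "count": len(group),
         "first_seen": min(a["ts"] for a in group)}
        for (host, check), group in groups.items()
    ]

alerts = [
    {"host": "web1", "check": "cpu", "ts": 1},
    {"host": "web1", "check": "cpu", "ts": 2},
    {"host": "web1", "check": "cpu", "ts": 3},
    {"host": "db1", "check": "disk", "ts": 5},
]
print(consolidate(alerts))  # 4 raw alerts collapse into 2 consolidated ones
```

Keeping the count and first-seen timestamp preserves the information an operator needs while cutting the number of notifications, which is the point: less volume, no added risk.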
While advanced alert-processing solutions offer considerable relief from alert fatigue, the more complete solution comes from the next wave of DevOps and IT technologies, which address all four stages of detection and response: monitoring, data acquisition and aggregation, analysis, and action.
As opposed to mere monitoring solutions, these newer optimization solutions don’t just alert on anomalous conditions in IT operations – they remediate them on the fly. At present, there are two levels of automated remediation: semi-autonomous and fully autonomous. Semi-autonomous remediation handles issues within a prescribed set of tasks designated by IT and security managers (e.g., protocols for various problems). Fully autonomous remediation solutions are completely self-learning, self-correcting systems that require no human intervention at all.
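The semi-autonomous level can be sketched as a runbook that maps known conditions to prescribed actions, escalating anything outside the runbook to a human. The issue names and actions below are hypothetical, invented for illustration:

```python
# A manager-prescribed runbook: known condition -> automated fix.
RUNBOOK = {
    "disk_full": lambda host: f"rotated logs on {host}",
    "service_down": lambda host: f"restarted service on {host}",
}

def remediate(issue, host):
    """Apply the prescribed fix, or escalate when no protocol exists."""
    action = RUNBOOK.get(issue)
    if action is None:
        # Semi-autonomous systems stop here; fully autonomous ones would
        # learn a corrective action instead of paging a human.
        return f"escalated {issue} on {host} to on-call"
    return action(host)

print(remediate("disk_full", "web1"))   # rotated logs on web1
print(remediate("cpu_spike", "web1"))   # escalated cpu_spike on web1 to on-call
```

The boundary of the runbook is exactly the boundary between the two levels: a fully autonomous system would replace the escalation branch with a learned corrective action.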
Full auto-remediation is the holy grail of DevOps and IT departments, but the technology isn’t completely ready yet; Gartner expects it to fully mature over the next five years. Some technologies – like Granulate – are already disrupting the alert-fatigue disease by replacing monitoring and alerting with real-time, continuous optimization. For businesses using these ultra-advanced technologies, alert fatigue is already a disease of the past.