How Do You Secure Apache Spark?
Apache Spark security includes methodologies, tools, and best practices to protect Spark applications and data from unauthorized access, theft, or damage. It involves securing data at rest and in transit, implementing proper authentication and authorization mechanisms, and ensuring the integrity of data processing and storage operations within a Spark environment.
The Spark framework offers built-in security features such as encryption capabilities, network security options, and integration with external security tools like LDAP for user authentication. However, correctly using these features requires a good understanding of Spark’s architecture and the potential vulnerabilities it faces in different deployment scenarios.
In this article:
- Why Is Spark Security Important?
- Examples of Recent Spark Security Vulnerabilities
- 6 Ways to Secure Your Spark Deployment
Why Is Spark Security Important?
Apache Spark is widely used for processing large datasets, making security crucial to prevent unauthorized access and ensure data integrity. In environments where sensitive information is handled, a security breach could lead to data loss, privacy violations, and legal repercussions.
Spark typically operates within distributed computing environments, which exposes it to a wider range of potential attack vectors that must be addressed to protect both the data and the underlying infrastructure.
A secure Spark environment minimizes the risk of downtime caused by malicious activities or data breaches. This stability is necessary for organizations relying on Spark for critical operations, where any disruption could result in financial loss and damage to reputation.
Related content: Read our guide to Spark performance
Examples of Recent Spark Security Vulnerabilities
The following examples of severe vulnerabilities discovered in Apache Spark illustrate the security risks facing the platform.
CVE-2023-22946
CVE-2023-22946 is a medium-severity vulnerability impacting Apache Spark versions prior to 3.4.0. When the ‘proxy-user’ feature is used with spark-submit, an application can include malicious classes on the classpath via its classpath configuration and thereby run code with the privileges of the user running spark-submit, rather than the reduced privileges of the proxy user.
This vulnerability poses risks in environments that rely on proxy-user configurations, such as those using Apache Livy for application management. To mitigate it, upgrading to Apache Spark version 3.4.0 or later is recommended. Additionally, the configuration setting spark.submit.proxyUser.allowCustomClasspathInClusterMode should remain at its default value of “false” and not be overridden by submitted applications.
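As a quick illustration of that mitigation, the hypothetical Scala guard below reads the setting from a SparkConf and fails fast if it has been overridden; the check itself is an example pattern you might add to your own tooling, not something Spark provides.

```scala
import org.apache.spark.SparkConf

object ProxyUserClasspathGuard {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val key  = "spark.submit.proxyUser.allowCustomClasspathInClusterMode"
    // Default is false; leaving it false avoids the CVE-2023-22946 privilege escalation.
    val allowCustomClasspath = conf.getBoolean(key, defaultValue = false)
    require(!allowCustomClasspath,
      s"$key must remain false when proxy-user submission is in use")
  }
}
```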
CVE-2022-31777
CVE-2022-31777 is a stored cross-site scripting (XSS) vulnerability affecting Apache Spark versions 3.2.1 and earlier, as well as 3.3.0. It allows attackers to execute arbitrary JavaScript in end users’ browsers by embedding malicious scripts into log output that is subsequently rendered in the Spark UI.
This flaw can lead to unauthorized access to user sessions or sensitive data displayed within the browser. Mitigation involves upgrading to Spark version 3.2.2 (or 3.3.1 on the 3.3.x line), which addresses the vulnerability by sanitizing log output before it is rendered in the UI.
CVE-2022-33891
CVE-2022-33891 relates to a shell command injection vulnerability in the Spark UI. It affects versions up to 3.1.3 and from 3.2.0 to 3.2.1, allowing attackers to execute arbitrary shell commands on the server where Spark runs. This issue arises when Access Control Lists (ACLs) are enabled via the configuration option spark.acls.enable.
Attackers can exploit this vulnerability by supplying a crafted username during the permission check, which Spark ultimately passes into a Unix shell command, allowing arbitrary commands to run as the user executing Spark. Mitigation requires upgrading to Apache Spark 3.2.2, 3.3.0, or later, where the issue has been patched.
CVE-2021-38296
CVE-2021-38296 refers to a vulnerability in Apache Spark’s key negotiation process for RPC connections secured by spark.authenticate and spark.network.crypto.enabled. This flaw, present in versions 3.1.2 and earlier, allows attackers to intercept and decrypt traffic by exploiting the custom mutual authentication protocol used for key exchange.
The vulnerability exposes sensitive data transmitted over these connections, posing a risk to confidentiality. Mitigation involves updating to Apache Spark version 3.1.3 or later, where the vulnerability has been addressed.
CVE-2020-9480
CVE-2020-9480 exposes a remote code execution (RCE) vulnerability within Apache Spark’s standalone cluster manager when authentication is enabled (spark.authenticate). It allows attackers to bypass the shared secret authentication mechanism and execute arbitrary shell commands on the host machine running the Spark master.
This vulnerability affects Apache Spark versions 2.4.5 and earlier, potentially leading to unauthorized command execution and compromise of the host system. Mitigation strategies include updating to Apache Spark version 2.4.6 or 3.0.0, which contain fixes for this vulnerability, and limiting network access to cluster machines to trusted hosts.
6 Ways to Secure Your Spark Deployment
Here are some of the measures that organizations can take to ensure the security of their applications and data in Apache Spark.
1. Enable Encryption
For network communication, Spark supports AES-based encryption, ensuring that data exchanged between nodes is protected. This feature requires authentication to be enabled via the spark.authenticate configuration parameter.
Local storage encryption can be enabled to secure temporary data written to disk, including shuffle files, shuffle spills, and cached data blocks, so that anyone with access to a node’s local disks cannot read intermediate job data.
To implement encryption, administrators configure the relevant properties in Spark’s configuration (typically spark-defaults.conf). For network encryption, spark.network.crypto.enabled must be set to true; enabling disk I/O encryption involves setting spark.io.encryption.enabled to true and optionally specifying the key size and key-generation algorithm.
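A minimal sketch of these settings in Scala (for example, pasted into spark-shell or built into your application code); the key size and key-generation algorithm are illustrative choices, and in most deployments the same properties would live in spark-defaults.conf so that every component picks them up.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // RPC authentication is a prerequisite for network encryption.
  .set("spark.authenticate", "true")
  // AES-based encryption for RPC traffic between nodes.
  .set("spark.network.crypto.enabled", "true")
  // Encrypt temporary data spilled to local disk (shuffle files, cached blocks).
  .set("spark.io.encryption.enabled", "true")
  .set("spark.io.encryption.keySizeBits", "256")             // 128, 192, or 256
  .set("spark.io.encryption.keygen.algorithm", "HmacSHA256") // illustrative choice
```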
2. Implement Authentication and Authorization
Authentication for Apache Spark’s web UIs is handled through javax.servlet filters, and Spark does not ship a built-in filter; administrators must deploy a custom filter that implements the desired authentication method (for example, against an LDAP directory). Once authentication is in place, Spark supports access control lists (ACLs) for applications, distinguishing between permission to view an application and permission to modify it.
ACL configurations can specify individual users or groups with comma-separated values, enabling shared cluster environments to define administrators or developers with necessary access rights. The configuration also allows the use of wildcards (*) for broader access rights. Group membership is determined by a configurable group mapping provider.
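A sketch of how these settings might fit together, assuming a custom filter class named com.example.BasicAuthFilter (which you would implement yourself) and placeholder user and group names.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Custom javax.servlet.Filter that performs authentication (hypothetical class).
  .set("spark.ui.filters", "com.example.BasicAuthFilter")
  // Enforce ACL checks once users are authenticated.
  .set("spark.acls.enable", "true")
  // Users and groups allowed to view application details (placeholder names).
  .set("spark.ui.view.acls", "alice,bob")
  .set("spark.ui.view.acls.groups", "data-engineering")
  // Users allowed to modify a running application, e.g. kill jobs.
  .set("spark.modify.acls", "alice")
  // Administrators implicitly get both view and modify rights; "*" would grant everyone.
  .set("spark.admin.acls", "spark-admin")
```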
3. Configure SSL
SSL is critical for securing communication across the cluster, including web UIs and data transfers. It allows administrators to enforce encrypted connections, preventing potential eavesdropping or data tampering by unauthorized parties. The configuration is hierarchical, enabling default settings for all communications and overrides for components like the web UI or history server.
To configure SSL, administrators set several namespaced parameters: ${ns}.enabled to turn it on, the key store and trust store locations and passwords, and the encryption protocol via ${ns}.protocol, where ${ns} is spark.ssl for cluster-wide defaults or a component-specific namespace such as spark.ssl.ui. These settings also require careful management of the underlying cryptographic keys and certificates so they are securely generated and stored.
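The sketch below shows the hierarchy with spark.ssl as the default namespace and spark.ssl.ui as a component-level override; the store paths, passwords, and port are placeholders, and plain-text passwords in configuration should normally be replaced with a secure secret-management mechanism.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Defaults for all SSL-capable components (${ns} = spark.ssl).
  .set("spark.ssl.enabled", "true")
  .set("spark.ssl.protocol", "TLSv1.2")
  .set("spark.ssl.keyStore", "/etc/spark/ssl/keystore.jks")     // placeholder path
  .set("spark.ssl.keyStorePassword", "changeit")                // placeholder secret
  .set("spark.ssl.trustStore", "/etc/spark/ssl/truststore.jks") // placeholder path
  .set("spark.ssl.trustStorePassword", "changeit")              // placeholder secret
  // Component-level override for the web UI (${ns} = spark.ssl.ui).
  .set("spark.ssl.ui.port", "4440")                             // illustrative port
```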
4. Include HTTP Security Headers
Apache Spark offers the capability to include HTTP security headers, increasing protection against common web vulnerabilities. By configuring specific properties, administrators can mitigate risks such as Cross-Site Scripting (XSS) and Cross-Frame Scripting (XFS), and enforce HTTP Strict Transport Security (HSTS).
For example, setting spark.ui.xXssProtection to 1; mode=block activates XSS filtering in browsers, preventing the rendering of pages when an XSS attack is detected. Enabling X-Content-Type-Options by setting spark.ui.xContentTypeOptions.enabled to true prevents MIME-sniffing attacks that could misinterpret files’ content types.
To further strengthen security, Spark allows the configuration of HSTS headers through spark.ui.strictTransportSecurity, which enforces secure connections (HTTPS) for all future interactions with the web UI. This helps prevent man-in-the-middle attacks.
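Put together, the header-related properties from this section might look like the following sketch; the HSTS value is illustrative and the header only takes effect when the UI is already served over HTTPS.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Ask browsers to stop rendering a page when a reflected XSS attack is detected.
  .set("spark.ui.xXssProtection", "1; mode=block")
  // Emit X-Content-Type-Options: nosniff to block MIME-sniffing.
  .set("spark.ui.xContentTypeOptions.enabled", "true")
  // HSTS header value; meaningful only when the UI is served over HTTPS.
  .set("spark.ui.strictTransportSecurity", "max-age=31536000")
```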
5. Configure Ports for Network Security
When configuring Apache Spark for network security, admins must manage and restrict the ports used by Spark components. In standalone mode, specific ports are designated for web UI access and for communication between cluster components. For example, the master’s web UI defaults to port 8080, while the master itself listens on port 7077 for worker registration and job submissions.
Ensuring these ports are correctly configured and restricted to trusted networks is crucial for preventing unauthorized access. Beyond standalone mode, Spark uses a broader set of ports across cluster managers for driver-executor communication and block manager interactions; many of these are randomly assigned by default but can be pinned to specific values for tighter firewall control.
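As an example of pinning the normally random ports so a firewall can restrict them, the sketch below uses illustrative values that should be adapted to your own network layout.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Pin the driver's RPC and block manager ports instead of letting Spark pick random ones.
  .set("spark.driver.port", "7078")        // illustrative value
  .set("spark.blockManager.port", "7079")  // illustrative value
  // Application web UI port (4040 is the default).
  .set("spark.ui.port", "4040")
  // How many successive ports to try if the requested one is already taken.
  .set("spark.port.maxRetries", "16")
```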
6. Secure Interaction with Kubernetes
When securing Spark interactions with Kubernetes, it is crucial to handle Kerberos-based authentication properly. Spark requires delegation tokens to authenticate non-local processes when communicating with Hadoop-based services behind Kerberos. These tokens are stored in Kubernetes Secrets shared between the Spark Driver and its Executors.
There are several methods to submit a Kerberos job in this context. One approach involves using a local Keytab and Principal, where Spark submits the job with the necessary Kerberos credentials configured. Another method is to use pre-populated secrets containing the delegation token within the Kubernetes namespace. This approach ensures that the necessary credentials are securely managed and accessible only to authorized Spark components.
Additionally, it is important to ensure that the Kerberos Key Distribution Center (KDC) is accessible from within the Kubernetes containers. This setup involves configuring the environment variable HADOOP_CONF_DIR or specifying the ConfigMap name with spark.kubernetes.hadoop.configMapName.
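The two submission styles described above might be configured roughly as follows; the keytab path, principal, Secret name, item key, and ConfigMap name are all placeholders for your environment.

```scala
import org.apache.spark.SparkConf

// Variant 1: submit with a local keytab and principal (placeholder values).
val keytabConf = new SparkConf()
  .set("spark.kerberos.keytab", "/etc/security/keytabs/spark.keytab")
  .set("spark.kerberos.principal", "spark/cluster@EXAMPLE.COM")
  // Hadoop configuration made available to driver and executor pods via a ConfigMap.
  .set("spark.kubernetes.hadoop.configMapName", "hadoop-conf")

// Variant 2: reuse a delegation token already stored in a Kubernetes Secret.
val tokenConf = new SparkConf()
  .set("spark.kubernetes.kerberos.tokenSecret.name", "spark-delegation-token")
  .set("spark.kubernetes.kerberos.tokenSecret.itemKey", "hadoop-token")
  .set("spark.kubernetes.hadoop.configMapName", "hadoop-conf")
```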
Spark Optimization with Intel Tiber
Intel Tiber App-Level Optimization optimizes Apache Spark at several levels: executor dynamic allocation is tuned based on job patterns and predictive idle heuristics, while JVM runtimes and the Spark infrastructure itself are continuously and autonomously optimized.