What Is Apache Spark on AWS?
Apache Spark is an open source, distributed data processing system for big data applications. It enables fast data analysis using in-memory caching and optimized query execution. Spark offers development APIs for Scala, R, Java, and Python, enabling you to reuse code across different workloads, including batch processing, real-time analytics, interactive queries, graph processing, and machine learning.
Amazon Elastic MapReduce (EMR) offers a suitable cloud environment for deploying Apache Spark. It combines the testing and integration rigor of commercial Spark and Hadoop distributions with the simplicity, cost-effectiveness, and scale of the Amazon cloud. With EMR, you can launch Spark clusters quickly without provisioning nodes, configuring Spark, or setting up and tuning clusters.
EMR allows you to provision up to thousands of instances in minutes. You can enable EMR to scale Spark clusters automatically using Auto Scaling—this lets you process large amounts of data and scale back down upon the job’s completion to avoid paying for unneeded capacity.
In this article:
- Apache Spark on Amazon EMR: Features and Benefits
- Create a Cluster with Spark
- 4 Ways to Optimize Spark Performance on AWS EMR
Apache Spark on Amazon EMR: Features and Benefits
Here are some important EMR features:
Amazon EMR runtime for Apache Spark
This feature is a runtime environment optimized for performance—it runs by default on every Amazon EMR cluster. Clusters with the EMR runtime are over three times faster than clusters without and are fully API-compatible with Apache Spark. The improved performance allows workloads to run faster, reducing computing costs without requiring application changes.
Directed acyclic graph (DAG) execution engine
This engine allows Spark to plan queries efficiently for data transformation projects. Spark keeps input, intermediate, and output data in memory as resilient distributed datasets (RDDs) and the DataFrames built on them. This avoids repeated disk I/O and speeds up interactive and iterative workloads.
Apache Spark has native Java, SQL, Python, and Scala support, providing various language options to build applications. You can also use the Spark SQL module to submit HiveQL and SQL queries. You can run applications through the Spark API and work interactively in Scala or Python using EMR Studio or Jupyter notebooks on your clusters. EMR 6.0 supports Apache Hadoop 3.0, which lets you use Docker containers to simplify dependency management.
Spark offers several libraries that help you build applications for specific use cases—MLlib (machine learning), Spark Streaming (stream processing), and GraphX (graph processing). These tightly integrated libraries are available out of the box and can address various use cases. You can also leverage a deep learning framework like Apache MXNet for Spark applications. Integrating with AWS Step Functions will allow you to automate and orchestrate serverless workflows.
You can submit Spark jobs through the EMR Step API. Use Apache Spark with EMRFS to access data directly in Amazon S3. EC2 Spot lets you save costs by using spare Amazon EC2 capacity. EMR Managed Scaling dynamically adds or removes capacity, and you can launch transient or long-lived clusters depending on the workload.
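As a rough illustration of submitting a Spark job as an EMR step, the sketch below only builds the step definition as a plain Python dictionary. The bucket name, script path, and arguments are hypothetical placeholders, and nothing is sent to AWS.

```python
def build_spark_step(name, script_uri, extra_args=None):
    """Build a step definition in the shape used by the EMR AddJobFlowSteps API."""
    # command-runner.jar runs the given command (here, spark-submit) on the cluster.
    args = ["spark-submit", "--deploy-mode", "cluster", script_uri]
    args.extend(extra_args or [])
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


# Hypothetical S3 locations; on EMR, s3:// paths are read through EMRFS.
step = build_spark_step(
    "Example Spark job",
    "s3://example-bucket/jobs/etl.py",
    ["--input", "s3://example-bucket/raw/"],
)
print(step["HadoopJarStep"]["Args"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed to `boto3.client("emr").add_job_flow_steps(JobFlowId=..., Steps=[step])`, where the cluster ID is whatever your own cluster reports.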
You can use EMR security configurations to set up Spark authentication and encryption with Kerberos. Another useful feature is AWS Glue Data Catalog, which lets you store metadata from Spark SQL tables. You can also use Amazon SageMaker for machine learning Spark pipelines.
Create a Cluster with Spark
The following procedure creates a cluster with Spark installed using Quick Options in the EMR console.
To launch a cluster with Spark installed via the Amazon console:
- Open the Amazon EMR console.
- Click Create cluster.
- Enter a Cluster name.
- Under Software Configuration, choose a Release option.
- Under Applications, choose the Spark application bundle.
- Select other options as necessary and then click the Create cluster button.
To launch a cluster with Spark installed using the AWS CLI:
Create the cluster with the following command:
aws emr create-cluster --name "Spark cluster" --release-label emr-5.36.0 --applications Name=Spark \
  --ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 --use-default-roles
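If you also want to set Spark properties at launch, one option is a configuration file using the spark-defaults classification. The sketch below generates such a file's contents. The property values are illustrative assumptions, not tuning recommendations.

```python
import json

# Hypothetical property choices; adjust or remove them for a real workload.
spark_defaults = {
    "spark.executor.memory": "4g",
    "spark.dynamicAllocation.enabled": "true",
}

# EMR expects a list of classification objects, with property values as strings.
configurations = [
    {"Classification": "spark-defaults", "Properties": spark_defaults}
]

config_json = json.dumps(configurations, indent=2)
print(config_json)
```

Saving the output to a file (for example, a hypothetical spark-config.json) lets you pass it to the create-cluster command with --configurations file://spark-config.json.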
4 Ways to Optimize Spark Performance on AWS EMR
EMR offers several features to help optimize performance in Spark.
1. Adaptive Query Execution
Adaptive query execution lets Spark re-optimize query plans while a query runs, based on runtime statistics. Here are some Apache Spark 3 adaptive query execution optimizations available in the EMR runtime:
- Adaptive join conversion—converts sort-merge-join operations to broadcast-hash-join based on the query stage’s runtime size. Broadcast-hash-join operations usually perform better if one side of the join is small.
- Adaptive coalescing of shuffle partitions—coalesces small shuffle partitions to prevent the overhead from having too many tasks. To improve performance, you can configure more shuffle partitions upfront and reduce them at runtime.
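The two optimizations above can be pictured with a small pure-Python simulation. It is a simplified sketch, not Spark's actual implementation. The partition sizes and the 64 MB target are made-up values, and the 10 MB threshold mirrors the default of Spark's spark.sql.autoBroadcastJoinThreshold property.

```python
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # assumed threshold (Spark's default is 10 MB)


def choose_join_strategy(smaller_side_bytes, threshold=BROADCAST_THRESHOLD_BYTES):
    """Pick a join strategy from the size observed at run time."""
    if smaller_side_bytes <= threshold:
        return "broadcast-hash-join"
    return "sort-merge-join"


def coalesce_partitions(sizes_mb, target_mb):
    """Greedily merge adjacent shuffle partitions until a group would exceed target_mb."""
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes_mb):
        # Close the current group if adding this partition would overshoot the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


sizes = [5, 10, 20, 40, 3, 2, 64, 1]  # hypothetical shuffle partition sizes in MB
groups = coalesce_partitions(sizes, target_mb=64)
print(len(sizes), "partitions coalesced into", len(groups), "tasks:", groups)
print(choose_join_strategy(2 * 1024 * 1024))
```

In Spark 3 itself, this behavior is controlled by properties such as spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled rather than by user code.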
2. Dynamic Partition Pruning
You can improve job performance by reading only the table partitions that a query actually needs. Dynamic partition pruning reduces the amount of data read and processed, which shortens job execution time (this feature is enabled by default in Amazon EMR 5.26.0).
With dynamic partition pruning, the Spark engine infers at runtime which partitions it must read and which it can safely skip. For example, if a table is partitioned by region and the query concerns only one region, Spark can prune the partitions for every other region. Without pruning, the query would read data for all regions before filtering down to the relevant one.
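To make the region example concrete, here is a minimal pure-Python sketch of the idea. It assumes a hypothetical sales table stored as one partition per region and a small region lookup table. It only illustrates the concept and is not how Spark implements pruning internally.

```python
# Hypothetical fact table, stored as one partition per region.
sales_partitions = {
    "us-east": [("order-1", 120), ("order-2", 80)],
    "eu-west": [("order-3", 200)],
    "ap-south": [("order-4", 50), ("order-5", 75)],
}

# Hypothetical small dimension table that the query filters on.
regions = [
    {"region": "us-east", "continent": "NA"},
    {"region": "eu-west", "continent": "EU"},
    {"region": "ap-south", "continent": "AS"},
]


def scan_with_pruning(partitions, dimension_rows, predicate):
    """Resolve the dimension filter first, then read only the matching partitions."""
    wanted = {row["region"] for row in dimension_rows if predicate(row)}
    scanned = [key for key in partitions if key in wanted]
    rows = [row for key in scanned for row in partitions[key]]
    return scanned, rows


scanned, rows = scan_with_pruning(
    sales_partitions, regions, lambda row: row["continent"] == "EU"
)
print("partitions read:", scanned)
print("rows:", rows)
```

The filter on the small table is evaluated first, so the two partitions for other regions are never read.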
3. Bloom Filter Join
This optimization improves join performance by using a Bloom filter, built from the values on one side of the join, to pre-filter the other side. For example, when a query joins a sales table with an items table and restricts the items to one category, Spark builds a Bloom filter from the items in that category. It then uses the filter to quickly discard sales rows whose items fall outside the category before the join runs.
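The sketch below shows the general technique with a small hand-written Bloom filter. The table contents, filter size, and hash count are arbitrary assumptions, and EMR's internal implementation may differ. A Bloom filter can return false positives but never false negatives, so an exact join still runs on the rows that pass the filter.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter that uses a Python int as its bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Hypothetical tables: items (small, filtered by category) and sales (large).
items = [
    {"item_id": 1, "category": "books"},
    {"item_id": 2, "category": "toys"},
    {"item_id": 3, "category": "books"},
]
sales = [(1, 9.99), (2, 4.50), (3, 12.00), (4, 7.25), (1, 3.00)]

# Build the filter from the item IDs in the requested category.
book_ids = {row["item_id"] for row in items if row["category"] == "books"}
bloom = BloomFilter()
for item_id in book_ids:
    bloom.add(item_id)

# Pre-filter the large side, then run the exact join on the survivors.
candidates = [sale for sale in sales if bloom.might_contain(sale[0])]
joined = [sale for sale in candidates if sale[0] in book_ids]
print(len(sales), "sales rows,", len(candidates), "passed the filter,", len(joined), "joined")
```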
4. Optimized Join Reordering
This feature reorders table joins so that joins involving filters run first (enabled by default in Amazon EMR 5.26.0). Without optimized join reordering, Spark joins tables in the order they are listed in the query: it joins the first two tables, then joins the result with the next table, and so on. With it, Spark first joins the smaller tables that have filters, which shrinks intermediate results early.
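As a loose illustration of the ordering idea, the following sketch applies a simplified heuristic: filtered tables are joined first, smallest first, and the remaining tables keep their listed order. The table names and row counts are hypothetical, and this is not EMR's actual reordering algorithm.

```python
def reorder_joins(fact_table, other_tables):
    """Return a join order: the fact table, then filtered tables (smallest first),
    then unfiltered tables in the order the query listed them."""
    filtered = sorted(
        (table for table in other_tables if table["has_filter"]),
        key=lambda table: table["rows"],
    )
    unfiltered = [table for table in other_tables if not table["has_filter"]]
    return [fact_table] + [table["name"] for table in filtered + unfiltered]


# Hypothetical tables in the order a query lists them after the fact table.
listed = [
    {"name": "store_returns", "rows": 1_000_000, "has_filter": False},
    {"name": "store", "rows": 50, "has_filter": True},
    {"name": "item", "rows": 10_000, "has_filter": False},
]

default_order = ["store_sales"] + [table["name"] for table in listed]
optimized_order = reorder_joins("store_sales", listed)
print("listed order:   ", default_order)
print("reordered joins:", optimized_order)
```

Here the small, filtered store table moves ahead of store_returns, so the first join already narrows the data that later joins must handle.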
Optimizing Apache Spark With Granulate
Granulate optimizes Apache Spark on several levels. It optimizes Spark executor dynamic allocation based on job patterns and predictive idle heuristics. It also autonomously and continuously optimizes JVM runtimes and tunes the Spark infrastructure itself. Granulate can also reduce costs on other execution engines, like Kafka, PySpark, Tez, and MapReduce.