What Is Apache Spark?
Apache Spark is an analytics engine that rapidly performs processing tasks on large datasets. It can distribute data processing tasks on single-node machines or clusters. It can work independently and in cooperation with other distributed computing tools.
Apache Spark supports big data and machine learning processes that require massive computing power to process large data volumes. It abstracts various distributed computing and big data processing tasks, providing an easy-to-use API.
Here are some important features of Apache Spark:
- Diverse deployment options
- Native bindings for programming languages like Java, Python, R, and Scala
- SQL support
- Data streaming
- Graph processing
Apache Spark has diverse use cases. It is commonly used in industries such as finance, gaming, and telecommunications. Spark is in wide production use by major technology companies including Apple, IBM, Microsoft, and Facebook.
This is part of an extensive series of guides about open source.
In this article:
- Spark Architecture and Components
- Apache Spark in the Cloud
- Spark Performance Optimizations Best Practices
- 5 Apache Spark Alternatives
Spark Architecture and Components
A Spark application operates on a cluster as a set of independent processes managed by an object called SparkContext in the main program (a driver).
SparkContext can run on the cluster by connecting to different types of cluster manager (Spark’s standalone manager, YARN, Mesos, or Kubernetes) to allocate resources between applications. When connected, Spark gets an executor from a node in the cluster. This is the process of performing calculations and storing data for an application. It then submits the application code (defined in a Python or JAR file passed to the SparkContext object) to the launcher. Finally, the SparkContext submits the job to the executor for execution.
Image Source: Spark
Resilient Distributed Dataset (RDD)
An RDD is a collection of fault-tolerant elements that can be distributed and processed in parallel across multiple nodes within a cluster. RDDs are the basic architecture of Apache Spark.
Spark loads the data either by referencing the data source or using SparkContext parallelization to process the existing collection in an RDD. When you load data into the RDD, Spark will perform transformations and operations on the in-memory RDD. This is the essence of Spark’s speed. Spark stores data in memory by default, though the user can choose to write data to disk to ensure persistence, because the system might run out of memory.
Each data set in an RDD is partitioned into logical components and can run on different nodes in the cluster. Users can also perform two RDD activities: transformations and jobs. A transform is an operation that creates a new RDD. Jobs tell Apache Spark to apply computations and return results to the driver.
Spark supports various operations and transformations in RDD. Spark handles this deployment, so users don’t have to calculate the correct deployment.
Directed Acyclic Graph (DAG)
Unlike MapReduce’s two-step execution process, Spark creates a directed acyclic graph (DAG) to schedule jobs and orchestrate worker nodes across a cluster. As Spark processes and transforms data during job execution, the DAG scheduler adjusts the cluster’s worker nodes to increase efficiency. This job tracking enables fault tolerance by re-applying logged actions to data in a previous state.
Datasets and DataFrames
In addition to RDD, Spark also handles two data types: DataFrame and Dataset.
DataFrame is the most popular structured application programming interface (API) for representing data tables in rows and columns. RDDs are an important feature of Spark, but are currently in maintenance mode. The popularity of Spark’s machine learning library (MLlib) has made DataFrame the native API for MLlib. DataFrames provide consistency across different languages like Java, Python, R, and Scala, so it’s important to keep this in mind when using this API.
Datasets are extensions of DataFrame that provide an object-oriented, type-safe programming interface. Basically, a dataset is a strongly typed collection of JVM objects, unlike a DataFrame.
Spark SQL allows you to query data from SQL datastores such as DataFrames and Apache Hive. Spark SQL queries return DataFrames or Datasets when executed in another language.
Spark Core is the foundation of parallel processing, process scheduling, RDD, data abstraction, and optimization. Spark Core provides a functional foundation for processing Spark SQL, GraphX graph data, Spark Streaming, Spark libraries, and MLlib machine learning libraries. Alongside the cluster manager, Spark Core distributes and abstracts data across Spark clusters. This deployment and abstraction makes big data processing very fast and user-friendly.
Spark Streaming is a package for processing data streams in real time. There are several types of real-time data streams. Examples include e-commerce websites that log page visits in real time, credit card transactions, and taxi provider apps that send travel and location information for drivers and passengers. Simply put, all of these applications are hosted on multiple web servers that generate event logs in real time.
Spark Streaming defines more APIs to leverage RDD and process data streams in real time. Spark Streaming leverages RDD and its APIs, making it easy for developers to learn and execute use cases without having to learn an entirely new technology stack.
Spark 2.x introduced structured streaming. It uses DataFrames instead of RDDs to process data streams. By using DataFrames as a computational abstraction, all the benefits of the DataFrame API can be applied to stream processing. In the next chapter, we will discuss the advantages of DataFrames over RDD.
Spark Streaming integrates seamlessly with the most popular data message queues such as Apache Flume and Kafka. You can easily attach to these queues to process large data streams.
Learn more in our detailed guide to Spark streaming (coming soon)
Spark offers several a variety of APIs that bring Spark’s power to a broader audience. Spark SQL enables relational interaction with RDD data. Spark also offers well-documented APIs for Java, Python, R, and Scala. Each Spark language API handles data in a nuanced way. RDDs, Datasets, and DataFrames are available for every language API. The range of languages covered by Spark APIs makes big data processing accessible to diverse users with development, data science, statistics, and other backgrounds.
Learn more in our detailed guide to Apache Spark architecture (coming soon)
Apache Spark in the Cloud
Spark on AWS
Amazon supports Apache Spark as part of the Amazon Elastic Map/Reduce (EMR) service. EMR makes it possible to quickly create a managed Spark cluster from the AWS Management Console, AWS CLI, or Amazon EMR API.
EMR features and integrated Amazon services
The managed cluster can benefit from additional Amazon EMR features, including:
- Fast connection to Amazon S3 using the EMR File System (EMRFS)
- Integration with the Amazon EC2 spot market to leverage low-cost spot instances
- AWS Glue Data Catalog
- EMR managed extensions
Spark on EMR can also benefit from AWS Lake Formation, which provides fine-grained access control for data lakes, and integration with AWS Step Functions to orchestrate data pipeline processes.
Spark development with EMR Studio and Zeppelin
Developers can leverage Amazon EMR Studio, an integrated development environment (IDE) for data scientists and data engineers. EMR Studio enables development, visualization, and debugging for applications written in R, Python, Scala and PySpark.
EMR Studio provides fully managed Jupyter Notebooks and tools like the Spark UI and YARN timeline service to simplify debugging of Spark applications and make it easier to build applications with Spark. Alternatively, developers can use Apache Zeppelin to create interactive, collaborative notebooks for data exploration with Spark.
Learn more in our detailed guide to Spark on AWS
Spark on Azure
Microsoft offers several services based on Apache Spark in the Azure cloud:
- Apache Spark on Azure HDInsight—facilitates the creation and configuration of Spark clusters, allowing customization of the entire Spark environment on Azure. The service can store and process any type of data in Azure. It can work with an organization’s existing Azure Blob Storage or Azure Data Lake Storage Gen1/Gen2.
- Spark Pools in Azure Synapse Analytics—supports loading, modeling, processing, and distributing data for analytics insights on Azure.
- Apache Spark on Azure Databricks—uses Spark clusters to power interactive, collaborative workspaces. It can read data from multiple data sources within and outside Azure.
- Spark Activities in Azure Data Factory—provides access to Spark analytics on demand within a data pipeline, using existing Spark clusters or new clusters created on demand.
Learn more in our detailed guide to Spark on Azure (coming soon)
Spark on Google Cloud
Google Cloud’s Dataproc Serverless service can run Spark batch workloads without needing to provision and manage clusters. Users can specify workload parameters and submit a workload to Dataproc Serverless—the service runs workloads on a managed compute infrastructure and automatically scales resources as needed. Pricing applies only to the actual time workloads are processed.
The Dataproc Serverless for Spark service lets you run various Spark workloads, including Pyspark, Spark R, Spark SQL, and Spark Java/Scala. It allows you to specify Spark properties when submitting a Spark batch workload.
In addition to Dadtaproc, you can also use Airflow, a task automation tool, to schedule a Spark batch workload in Google Cloud. Alternatively, you can use Google’s Cloud Composer workflow with the Airflow batch operator.
Spark Performance Optimizations Best Practices
You can use various languages to write machine learning applications. When using Spark, you can opt for Python or R. Using structured APIs like DataFrame and Dataset does not impact performance because the Spark code boils down to resilient distributed datasets (RDD) code, which does not require a Python or R interpreter. However, you should write user-defined functions (UDF) in Scala. You should also opt for Scala or Java to avoid serialization issues.
Serialization and Compression
When using custom data types in Spark applications, your should serialize them. The default option is Java serialization, but you can change it to Kryo to improve performance. Objects serialized using Java are typically slow and larger compared to Kryo. You can do this by changing the spark.serializer property to org.apache.spark.serializer.KryoSerializer.
Data storage determines the speed of retrieval. You can process data faster by implementing partition or bucketing strategies. Spark provides partition and bucketing when storing DataFrames. Partitioning creates files under directories according to a key field. However, the key field requires low cardinality, which means the field must have lesser possible values.
If the cardinality is too high, it creates too many partitions that can result in a bottleneck. Each partition should have a minimum of 128 MB of data. Spark processes only the given partition and ignores the rest when reading this data. You can implement bucketing for higher cardinality. It enables you to group and store related keys together.
Spark SQL shuffles 200 partitions by default, and so when you use a transformation that shuffles data, you’ll get 200 shuffle blocks as an output. If your dataset is large, and you use the default number of partitions, the shuffle blocks can get big. Spark does not allow a shuffle block size greater than 2 GB, so this might result in a runtime exception.
You can avoid these exceptions by increasing the number of shuffle partitions to ensure the block size is around 128 MB (the ideal size recommended in the Spark documentation).
Learn more in our detailed guides to:
- Spark performance
- Spark optimization techniques (coming soon)
- Spark performance tuning (coming soon)
- Spark best practices (coming soon)
5 Apache Spark Alternatives
1. Apache Hadoop
Apache Hadoop is a framework that enables distributed processing of large data sets on clusters of computers, using a simple programming model. The framework is designed to scale from a single server to thousands, each providing local compute and storage. Apache Hadoop has its own file distribution system called the Hadoop Distributed File System (HDFS).
2. Apache Storm
Apache Storm is an open source distributed real-time computing system. Developers primarily use it to process data streams in real time. Apache Storm has many use cases such as real-time analytics, online machine learning, continuous computing, distributed remote procedure calls (RPC), extract-transform-load (ETL) processes, and more.
Storm is integrated with popular databases. It is scalable and fault-tolerant, handles data in a simple manner, and is designed to be easy to set up and operate.
3. Apache Flink
Apache Flink is a distributed processing engine for stateful computations on streams of data—both bounded (with a defined start and end) and unbounded. It is designed to run on any cluster environment and perform computations at any scale or memory speed.
Flink can be used to develop and run many types of applications. Key features include streaming and batch processing support, advanced state management, event semantics, and exactly-once consistency guarantees.
Learn more in our detailed guide to Spark vs. Flink (coming soon)
4. Apache Kafka
Apache Kafka is a distributed publish/subscribe messaging system written in Scala and Java. It takes data from heterogeneous source systems and makes it available to target systems in real time. Kafka is commonly used to process real-time event streams from big data.
Like other message broker systems, Kafka enables asynchronous data exchange between processes, applications, and servers. However, unlike other messaging systems, Kafka does not track consumer behavior and does not delete read messages, so its overhead is very low. Instead, Kafka keeps all messages for a period of time, and consumers are responsible for tracking which messages have been read. This allows Kafka to process streams with very high throughput.
Learn more in our detailed guide to Spark vs. Kafka (coming soon)
Databricks is a commercial solution developed by the creators of Apache Spark, based on the lakehouse architecture (combining data lakes and data warehouses).
Databricks can run on-premises and in all major cloud platforms. It aims to simplify data processing for data scientists and engineers, and support development of big data and machine learning applications in R, Scala, Python, using SQL or SparkSQL. Databricks integrates with data visualization tools such as Power BI, Qlikview, and Tableau, and supports building predictive models with SparkML.
Optimizing Apache Spark With Granulate
Granulate optimizes Apache Spark on a number of levels. With Granulate, Spark executor dynamic allocation is optimized based on job patterns and predictive idle heuristics. It also autonomously and continuously optimizes JVM runtimes and tunes the Spark infrastructure itself. Granulate also has the ability to reduce costs on other execution engines, like Kafka, PySpark, Tez and MapReduce.
Learn more in our detailed guides to:
- Spark vs databricks (coming soon)
- Spark alternatives (coming soon)
See Our Additional Guides on Key Open Source Topics
Together with our content partners, we have authored in-depth guides on several other topics that can also be useful as you explore the world of open source.
- Hadoop vs. Spark: 5 Key Differences and Using Them Together
- Running Hadoop on AWS: the Basics and 5 Tips for Success
Authored by Mend