Spark Streaming (Spark Structured Streaming): the Basics and a Quick Tutorial
What Is Apache Spark Streaming (Spark Structured Streaming)?
Apache Spark Streaming is a real-time data processing framework that enables developers to process streaming data in near real-time. The original Spark Streaming engine, now considered legacy, works by dividing continuous data streams into small batches (DStreams) and processing them using batch processing techniques.
However, the legacy Spark Streaming engine has some limitations, such as weaker fault-tolerance guarantees, a more limited API, and support for fewer data sources. It also no longer receives updates.
Spark Structured Streaming is a newer and more powerful streaming engine that provides a declarative API and offers end-to-end fault tolerance guarantees. It leverages the power of Spark’s DataFrame API and can handle both streaming and batch data using the same programming model. Additionally, Structured Streaming offers a wide range of data sources, including Kafka, Azure Event Hubs, and more.
In this article:
- Spark Structured Streaming Use Cases
- Spark Structured Streaming Operating Model
- Spark Structured Streaming: Quick Tutorial
Spark Structured Streaming Use Cases
Spark Structured Streaming is a powerful tool for processing real-time data streams, and it has a wide range of use cases in various industries. Here are some examples of Spark Structured Streaming use cases:
- Fraud detection: Spark Structured Streaming can be used to detect fraud in real-time data streams, such as credit card transactions or insurance claims. It can identify patterns and anomalies in the data and trigger alerts or automated actions when suspicious activity is detected.
- Real-time analytics: Spark Structured Streaming can be used for real-time analytics and business intelligence applications, such as monitoring website traffic, analyzing customer behavior, or tracking supply chain operations. It provides real-time insights that can help organizations make faster and more informed decisions.
- IoT analytics: Spark Structured Streaming can be used to process and analyze real-time data from IoT sensors and devices. It can help organizations monitor and optimize industrial processes, smart home automation, and predictive maintenance.
- Video and audio streaming: Spark Structured Streaming can be used to process real-time video and audio streams, such as security camera footage, video conferencing, or live streaming events. It can provide real-time analysis and insights that can be used for surveillance, security, or audience engagement.
Spark Structured Streaming Operating Model
Spark Structured Streaming handles live data streams by dividing them into micro-batches and processing them as if they were a batch query on a static table. The resulting output is continuously updated as new data arrives, providing real-time insights on the data. This streaming data processing model is similar to batch processing.
Handling Input and Output Data
Here are some of the basic concepts in Structured Streaming:
In Spark Structured Streaming, the input data stream is treated as an unbounded table that can be queried using Spark’s DataFrame API. Each micro-batch of data is treated as a new “chunk” of rows in the unbounded table, and the query engine can generate a result table by applying operations to the unbounded table, just like a regular batch query. The result table is continuously updated as new data arrives, providing a real-time view of the streaming data.
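The per-batch logic Spark applies to the unbounded table is the same logic you would write for a static collection. As a rough illustration, here is the word-counting operation from the tutorial below expressed in plain Scala on an in-memory sequence (the sample lines are made up for this sketch):

```scala
// Plain-Scala sketch of the logic Spark applies to each micro-batch:
// split lines into words, group identical words, and count them.
val batchLines = Seq("apache spark", "spark streaming")

val counts: Map[String, Int] = batchLines
  .flatMap(_.split(" "))                                  // one element per word
  .groupBy(identity)                                      // group identical words
  .map { case (word, occurrences) => word -> occurrences.size }

// counts == Map("apache" -> 1, "spark" -> 2, "streaming" -> 1)
```

In Structured Streaming, the same groupBy-and-count is applied incrementally, with the result table updated as each new chunk of rows arrives.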
In Structured Streaming, the output is defined by specifying a mode for the query. There are three output modes available:
- Complete mode: In this mode, the output table contains the complete set of results for all input data processed so far. On every trigger, the entire updated result table is written to the output sink. This mode is useful when you need a complete snapshot of the results at a given point in time.
- Update mode: In this mode, the output table contains only the rows that were updated since the last trigger. This mode is useful when you want to track how results change over time. Update mode requires an output sink that supports row-level updates.
- Append mode: In this mode, the output table contains only the new rows that have been added since the last trigger. This mode is useful when you want to continuously append new data to an existing output, and it applies to queries whose existing result rows never change. Append mode requires an output sink that supports appending new data without modifying existing data.
The choice of mode depends on the use case and the capabilities of the output sink. Some sinks, such as databases or file systems, may support only one mode, while others may support multiple modes.
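As a sketch of how the mode is selected in code (the DataFrame names `counts` and `events` are hypothetical, and the console sink is used purely for illustration):

```scala
// Hypothetical streaming DataFrames: `counts` (an aggregation) and
// `events` (no aggregation); the console sink is for illustration only.

// Complete mode: the full result table is written on every trigger.
val completeQuery = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Update mode: only rows changed since the last trigger are written.
val updateQuery = counts.writeStream
  .outputMode("update")
  .format("console")
  .start()

// Append mode: only newly added rows are written; suited to queries
// whose earlier results never change.
val appendQuery = events.writeStream
  .outputMode("append")
  .format("console")
  .start()
```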
Handling Late and Event-Time Data
Event-time data is a concept in stream processing that refers to the time when an event actually occurred. It is usually different from the processing time, which is the time when the system receives and processes the event. Event-time data is important in many use cases, including IoT device-generated events, where the timing of events is critical.
Late data is the data that arrives after the time window for a particular batch of data has closed. It can occur due to network delays, system failures, or other factors. Late data can cause issues in processing if not handled correctly, as it can result in incorrect results and data loss.
Using an event-time column to track IoT device-generated events allows the system to accurately process events based on when they actually occurred, rather than when they were received by the system.
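In Structured Streaming this is expressed with an event-time column plus a watermark, which bounds how long the engine waits for late data. A minimal sketch, assuming a hypothetical streaming DataFrame `deviceEvents` with an `eventTime` timestamp column and a `deviceId` column:

```scala
import org.apache.spark.sql.functions.{col, window}

// Sketch: event-time windowing with a watermark.
val windowedCounts = deviceEvents
  // Wait up to 10 minutes for late events before finalizing a window;
  // events arriving later than the watermark are dropped.
  .withWatermark("eventTime", "10 minutes")
  // Count events per device in 5-minute event-time windows.
  .groupBy(window(col("eventTime"), "5 minutes"), col("deviceId"))
  .count()
```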
Fault Tolerance Semantics
Fault tolerance semantics refers to the guarantees that a streaming system provides to ensure that data is processed correctly and consistently in the presence of failures, such as network failures, node failures, or software bugs.
Idempotent streaming sinks are a feature of fault-tolerant streaming systems that ensure data is written to the output sink exactly once, even in the event of failures. An idempotent sink can be called multiple times with the same data without causing duplicates in the output. Combined with replayable sources and checkpointing, this provides end-to-end exactly-once semantics, giving a high degree of reliability and consistency in streaming systems.
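In Spark Structured Streaming, recovery depends on checkpointing the query's progress. A minimal sketch, assuming a streaming DataFrame named `wordcounts` and an illustrative checkpoint path:

```scala
// Sketch: enabling recovery via checkpointing. The path is illustrative;
// in production it should point to reliable storage such as HDFS or S3.
val query = wordcounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/wordcount")
  .start()
// Restarting the query with the same checkpoint location resumes from the
// recorded offsets; combined with an idempotent sink, replayed writes are safe.
```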
Spark Structured Streaming: Quick Tutorial
For this example, suppose you want to maintain a running wordcount of text data received over a TCP socket. The following demonstrates how you might express this with Spark Structured Streaming.
Step 1: Importing
Start by importing the required classes and creating a SparkSession in your local environment, which will serve as the base for all Spark-related functionalities:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._
Step 2: Creating the Data Frame
Now you can create a streaming DataFrame to represent the text data sent by the listening server. In this example, the server uses localhost:9999. You can then transform the DataFrame and enable it to count words:
// Create DataFrame for the input text stream from the localhost:9999 connection
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
// Divide the text into words
val words = lines.as[String].flatMap(_.split(" "))
// Generate the running wordcount
val wordcounts = words.groupBy("value").count()
The “input stream” DataFrame will represent the unbounded table that contains the streaming data. The table includes a column of strings called “value”, with each line of text being represented by a row in the table. At this stage, the system is not yet receiving data, because it hasn’t been started – this is the necessary transformation setup.
This configuration converts the DataFrame to a String dataset – the .as[String] element – allowing you to apply a flatMap operation to divide the lines of text into individual words. The resulting “words” dataset contains every word as a separate data point. This setup also defines the “wordcounts” DataFrame – the system groups the individual values in the dataset (the words) and counts them. The streaming DataFrame represents the stream’s running wordcount.
Step 3: Implementing the Wordcount
Once you've set up the streaming data query, the model can start processing data and computing the running wordcount. Enable it to print the complete set of wordcounts to the console. You specify this using outputMode("complete") and initiate the streaming computation with the start() function:
// Initiate a query to print the running wordcounts to the console
val query = wordcounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
Once the code is executed, the system should have initiated the computation in the background. The query object provides a handle for this streaming query – in this example, we chose to wait for the query to terminate using the awaitTermination() function. The reason for this is to prevent the process from finishing while the query remains active.
Step 4: Executing the Code
You can execute this code example by compiling it into a Spark application, or by downloading Spark and running the code directly. This example uses the latter option, which requires netcat running as the data server. Netcat (nc) is a small utility available on most Unix-like systems. Use the following:
$ nc -lk 9999
Next, in a separate terminal, change into the directory containing the example code and type sbt to compile the code. At the sbt prompt, type run to execute it.
To try the sample application, start typing random words in the terminal running the netcat server. Once you have finished entering text, switch to the other terminal and press Ctrl+C to stop processing. You will be presented with a summary table of the wordcounts.
The system should count and display the text entered in the terminal running the netcat server, updating the wordcount as each new batch of text arrives.
Spark Streaming with Granulate
Granulate optimizes Apache Spark on a number of levels. With Granulate, Spark executor dynamic allocation is optimized based on job patterns and predictive idle heuristics. It also autonomously and continuously optimizes JVM runtimes and tunes the Spark infrastructure itself.