AWS EMR Architecture and 3 Common Usage Patterns

What Is AWS EMR?

Amazon Elastic MapReduce (EMR) is a fully-managed big data processing service offered by Amazon Web Services (AWS). EMR allows users to easily process large amounts of data using popular open-source data-processing frameworks like Apache Spark, Hadoop, and Hive.

EMR provides a scalable and flexible infrastructure to process and analyze big data in a cost-effective way. It is useful for processing large volumes of data for a variety of use cases, such as log analysis, data warehousing, real-time streaming data processing, and machine learning.

EMR supports multiple data sources, including Amazon DynamoDB, S3, and RDS, making it easier to process data from different sources. EMR also provides a range of management and monitoring tools that make it easy to configure, manage, and troubleshoot EMR clusters.

In this article:

Amazon EMR Architecture
AWS EMR: Common Use Cases and Architecture Patterns
AWS EMR Architecture with Intel Tiber App-Level Optimization

Amazon EMR Architecture

EMR is built on a distributed computing architecture, with several layers that work together to provide a reliable and efficient platform for processing large amounts of data.

Storage Layer

The storage layer in EMR is responsible for storing the data that is processed by the cluster. EMR supports three types of storage: Hadoop Distributed File System (HDFS), EMR File System (EMRFS), and local file systems.

Hadoop Distributed File System (HDFS)

HDFS provides high-throughput access to application data. HDFS is a popular storage layer for Big Data processing because it allows data to be distributed across the cluster, enabling parallel processing of data. In EMR, HDFS is used as the default storage layer and is pre-configured in all EMR clusters. Data stored in HDFS is distributed across the nodes in the cluster, and the HDFS NameNode and DataNodes are responsible for managing the file system.

EMR File System (EMRFS)

EMRFS is a custom file system built by AWS that allows an EMR cluster to access data stored in the Amazon Simple Storage Service (S3). EMRFS provides a consistent view of the data across the cluster, even when the data is updated outside of the EMR cluster. EMRFS also supports features such as encryption, versioning, and access control. EMRFS can be used in conjunction with HDFS to provide a scalable and flexible storage layer for EMR clusters.

Local File System

The local file system in EMR is used for temporary storage and scratch space on the individual nodes in the cluster. Local storage is used for storing intermediate data and other temporary files generated during processing. Local storage is not persistent and is deleted when the node is terminated.

Resource Management Layer

The cluster resource management layer in EMR is responsible for managing the resources of the cluster, including CPU, memory, and network bandwidth. The resource manager is responsible for allocating resources to running applications and ensuring that the cluster is utilized efficiently.

YARN

In EMR, the resource manager used is Yet Another Resource Negotiator (YARN). YARN is a cluster management technology that is responsible for resource allocation and job scheduling. YARN is responsible for managing the resources of the cluster and scheduling applications to run on the cluster. YARN is designed to support multiple data processing frameworks, including Hadoop MapReduce and Apache Spark.

Data Processing Framework Layer

The data processing framework layer in EMR provides support for popular open-source data processing frameworks, such as Hadoop MapReduce and Apache Spark. These frameworks allow users to process and analyze large amounts of data in parallel across the cluster.

Hadoop MapReduce

The data processing layer in EMR provides support for popular open-source data processing frameworks, such as Hadoop MapReduce and Apache Spark. These frameworks allow users to process and analyze large amounts of data in parallel across the cluster.

Hadoop MapReduce

Hadoop MapReduce is a distributed data processing framework that provides a programming model for processing large datasets in parallel. MapReduce is designed to work with large datasets and can scale to thousands of nodes in a cluster. In EMR, Hadoop MapReduce is used for batch processing of large datasets.

Apache Spark

Apache Spark is a fast and general-purpose data processing framework that can be used for batch processing, real-time stream processing, and machine learning. Spark is designed to be faster and more flexible than Hadoop MapReduce, and can be used with different data sources, including HDFS (Hadoop), S3 (AWS), and Cassandra (Apache). In EMR, Apache Spark can serve a wide range of use cases, such as data processing, machine learning, and real-time analytics.

AWS EMR: Common Use Cases and Architecture Patterns

Here are some common use cases for EMR.

Batch ETL

Batch Extract, Transform, Load (ETL) workloads involve extracting data from diverse sources, transforming it into a format that can be consumed by business logic, and then loading it into a target data store for consumption. Batch ETL workloads are commonly used in data warehousing, where large volumes of data need to be processed and analyzed.

The architecture for a batch ETL workload typically involves the following steps:

Data ingestion: Data is extracted from various sources, such as relational databases, log files, or external APIs, and ingested into the ETL pipeline.
Integration of a transient EMR Cluster: A transient EMR cluster is spun up to process the data. EMR provides a scalable and flexible infrastructure to process and analyze large amounts of data using popular open-source data processing frameworks such as Hadoop, Spark, and Hive.
Business logic: Once the data is processed, business logic is applied to it. This can involve data transformations, data enrichment, and data validation.
External HIVE metastore: EMR supports integration with external HIVE metastores. A HIVE metastore is a database that stores metadata information about the data stored in Hadoop, allowing for efficient querying and processing of large datasets.
Consumption layer: Once the data has been processed and transformed, it is loaded into a target data store for consumption. This can be a data warehouse, a reporting database, or a BI tool.

Clickstream Analytics

Clickstream analytics refer to the process of analyzing user interactions with a website or application. Clickstream data provides valuable insights into user behavior, preferences, and patterns, which can be used to optimize user experience, improve conversions, and drive business growth.

To build a clickstream analytics application using AWS services and EMR, the following steps can be taken:

Data collection: Clickstream data can be collected using various tools, such as AWS CloudFront, AWS WAF, or a custom logging solution.
Data ingestion: The collected clickstream data can be ingested into an S3 bucket, which can then be used as a source for EMR.
EMR cluster creation: A transient EMR cluster can be created using the AWS EMR service, with Apache Spark or Hadoop as the processing engine.
Data processing: Once the EMR cluster is created, Spark or Hadoop can be used to process the clickstream data, transform it into a structured format, and store it in a data warehouse or database.
Analytics: The transformed clickstream data can be used for a variety of analytics use cases, such as generating reports, creating dashboards, or building machine learning models.
Visualization: The analytics output can be visualized using various tools, such as Amazon QuickSight, Tableau, or Power BI.

Analyzing Real-Time Streaming Data

Real-time streaming analytics refer to the process of analyzing and processing data in real-time as it is generated. Streaming analytics are commonly used in use cases such as fraud detection, real-time monitoring, and IoT data processing.

To integrate AWS IoT with EMR for real-time streaming analytics, the following steps can be taken:

AWS IoT: AWS IoT can be used to capture IoT events and send them to Kinesis Data Streams (KDS), a managed real-time data streaming service offered by AWS.
Kinesis Data Streams: KDS allows you to ingest, process, and analyze streaming data with various AWS services. The IoT events can be streamed into KDS.
EMR cluster creation: A transient cluster is created.
Data processing: Spark or Hadoop process and aggregate the IoT event data from KDS in real-time.
Data storage: The processed and aggregated IoT event data can be stored in S3 or Redshift for further analysis.

Analytics: The stored IoT event data can be used for various analytics use cases, such as generating reports, creating dashboards, or building machine learning models.

AWS EMR Architecture with Intel Tiber App-Level Optimization

Intel Tiber App-Level Optimization excels at operating on Amazon EMR when processing large data sets. Intel Tiber App-Level Optimization optimizes resource allocation on YARN on EMR autonomously and continuously, so that your data engineering team doesn’t need to repeatedly manually monitor and tune the workload. Intel Tiber App-Level Optimization also optimizes JVM runtime on EMR workloads.