
Ultimate Guide to AWS EMR: Use Cases, How It Works, Pricing and More

Bar Yochai Shaya

Director of Solution Engineering & Technical Sales, Intel Granulate

What is AWS EMR?

Amazon EMR (formerly Amazon Elastic MapReduce) is a big data platform by Amazon Web Services (AWS). This low-configuration service provides an alternative to in-house cluster computing, enabling you to run big data processing and analyses in the AWS cloud. 

Based on Apache Hadoop and Apache Spark, EMR enables you to process massive volumes of unstructured data in parallel across stand-alone computers or a cluster of distributed processors. The Hadoop programming framework is Java-based and lets you process large data sets in distributed computing environments; it implements the MapReduce model that Google introduced in 2004 to index web pages.

AWS EMR processes data across a Hadoop cluster of virtual servers, running on Amazon Elastic Compute Cloud (EC2), Elastic Kubernetes Service (EKS), or Amazon Outposts (on-premises). It employs dynamic resizing to increase or reduce resource usage according to changing demands.

This is part of an extensive series of guides about IaaS.


Amazon EMR Features

EMR offers the following features:

Elastic Scalability

AWS EMR allows users to easily scale their processing capacity up or down based on their current needs. This is particularly useful for dealing with temporary, large-scale workloads which are typical in Hadoop data processing. This flexibility is achieved through integration with AWS services like EC2 and EKS, which allow EMR to dynamically add or remove resources as needed.
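As a concrete sketch of how this scaling is configured, EMR managed scaling takes a set of compute limits. The dict below mirrors the shape of the EMR `PutManagedScalingPolicy` API; the cluster ID and all unit counts are illustrative assumptions, and the actual boto3 call is shown as a comment so the snippet stays self-contained.

```python
# Hypothetical managed-scaling limits for an EMR cluster. The dict mirrors
# the shape of the EMR PutManagedScalingPolicy API; all values here are
# illustrative assumptions, not recommendations.
managed_scaling_policy = {
    "ComputeLimits": {
        "UnitType": "Instances",            # scale by instance count
        "MinimumCapacityUnits": 2,          # never shrink below 2 nodes
        "MaximumCapacityUnits": 10,         # cap growth at 10 nodes
        "MaximumOnDemandCapacityUnits": 4,  # remainder may come from Spot
        "MaximumCoreCapacityUnits": 4,
    }
}

# With boto3 installed and AWS credentials configured, you would apply it with:
#   import boto3
#   emr = boto3.client("emr")
#   emr.put_managed_scaling_policy(
#       ClusterId="j-XXXXXXXXXXXXX",        # hypothetical cluster ID
#       ManagedScalingPolicy=managed_scaling_policy,
#   )
```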

High Availability

High availability in AWS EMR is achieved through multi-availability zone deployment and data replication. EMR is designed to automatically replicate data across different AWS Availability Zones to prevent data loss in case of a hardware failure. The service can automatically restart failed tasks and reassign them to other nodes in the cluster, ensuring that your data processing jobs are fault tolerant.

Flexible Data Stores

Users can choose between Hadoop Distributed File System (HDFS), EMR File System (EMRFS), or local file systems to store datasets for processing. EMRFS extends the functionality of EMR by allowing direct integration with Amazon S3, providing a durable, scalable, and secure data storage solution. This flexibility ensures that users can optimize their storage strategy based on cost, performance, and data access patterns, while making EMR compatible with legacy Hadoop infrastructure.

Data Access Control

Data access in EMR is managed through integration with AWS Identity and Access Management (IAM). IAM allows users to define policies that grant or deny access to EMR clusters and the data within them. 

This includes specifying which users or services can create, modify, or delete EMR clusters, as well as who can submit jobs or access the data processed by those jobs. Such granular access control is crucial for maintaining data security and compliance.
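To make this concrete, here is a hedged sketch of an IAM policy document expressed as a Python dict (statement IDs and the exact action mix are illustrative assumptions): it allows a team to inspect clusters and submit job steps while explicitly denying cluster termination.

```python
import json

# Illustrative IAM policy granting read and job-submission access to EMR
# while denying cluster termination. Statement IDs and the action list are
# hypothetical; tailor them to your own access model.
emr_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReadAndSubmit",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListClusters",
                "elasticmapreduce:AddJobFlowSteps",
            ],
            "Resource": "*",
        },
        {
            "Sid": "DenyTerminate",
            "Effect": "Deny",
            "Action": "elasticmapreduce:TerminateJobFlows",
            "Resource": "*",
        },
    ],
}

print(json.dumps(emr_access_policy, indent=2))
```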

Amazon EMR Deployment Options

There are several ways to deploy EMR:

Amazon EMR on Amazon EC2

Deploying AWS EMR on Amazon EC2 provides the most control and flexibility over your EMR clusters. You can choose from a variety of EC2 instance types to optimize for your specific workload, whether it requires high CPU, memory, storage, or I/O capacity. This option also allows for custom configurations and tuning of clusters to meet specific performance or security requirements.
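As a minimal sketch of what launching such a cluster looks like, the parameters below mirror the shape of the EMR `RunJobFlow` API. The release label, instance types and counts, and the S3 log bucket are all hypothetical; the boto3 call itself is left as a comment so the snippet stays self-contained.

```python
# Hypothetical parameters for launching an EMR cluster on EC2, following the
# shape of the EMR RunJobFlow API. Instance types, counts, and the S3 log
# path are illustrative assumptions.
cluster_params = {
    "Name": "example-etl-cluster",
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Spark"}, {"Name": "Hadoop"}],
    "LogUri": "s3://example-bucket/emr-logs/",   # hypothetical bucket
    "Instances": {
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 3},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate when steps finish
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# With boto3 and credentials configured, you would launch with:
#   import boto3
#   response = boto3.client("emr").run_job_flow(**cluster_params)
```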

Amazon EMR on Amazon EKS

For users already invested in Kubernetes, AWS EMR on Amazon EKS offers a way to run EMR workloads in a managed Kubernetes environment. This option simplifies running big data frameworks on EKS, allowing you to leverage EMR’s data processing capabilities while using Kubernetes to manage the compute infrastructure.

Amazon EMR on AWS Outposts

AWS Outposts is an Amazon hardware device that can be deployed on-premises to run AWS services within an organization’s data center. AWS EMR on Outposts extends EMR’s capabilities to on-premises environments. This option is suitable for organizations that need to keep their data processing activities close to their data sources due to latency or regulatory requirements, and makes EMR suitable for hybrid cloud environments.

What Are the Main Use Cases of Amazon EMR?

Here are some of the main use cases for EMR:

Extract, Transform and Load (ETL)

Amazon EMR is highly effective for ETL tasks, which involve extracting data from various sources, transforming it into a structured format, and loading it into a data warehouse or database. EMR’s distributed computing power accelerates the processing of large data sets, making it suitable for daily, hourly, or real-time ETL jobs. 

EMR can efficiently process and transform raw data into valuable insights, enabling data analytics and business intelligence activities. The integration with AWS services like S3 for storage and Redshift for data warehousing enhances ETL capabilities.
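The extract-transform-load shape itself is simple; here is a deliberately tiny, pure-Python sketch of the three steps. On EMR, the same pattern would typically be expressed in Spark or Hive and run across the cluster; the sample data is made up.

```python
import csv
import io

# Extract: raw CSV as it might arrive from a source system (made-up data).
raw = "user_id,amount\n1,19.99\n2,5.00\n1,3.50\n"

# Transform: parse rows and aggregate spend per user.
totals = {}
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["user_id"]] = totals.get(row["user_id"], 0.0) + float(row["amount"])

# Load: emit structured records ready for a warehouse table.
records = [{"user_id": uid, "total_spend": round(total, 2)}
           for uid, total in sorted(totals.items())]
print(records)  # [{'user_id': '1', 'total_spend': 23.49}, {'user_id': '2', 'total_spend': 5.0}]
```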

Machine Learning

Machine learning projects benefit from Amazon EMR’s capacity to handle large datasets. EMR supports machine learning projects by providing the computational power needed to process and analyze vast amounts of data. 

Data scientists can use EMR to pre-process massive datasets, perform exploratory data analysis, and execute complex algorithms at scale. EMR’s integration with AWS services like Amazon S3 and DynamoDB allows for seamless data ingestion and storage.

Real-Time Streaming

For applications requiring real-time data processing, Amazon EMR supports streaming data analysis. This allows organizations to gain immediate insights from data as it’s generated, enabling real-time decision-making. 

EMR can be used with Apache Kafka and Apache Flink, among others, to process and analyze streaming data for use cases such as real-time analytics, event detection, and streaming ETL. This capability is useful for industries like finance, where real-time data processing can inform trading decisions, or in eCommerce, where instant analytics can improve customer experiences.
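The windowed aggregation that engines like Flink and Spark Structured Streaming perform can be sketched framework-free in a few lines. The event stream, event types, and the 60-second tumbling window below are made-up values; a real deployment would read from Kafka and let the engine manage state.

```python
from collections import Counter

# Toy event stream: (epoch_seconds, event_type) pairs. All values made up.
events = [(0, "click"), (12, "click"), (61, "purchase"), (65, "click"), (130, "click")]

WINDOW = 60  # one-minute tumbling windows

# Bucket each event into the window containing its timestamp.
counts = Counter()
for ts, kind in events:
    window_start = (ts // WINDOW) * WINDOW
    counts[(window_start, kind)] += 1

for (start, kind), n in sorted(counts.items()):
    print(f"window {start}-{start + WINDOW}s: {kind} x{n}")
```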

Clickstream Analysis

Clickstream analysis involves analyzing and interpreting data about the sequence of actions users take on a website or application. This is helpful for understanding user behavior, optimizing website navigation, and improving user engagement. 

By processing clickstream data with EMR, organizations can identify trends, detect anomalies, and personalize the user experience based on real-time user activity. EMR’s scalability ensures that websites with millions of daily visitors can analyze data quickly, providing insights into user journeys and contributing to strategic decision-making.
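A core step in clickstream work is sessionization: splitting a user's click timestamps into sessions wherever the gap between clicks exceeds a timeout. The sketch below does this in plain Python; the timestamps and the 1800-second (30-minute) timeout are illustrative, and at EMR scale the same logic would run as a distributed windowed job.

```python
# Minimal sessionization sketch: one user's click timestamps (in seconds),
# split into sessions whenever the gap exceeds a timeout. Timestamps and the
# 30-minute timeout are made-up values.
clicks = [100, 160, 200, 5000, 5060, 12000]
SESSION_TIMEOUT = 1800  # seconds

sessions = [[clicks[0]]]
for prev, cur in zip(clicks, clicks[1:]):
    if cur - prev > SESSION_TIMEOUT:
        sessions.append([cur])    # gap too large: start a new session
    else:
        sessions[-1].append(cur)  # same session continues

print(len(sessions))  # 3
```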

How Does Amazon EMR Work?

EMR’s service architecture consists of layers that provide different functionality, including:

Storage

EMR’s storage layer includes the cluster’s file systems. Here are the main storage options:

Hadoop Distributed File System (HDFS)

EMR employs this scalable, distributed file system to store input and output data across multiple instances in a cluster. HDFS can keep several copies of the data on different instances, so the data survives the failure of any single instance. Note that AWS reclaims this ephemeral storage when you terminate the cluster. Use HDFS to cache intermediate data during EMR processing or for workloads with significant random I/O.

EMR File System (EMRFS)

Amazon EMR leverages EMRFS to extend Hadoop and access data directly in S3. It lets you use S3 or HDFS as the file system in a cluster. Many users choose S3 to store output and input data and HDFS to store intermediate results.

Local File System

This file system is a locally-connected disk. When creating a Hadoop cluster, EMR creates all nodes from Amazon EC2 instances with instance stores – pre-attached, preconfigured blocks of disk storage. However, data on an instance store volume can persist only during the lifecycle of an attached EC2 instance.

Cluster Resource Management

This layer lets you manage cluster resources centrally and schedule data processing jobs. By default, EMR uses YARN (Yet Another Resource Negotiator), introduced in Hadoop 2.0, to centrally manage cluster resources for multiple data-processing frameworks. However, some frameworks and applications in EMR do not use YARN.

EMR places an agent on every node to administer YARN components and keep the cluster healthy. It also includes built-in functionality for scheduling YARN jobs so they do not fail when nodes running on Spot Instances are terminated: application master processes are allowed to run only on core nodes. Since the application master process controls your running jobs, it must stay alive.
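Cluster-level YARN settings can be tuned at launch through EMR's configuration classifications. Below is a hedged sketch of one such entry; the classification name follows EMR's configuration API, while the memory value is purely an illustrative assumption.

```python
# Hypothetical EMR configuration entry tuning a YARN property at cluster
# launch. The "yarn-site" classification is part of EMR's configuration API;
# the memory value below is an illustrative assumption, not a recommendation.
yarn_config = [
    {
        "Classification": "yarn-site",
        "Properties": {
            # Memory (MB) each NodeManager may allocate to containers.
            "yarn.nodemanager.resource.memory-mb": "12288",
        },
    }
]

# This list would be passed as the Configurations parameter when creating
# the cluster, e.g. run_job_flow(..., Configurations=yarn_config).
```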

Learn more in our detailed guide to AWS EMR cluster

Data Processing Frameworks

EMR uses this layer as an engine for data processing and analysis, allowing you to use various frameworks that run on YARN or bring their own resource management.

Hadoop MapReduce

This open source distributed computing framework lets you write parallel distributed applications. It abstracts away the distribution logic, leaving you to provide mainly the following functions:

  • Map functions – map data to intermediate results, sets of key-value pairs. 
  • Reduce functions – combine intermediate results, apply additional algorithms, and produce a final output. 

You can automatically generate map and reduce programs using various tools, such as Hive.
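The map and reduce functions described above can be illustrated with the classic word-count example. This pure-Python sketch mimics the three MapReduce phases on a single machine; on EMR, the framework would distribute the same logic across the cluster.

```python
from itertools import groupby
from operator import itemgetter

lines = ["the quick brown fox", "the lazy dog"]

# Map phase: emit (word, 1) key-value pairs for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group pairs by key, as the framework does between phases.
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the counts for each word to produce the final output.
counts = {word: sum(c for _, c in group)
          for word, group in groupby(mapped, key=itemgetter(0))}

print(counts["the"])  # 2
```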

Apache Spark

This open source framework helps process big data workloads. Unlike Hadoop MapReduce, Apache Spark uses directed acyclic graphs (DAGs) for execution planning and supports in-memory caching of datasets. Running Spark on Amazon EMR enables you to use EMRFS to access data directly in S3. Spark supports several interactive query modules, including Spark SQL.

Learn more in our detailed guide to AWS EMR architecture

Amazon EMR Studio

This integrated development environment (IDE) provides fully-managed Jupyter notebooks you can run on AWS EMR clusters. EMR Studio lets you develop, debug, and visualize Scala, R, PySpark, and Python applications. AWS allows you to use EMR Studio for free, applying charges only for S3 storage and EMR clusters.

The platform integrates with AWS IAM (Identity and Access Management) and IAM Identity Center, allowing users to log in with their existing company credentials. It lets you access and launch EMR clusters on demand to run your Jupyter notebook jobs and explore and save notebooks. 

Amazon EMR Studio features

You can use various languages to analyze data, including Python, Spark Scala, PySpark, Spark SQL, and SparkR, and install custom libraries and kernels. The platform enables you to collaborate in real time with users in the same workspace and link code repositories like Bitbucket and GitHub.

The platform also lets you use SQL Explorer to run SQL queries, browse data catalogs, and download results before working with the data in your notebook. You can use orchestration tools like Apache Airflow or Amazon Managed Workflows for Apache Airflow (MWAA) to run parameterized Jupyter notebooks as part of a scheduled workflow, and the Tez UI, YARN timeline server, or Spark History Server to track and debug jobs.


Amazon EMR Pricing

EMR pricing differs depending on the deployment model you select.

EMR Pricing on Amazon EC2

When using Amazon EMR, there is an additional charge for the EMR service on top of the EC2 instance costs. For example, looking at the general purpose, current generation instances in the US East (Ohio) region, the combined cost for using EMR on an m7a.xlarge EC2 instance amounts to $0.23184 for EC2 plus $0.05796 for EMR, totaling $0.2898 per hour on demand. 

The cost of EMR increases with instance size. For example, an m7a.2xlarge EC2 instance with EMR will cost $0.46368 for EC2 and $0.11592 for EMR, resulting in a combined hourly rate of $0.5796.
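The arithmetic in these examples is easy to reproduce; note that in both cases the EMR surcharge works out to 25% of the EC2 rate. A quick sketch using the rates quoted above:

```python
# Reproducing the on-demand arithmetic above (US East (Ohio) rates quoted in
# the text; in this example the EMR surcharge is 25% of the EC2 rate).
ec2_rate = 0.23184   # m7a.xlarge EC2 on-demand rate ($/hr)
emr_rate = 0.05796   # EMR surcharge for the same instance ($/hr)

total = round(ec2_rate + emr_rate, 5)
print(total)  # 0.2898

# Doubling the instance size (m7a.2xlarge) doubles both components:
print(round(2 * ec2_rate + 2 * emr_rate, 5))  # 0.5796
```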

These prices reflect on-demand rates, which provide flexibility without any upfront payment or long-term commitment. For cost optimization, users can consider Reserved Instances or Savings Plans for longer-term projects, or leverage Spot Instances for non-critical workloads, saving up to 90% compared to on-demand rates. 

It’s important to note that attached Amazon Elastic Block Store (EBS) volumes incur additional costs, and these are also billed per second, subject to a one-minute minimum.

EMR Pricing on Amazon EKS

Amazon EMR pricing on Amazon Elastic Kubernetes Service (EKS) is based on the resources consumed by your Kubernetes pods, specifically the virtual CPU (vCPU) and memory they use. Unlike the fixed pricing tiers for EC2 instances, costs for EMR on EKS are calculated dynamically, meaning you only pay for what you use, down to the second.

For example, in the US East (Ohio) region, the rate for EMR on EKS is charged at $0.01012 per vCPU hour and $0.00111125 per GB hour of memory used.
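Applying those rates to a job is straightforward. The sketch below estimates the EMR-on-EKS charge for a hypothetical job shape (4 vCPUs and 16 GB of memory for 2 hours; the shape is an illustrative assumption, the rates are those quoted above).

```python
# Estimating an EMR-on-EKS charge from the US East (Ohio) rates quoted in
# the text. The job shape (4 vCPUs, 16 GB, 2 hours) is a made-up example.
VCPU_RATE = 0.01012      # $ per vCPU-hour
MEM_RATE = 0.00111125    # $ per GB-hour

vcpus, mem_gb, hours = 4, 16, 2
cost = (vcpus * VCPU_RATE + mem_gb * MEM_RATE) * hours
print(round(cost, 4))  # 0.1165
```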

EMR Pricing on AWS Outposts

The pricing for Amazon EMR deployed on AWS Outposts is consistent with cloud-based instances of EMR.

For example, if an organization opts for general-purpose EC2 instances for their EMR clusters on Outposts, they can choose from various configurations such as the m5.12xlarge or the m6i.24xlarge, which in a cloud-based setting would cost $9.10074 per hour and $18.8669 per hour with no upfront payment, respectively. 

For compute-optimized instances like the c5.24xlarge, the rate would be $4.08576 per hour with no upfront payment.

Memory-optimized instances are also available for EMR on Outposts. For example, an r5.24xlarge instance would be priced at $5.42178 per hour with no upfront payment. Specialized configurations for accelerated computing tasks, such as the g4dn.12xlarge, are priced at $3.14392 per hour with no upfront payment.

5 AWS EMR Best Practices

1. Compress Mapper Outputs in Memory

EMR lets you compress the output of the map function in memory to help large jobs complete quickly. Compressing mapper output in memory prevents it from being written to disk, which is especially important when EMR needs to map a large amount of data. Go to the core node properties to enable this option.

2. Cluster in VPC

You can use the EC2-VPC platform to launch and manage AWS EMR clusters instead of EC2-Classic (which AWS has since retired). Here are the benefits of running clusters in a VPC: 

  • Improved networking infrastructure – access to features that enable network isolation, private IP addresses, and private subnets. 
  • Flexible control over access security – EC2-VPC lets you use network access control lists (ACLs), security group outbound/egress traffic filtering, and other features you can use to protect sensitive data in EMR clusters.
  • Newer EC2 instance types – you gain access to various EC2 instance types, such as C4, R4, and M4, when using EC2-VPC for your clusters. 

3. EMR Cluster Logging

EMR automatically deletes log files from clusters by default at the end of the retention period. However, you can enable a feature that uploads log files from your cluster’s master instances to S3 to save the logging data for troubleshooting or compliance purposes. You can save step logs, instance state logs, and Hadoop logs. Note that this feature archives and sends EMR log files to S3 at 5-minute intervals.

4. EMR In-Transit and At-Rest Encryption

Encryption helps protect data from being used if intercepted by threat actors. You should implement encryption in transit and at rest when working with production data. Encryption helps protect data from unauthorized access and satisfy compliance requirements. It is especially important when handling sensitive data, such as personally identifiable information (PII).

5. EMR Instances Count

You should set limits on the maximum number of provisioned EMR cluster instances in an AWS account. This helps you quickly mitigate attacks, better manage EMR compute resources, and avoid unexpected AWS charges. Without limits, users in your organization can exceed the monthly cloud computing budget, and threat actors can create many EMR resources in your account, causing you to accrue significant AWS charges.

Optimizing Amazon EMR With Intel® Tiber™ App-Level Optimization

App-Level Optimization excels at operating on Amazon EMR when processing large data sets. Intel Tiber App-Level Optimization improves YARN on EMR by optimizing resource allocation autonomously and continuously, so that data engineering teams don't need to repeatedly monitor and tune the workload manually. App-Level Optimization also optimizes the JVM runtime for EMR workloads.

See Additional Guides on Key IaaS Topics

Together with our content partners, we have authored in-depth guides on several other topics that can also be useful as you explore the world of IaaS.

Cloud Cost Optimization

Authored by Intel Tiber App-Level Optimization

AWS Costs

Authored by NetApp

Multi Tenant Architecture

Authored by Frontegg
