What Is AWS EMR and 5 Critical Best Practices
Amazon EMR (formerly Amazon Elastic MapReduce) is a big data platform by Amazon Web Services (AWS). This low-configuration service provides an alternative to in-house cluster computing, enabling you to run big data processing and analyses in the AWS cloud.
Based on Apache Hadoop, EMR enables you to process massive volumes of unstructured data in parallel across stand-alone computers or a cluster of distributed processors. The Hadoop programming framework is Java-based and lets you process large data sets in distributed computing environments. It grew out of the MapReduce programming model that Google published in 2004 and was later developed as an open source Apache project.
AWS EMR processes data across a Hadoop cluster of virtual servers that run on Amazon Elastic Compute Cloud (EC2), with data stored in Amazon Simple Storage Service (S3). It employs dynamic resizing to increase or reduce resource usage according to changing demands.
In this article:
- Amazon EMR Architecture
- AWS EMR Best Practices
Amazon EMR Architecture
EMR’s service architecture consists of layers that each provide different functionality: storage, cluster resource management, and data processing frameworks.
Storage
EMR’s storage layer includes the cluster’s file systems. Here are the main storage options:
Hadoop Distributed File System (HDFS)
EMR employs this scalable, distributed file system to store input and output data across multiple instances in a cluster. HDFS can store several copies of the data on different instances, keeping the data safe even if one instance fails. Note that HDFS is ephemeral storage, which AWS reclaims when you terminate the cluster. You should use HDFS to cache intermediate data during EMR processing or for workloads with significant random I/O.
EMR File System (EMRFS)
Amazon EMR leverages EMRFS to extend Hadoop and access data directly in S3. It lets you use S3 or HDFS as the file system in a cluster. Many users choose S3 to store output and input data and HDFS to store intermediate results.
Local File System
This file system is a locally connected disk. When creating a Hadoop cluster, EMR creates all nodes from Amazon EC2 instances with instance stores – pre-attached, preconfigured blocks of disk storage. However, data on an instance store volume persists only for the lifetime of the EC2 instance it is attached to.
Cluster Resource Management
This layer lets you manage cluster resources centrally and schedule data processing jobs. By default, EMR uses YARN (Yet Another Resource Negotiator), which was introduced in Hadoop 2.0 to enable central management of cluster resources for several data processing frameworks. However, some frameworks and applications in EMR do not use YARN.
EMR places an agent on every node. The agent administers YARN components and keeps the cluster healthy. EMR also includes default functionality for scheduling YARN jobs so that jobs do not fail when nodes that run on Spot instances are terminated. It works by allowing application master processes to run exclusively on core nodes. Since the application master process controls your running jobs, it must stay alive.
Data Processing Frameworks
EMR uses this layer as an engine for data processing and analysis, allowing you to use various frameworks that run on YARN or have their own resource management.
Hadoop MapReduce is an open source distributed computing framework that lets you write parallel distributed applications. It abstracts away the distribution logic, so you mainly provide the following functions:
- Map functions – map data to intermediate results, sets of key-value pairs.
- Reduce functions – combine intermediate results, apply additional algorithms, and produce a final output.
You can automatically generate map and reduce programs using various tools, such as Hive.
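The split between map and reduce functions can be illustrated with a minimal, single-process sketch in plain Python. It is not Hadoop code: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and the grouping step stands in for the shuffle that the framework would normally perform between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit one (word, 1) key-value pair per word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key (done by the framework in real MapReduce)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all intermediate values for one key."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    intermediate = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(intermediate)
    return dict(reduce_counts(key, values) for key, values in grouped.items())
```

On a real cluster, the map and reduce calls would run in parallel on different nodes; here they run sequentially so the data flow is easy to follow.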
Apache Spark is an open source framework that helps process big data workloads. Unlike Hadoop MapReduce, Spark uses directed acyclic graphs for execution plans and in-memory caching for your datasets. Running Spark on Amazon EMR enables you to use EMRFS to access data directly in S3. Spark supports several interactive query modules, including SparkSQL.
Amazon EMR Studio
This integrated development environment (IDE) provides fully-managed Jupyter notebooks you can run on AWS EMR clusters. EMR Studio lets you develop, debug, and visualize Scala, R, PySpark, and Python applications. AWS allows you to use EMR Studio for free, applying charges only for S3 storage and EMR clusters.
The platform integrates with AWS IAM (Identity and Access Management) and IAM Identity Center, allowing users to log in with their existing company credentials. It lets you access and launch EMR clusters on demand to run your Jupyter notebook jobs and explore and save notebooks.
Amazon EMR Studio features
You can use various languages to analyze data, including Python, Spark Scala, PySpark, SparkSQL, and Spark R, and install custom libraries and kernels. The platform enables you to collaborate in real time with users in the same workspace and link code repositories like Bitbucket and GitHub.
The platform allows you to use SQL Explorer to run various SQL queries, browse data catalogs, and download the results before working with the data in your notebook. You can use orchestration tools like Apache Airflow and Amazon Managed Workflows for Apache Airflow to run parameterized Jupyter notebooks as part of a scheduled workflow, and use the Tez UI, YARN timeline, or Spark History server to track and debug jobs.
5 AWS EMR Best Practices
1. Compress Mapper Outputs in Memory
EMR lets you compress the map function’s output to help large jobs complete quickly. Compressing the mappers’ output in memory can help prevent the output from being written to disk. This matters most when EMR needs to map a large amount of data. Go to the core node properties to enable this option.
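One way to express this setting is as an EMR configuration classification supplied when the cluster is created. The sketch below is an assumption about how you might apply it, not the only route: it builds a `mapred-site` classification using the standard Hadoop properties `mapreduce.map.output.compress` and `mapreduce.map.output.compress.codec`. The helper name `map_output_compression_config` is hypothetical, and the Snappy codec is an illustrative default.

```python
import json
from typing import Any, Dict, List


def map_output_compression_config(
    codec: str = "org.apache.hadoop.io.compress.SnappyCodec",
) -> List[Dict[str, Any]]:
    """Build a configuration list that turns on map output compression.

    The codec default is an illustrative choice; any codec class installed
    on the cluster could be passed instead.
    """
    return [
        {
            "Classification": "mapred-site",
            "Properties": {
                # Hadoop expects property values as strings.
                "mapreduce.map.output.compress": "true",
                "mapreduce.map.output.compress.codec": codec,
            },
        }
    ]


if __name__ == "__main__":
    # Print the JSON you could supply as cluster configurations.
    print(json.dumps(map_output_compression_config(), indent=2))
```

The resulting list has the shape accepted by the `Configurations` parameter when creating a cluster through the API or CLI.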
2. Cluster in VPC
You can use the EC2-VPC platform to launch and manage AWS EMR clusters instead of EC2-Classic. Here are the benefits of clustering in VPC:
- Improved networking infrastructure – access to features that enable network isolation, private IP addresses, and private subnets.
- Flexible control over access security – EC2-VPC lets you use network access control lists (ACLs), security group outbound/egress traffic filtering, and other features you can use to protect sensitive data in EMR clusters.
- Newer EC2 instance types – you gain access to various EC2 instance types, such as C4, R4, and M4, when using EC2-VPC for your clusters.
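As a rough sketch of what launching into a VPC looks like in practice, the helper below builds the `Instances` section of an EMR RunJobFlow request with `Ec2SubnetId` set. The function name `vpc_instances_config`, the instance type, and the node counts are assumptions chosen for illustration. The sketch only constructs the request data and does not call AWS.

```python
from typing import Any, Dict


def vpc_instances_config(subnet_id: str, core_count: int = 2) -> Dict[str, Any]:
    """Build an Instances section that places the cluster in a VPC subnet."""
    if not subnet_id.startswith("subnet-"):
        raise ValueError(f"expected a VPC subnet ID, got {subnet_id!r}")
    if core_count < 1:
        raise ValueError("core_count must be at least 1")
    return {
        # Setting a subnet is what places the cluster inside the VPC.
        "Ec2SubnetId": subnet_id,
        "KeepJobFlowAliveWhenNoSteps": False,
        "InstanceGroups": [
            {
                "Name": "Primary",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",  # illustrative instance type
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",  # illustrative instance type
                "InstanceCount": core_count,
            },
        ],
    }
```

You would pass the returned dictionary as the `Instances` argument of a cluster creation request, alongside the cluster name, release label, and IAM roles.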
3. EMR Cluster Logging
EMR automatically deletes log files from clusters by default at the end of the retention period. However, you can enable a feature that uploads log files from your cluster’s master instances to S3 to save the logging data for troubleshooting or compliance purposes. You can save step logs, instance state logs, and Hadoop logs. Note that this feature archives and sends EMR log files to S3 at 5-minute intervals.
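Log archiving to S3 is typically switched on by giving the cluster a log location at creation time. The following sketch assumes a RunJobFlow-style request dictionary and adds a `LogUri` to it. The helper name `with_s3_logging`, the bucket name, and the default prefix are hypothetical.

```python
from typing import Any, Dict


def with_s3_logging(
    request: Dict[str, Any], bucket: str, prefix: str = "emr-logs"
) -> Dict[str, Any]:
    """Return a copy of a cluster request with its S3 log location set."""
    if not bucket:
        raise ValueError("bucket must not be empty")
    updated = dict(request)  # shallow copy so the caller's dict is untouched
    updated["LogUri"] = f"s3://{bucket}/{prefix.strip('/')}/"
    return updated


if __name__ == "__main__":
    base = {"Name": "example-cluster"}
    print(with_s3_logging(base, "my-log-bucket"))
```

Keeping the log location in one helper makes it easier to enforce a consistent bucket and prefix across every cluster your team launches.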
4. EMR In-Transit and At-Rest Encryption
Encryption prevents threat actors from using data they intercept or access without authorization, and it helps satisfy compliance requirements. You should implement encryption both in transit and at rest when working with production data. It is especially important when handling sensitive data, such as personally identifiable information (PII).
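In EMR, encryption settings are usually grouped into a security configuration, which is a JSON document attached to the cluster. The sketch below builds one that enables both in-transit and at-rest encryption. The field names follow the security configuration format as commonly documented, so verify them against the current AWS documentation before use. The helper name, the KMS key ARN, and the certificate location are placeholders.

```python
import json
from typing import Any, Dict


def encryption_security_config(
    kms_key_arn: str, tls_cert_s3_uri: str
) -> Dict[str, Any]:
    """Build a security configuration with in-transit and at-rest encryption."""
    return {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": True,
            "EnableAtRestEncryption": True,
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    "S3Object": tls_cert_s3_uri,  # zipped PEM certificates
                }
            },
            "AtRestEncryptionConfiguration": {
                # SSE-S3 is an illustrative mode for data stored in S3.
                "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                },
            },
        }
    }


if __name__ == "__main__":
    config = encryption_security_config(
        "arn:aws:kms:us-east-1:111122223333:key/example",  # placeholder ARN
        "s3://my-bucket/certs/my-certs.zip",  # placeholder location
    )
    print(json.dumps(config, indent=2))
```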
5. EMR Instances Count
You should set limits for the maximum number of provisioned EMR cluster instances in an AWS account. Doing so can help you quickly mitigate attacks, better manage EMR compute resources, and avoid unexpected AWS charges. Without a limit, users in your organization can exceed the monthly cloud computing budget, and threat actors who gain access can create many EMR resources in your account, leaving you with significant AWS charges.
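A simple way to picture such a limit is a guard that runs before any new cluster request is submitted. The sketch below is a hypothetical pre-launch check, not an AWS feature: `check_instance_budget` and `InstanceLimitExceeded` are made-up names, and the per-cluster counts are assumed to come from your own inventory or monitoring.

```python
from typing import Iterable


class InstanceLimitExceeded(Exception):
    """Raised when a request would push the account past its instance cap."""


def check_instance_budget(
    running_counts: Iterable[int], requested: int, max_instances: int
) -> int:
    """Return the headroom left after the request, or raise if over the cap.

    running_counts holds the instance count of each cluster already running.
    """
    if requested <= 0:
        raise ValueError("requested must be positive")
    total = sum(running_counts) + requested
    if total > max_instances:
        raise InstanceLimitExceeded(
            f"request would bring the account to {total} instances, "
            f"above the limit of {max_instances}"
        )
    return max_instances - total
```

For example, with clusters of 3 and 5 instances running and a cap of 20, a request for 2 more instances passes and leaves headroom of 10, while a request that would exceed the cap raises an error before anything is provisioned.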
Optimizing Amazon EMR With Granulate
Granulate is designed to optimize Amazon EMR workloads that process large data sets. It optimizes YARN on EMR by adjusting resource allocation autonomously and continuously, so data engineering teams don’t need to repeatedly monitor and tune the workload by hand. Granulate also optimizes the JVM runtime for EMR workloads.