AWS EMR Tutorial: Configuring & Managing Your First Cluster

What Is AWS EMR?

Amazon EMR (Amazon Elastic MapReduce) is a managed platform for cluster-based workloads. It enables you to run a big data framework, like Apache Spark or Apache Hadoop, on the AWS cloud to process and analyze massive amounts of data.

Organizations use AWS EMR to process big data for business intelligence (BI) and analytics use cases. It also enables organizations to transform data and move it between AWS databases and data stores, including Amazon DynamoDB and Amazon Simple Storage Service (S3).

In this article:

Getting Started With Amazon EMR
Tutorial: Getting Started With Amazon EMR

Getting Started with Amazon EMR

Use the following steps to sign up for Amazon Elastic MapReduce:

Go to the Amazon EMR page: http://aws.amazon.com/emr.
Click on the Sign Up Now button.
If you have not signed up for Amazon S3 and EC2, the EMR sign-up process prompts you to do so.

AWS lets you deploy workloads to Amazon EMR using any of these options:

Amazon EC2
On-premises AWS Outposts
Amazon Elastic Kubernetes Service (EKS)

Once you set this up, you can start running and managing workloads using the EMR Console, API, CLI, or SDK. You can use Managed Workflows for Apache Airflow (MWAA) or Step Functions to orchestrate your workloads. Additionally, AWS recommends SageMaker Studio or EMR Studio for an interactive user experience.

Tutorial: Getting Started with Amazon EMR

Here is a tutorial on how to set up and manage an Amazon Elastic MapReduce (EMR) cluster.

[Figure: a typical EMR workflow. Image source: AWS]

Step 1: Plan and Configure an EMR Cluster

To set up a cluster:

Determine your data processing needs: First, consider the size and complexity of your data, as well as the type of processing you need to perform (e.g., batch processing, stream processing, machine learning). This will help you determine the number and type of Amazon Elastic Compute Cloud (EC2) instances you’ll need in your EMR cluster.
Choose a cluster configuration: EMR offers a range of cluster configurations, including on-demand and spot instances, instance fleets, and custom AMIs. Select the configuration that best meets your needs and budget.
Select an EMR release and applications: Each EMR release bundles specific versions of Hadoop and of applications such as Apache Spark and Apache Flink. Select the release and applications that are best suited to your data processing needs.
Choose a storage option: EMR supports a range of storage options, including Amazon S3, Amazon EBS, and local storage. Select the option that is most appropriate for your data and processing needs.
Configure security and access: EMR clusters can be configured with a variety of security and access options, such as security groups, network ACLs, and IAM roles. Select the options that provide the appropriate level of security for your data and applications.
Launch and monitor the cluster: Once you have configured your EMR cluster, you can launch it and begin processing and analyzing your data. You can monitor the cluster using the EMR console, CloudWatch, and other tools to ensure that it is running smoothly and efficiently.
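
The planning choices above map fairly directly onto the parameters of a cluster launch request. The following is a minimal sketch in Python, assuming the boto3 SDK is installed and AWS credentials are configured. The helper names, cluster name, release label, instance types, and bucket name are placeholders chosen for illustration, not values from this tutorial, and the role names assume the EMR default roles already exist in your account.

```python
def build_cluster_request(name, log_bucket, core_count=2, use_spot_core=False):
    """Build keyword arguments for the EMR RunJobFlow API call.

    Every concrete value here (release label, instance types, roles) is an
    illustrative assumption; adjust it to your own workload and account.
    """
    core_market = "SPOT" if use_spot_core else "ON_DEMAND"
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # example release label (assumption)
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary node",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core nodes",
                    "InstanceRole": "CORE",
                    "Market": core_market,
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": core_count,
                },
            ],
            # Keep the cluster alive after its steps finish so more can be added.
            "KeepJobFlowAliveWhenNoSteps": True,
            "TerminationProtected": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed to exist
        "ServiceRole": "EMR_DefaultRole",      # assumed to exist
    }


def launch_cluster(request, region="us-east-1"):
    """Send the request to EMR and return the new cluster ID."""
    import boto3  # imported lazily so the builder works without the SDK

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**request)
    return response["JobFlowId"]
```

Keeping the request builder separate from the API call makes it possible to review or log the configuration before anything is launched. A call such as launch_cluster(build_cluster_request("my-first-cluster", "my-log-bucket")) would start a billable cluster, so it is left out of the sketch.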

Step 2: Manage the EMR Cluster

Use the following options to manage your cluster:

The EMR console: A web-based interface that allows you to create, configure, and manage EMR clusters. You can use the console to view the status of your cluster, monitor its health and performance, and troubleshoot any issues that may arise.
CloudWatch: A monitoring service that allows you to monitor the health and performance of your EMR cluster in near real time. You can use CloudWatch to set alarms and receive notifications when certain conditions are met, such as when the cluster is running low on resources or when an error occurs.
AWS CLI: The aws emr commands of the AWS Command Line Interface allow you to manage your EMR clusters from the command line. You can use the CLI to perform tasks such as creating and terminating clusters, adding and removing instances, and submitting and canceling steps.
EMR API: Allows you to manage your EMR clusters programmatically using a variety of programming languages. You can use the API to automate common tasks, such as creating and scaling clusters, and to integrate EMR with other AWS services.
EMR Studio: A web-based integrated development environment (IDE) that allows you to develop, test, and debug big data applications using EMR. You can use EMR Studio to create and edit code, run and debug jobs, and collaborate with other developers.
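
As an illustration of the programmatic option, the sketch below submits a Spark step to a running cluster through the EMR API and waits for it to finish. It assumes boto3 is installed and credentials are configured. The function names, script location, and arguments are hypothetical and stand in for your own job.

```python
def build_spark_step(name, script_s3_uri, args=()):
    """Describe a step that runs a PySpark script stored in S3.

    command-runner.jar is the EMR-provided launcher that executes the
    given command (here spark-submit) on the cluster.
    """
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri, *args],
        },
    }


def submit_step(cluster_id, step, region="us-east-1"):
    """Add the step to the cluster, block until it completes, return its ID."""
    import boto3  # imported lazily so the builder works without the SDK

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    step_id = response["StepIds"][0]
    # The waiter polls the step status and raises if the step fails or is cancelled.
    emr.get_waiter("step_complete").wait(ClusterId=cluster_id, StepId=step_id)
    return step_id
```

The same step definition can be passed in the Steps parameter when the cluster is created, which suits short-lived clusters that exist only to run one job.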

Here is an example of how to view the output of a “step” in Amazon EMR using Amazon Simple Storage Service (S3):

Go to the AWS website and sign in to your AWS account. Then, navigate to the EMR console by clicking the Services menu and selecting EMR under the Analytics category.
On the EMR dashboard, select the cluster that contains the step whose results you want to view.
On the cluster details page, click the Steps tab to view a list of all of the steps that have been run on the cluster.
Locate the step whose results you want to view in the list of steps. The status of the step will be displayed next to it.
To view the results of the step, click on the step to open the step details page. On the step details page, you will see a section called Output location, which displays the S3 bucket and folder where the results of the step are stored. You can click on the link to open the S3 console and view the contents of the folder.
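
The same check can be scripted. The sketch below is one possible approach, assuming boto3 is installed and that you already know the S3 location your step writes to: it lists each step with its state, then lists the objects under the output location. The function names and the example URI in the comments are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_step_states(cluster_id, region="us-east-1"):
    """Return (step name, state) pairs for every step on the cluster."""
    import boto3  # imported lazily so split_s3_uri works without the SDK

    emr = boto3.client("emr", region_name=region)
    steps = []
    for page in emr.get_paginator("list_steps").paginate(ClusterId=cluster_id):
        steps.extend((s["Name"], s["Status"]["State"]) for s in page["Steps"])
    return steps


def list_output_keys(output_uri):
    """Return the keys of all objects stored under the step's output location."""
    import boto3

    bucket, prefix = split_s3_uri(output_uri)
    s3 = boto3.client("s3")
    keys = []
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        # "Contents" is absent when a page has no matching objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

For example, list_output_keys("s3://my-bucket/out/") would return the result files a step wrote to that hypothetical folder, provided the step has completed.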

Step 3: Clean Up Your Amazon EMR Resources

By regularly reviewing your EMR resources and deleting those that are no longer needed, you can ensure that you are not incurring unnecessary costs, maintain the security of your cluster and data, and manage your data effectively. To clean up resources:

Terminate the cluster: When you no longer need a cluster, you can terminate it to release the associated EC2 instances and EBS volumes. To do this, use the EMR console, the AWS CLI, or the EMR API.
Delete unused Amazon S3 buckets: If you have created Amazon S3 buckets to store data or intermediate results for your EMR cluster, make sure to delete these buckets when they are no longer needed. You can do this using the Amazon S3 console or the Amazon S3 API.
Delete security groups, IAM roles, and network ACLs: Make sure to delete these resources when they are no longer needed. This helps prevent unauthorized access to your cluster and keeps stale permissions out of your account.
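
Cluster termination is the easiest of these cleanup tasks to automate. The following sketch, assuming boto3 and configured credentials, terminates clusters that are idle (in the WAITING state) and whose names start with a given prefix. The prefix convention and function names are assumptions made for this example; restricting by name is a precaution so that unrelated clusters are not shut down.

```python
def select_clusters_to_terminate(clusters, name_prefix):
    """Pick the IDs of idle clusters whose names start with name_prefix.

    `clusters` is a list of cluster summaries shaped like the entries
    returned by the EMR ListClusters API.
    """
    return [
        c["Id"]
        for c in clusters
        if c["Name"].startswith(name_prefix) and c["Status"]["State"] == "WAITING"
    ]


def terminate_idle_clusters(name_prefix, region="us-east-1"):
    """Terminate matching idle clusters and return their IDs."""
    import boto3  # imported lazily so the selector works without the SDK

    emr = boto3.client("emr", region_name=region)
    summaries = []
    pages = emr.get_paginator("list_clusters").paginate(ClusterStates=["WAITING"])
    for page in pages:
        summaries.extend(page["Clusters"])
    cluster_ids = select_clusters_to_terminate(summaries, name_prefix)
    if cluster_ids:
        emr.terminate_job_flows(JobFlowIds=cluster_ids)
    return cluster_ids
```

Termination is asynchronous, so the clusters pass through a TERMINATING state before they reach TERMINATED.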

To delete Amazon Simple Storage Service (S3) resources, you can use the Amazon S3 console, the Amazon S3 API, or the AWS Command Line Interface (CLI). Here are the steps to delete S3 resources using the Amazon S3 console:

Go to the Amazon S3 console at https://console.aws.amazon.com/s3/.
In the Buckets or Objects list, select the checkbox next to the resource you want to delete. You can select multiple resources by holding down the “Ctrl” key and clicking the checkboxes.
Once you have selected the resources you want to delete, click the Delete button.
A dialog box will appear asking you to confirm the deletion. Click Delete to confirm.

Please note that unless versioning is enabled on the bucket, a deleted S3 object is permanently removed and cannot be recovered. Be careful when deleting resources, as you may lose important data if you delete the wrong ones by accident.

If you want to delete all of the objects in an S3 bucket, but not the bucket itself, you can use the “Empty bucket” feature in the Amazon S3 console. This will delete all of the objects in the bucket, but the bucket itself will remain. You can then delete the empty bucket if you no longer need it.
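
The empty-then-delete sequence can also be performed through the S3 API. The sketch below assumes boto3 is installed, credentials are configured, and the bucket is not versioned; the function names are hypothetical. It deletes objects in batches because a single DeleteObjects request accepts at most 1,000 keys.

```python
def chunked(keys, size=1000):
    """Yield successive batches of at most `size` keys."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def empty_bucket(bucket, also_delete_bucket=False):
    """Delete every object in the bucket and optionally the bucket itself.

    Returns the number of objects deleted. This sketch does not remove
    old object versions, so it is only suitable for unversioned buckets.
    """
    import boto3  # imported lazily so chunked() works without the SDK

    s3 = boto3.client("s3")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    for batch in chunked(keys):
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in batch]},
        )

    if also_delete_bucket:
        # A bucket can only be deleted once it is empty.
        s3.delete_bucket(Bucket=bucket)
    return len(keys)
```

Leaving also_delete_bucket at its default of False mirrors the console's behavior of emptying a bucket while keeping the bucket itself.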

Optimizing Amazon EMR With Granulate

Granulate is designed for Amazon EMR workloads that process large data sets. It optimizes YARN on EMR by tuning resource allocation autonomously and continuously, so that data engineering teams don't need to monitor and tune the workload manually again and again. Granulate also optimizes the JVM runtime for EMR workloads.

Omer Mesika

Director of Solution Engineering, Intel Granulate