Running Hadoop on AWS: The Basics and 5 Tips for Success
What is Hadoop on AWS?
You can run Apache Hadoop on AWS using Amazon EMR, a managed service for processing and analyzing large datasets. Amazon EMR is not limited to Hadoop; it also supports other big data frameworks such as Apache Spark, Presto, and HBase.
Hadoop is an open source implementation of MapReduce, the programming model Google developed for large-scale jobs such as building its search index. It lets organizations process data in parallel across a large number of computers, making it possible to analyze huge datasets quickly.
In this article:
- Hadoop on AWS Components
- AWS EMR Best Practices
Hadoop on AWS Components
Amazon EMR programmatically installs and configures the applications in the Hadoop project, including YARN, Hadoop MapReduce, Apache Tez, and HDFS, across all nodes in a cluster.
Hadoop MapReduce and Tez help process workloads, and YARN helps manage resources when using Hadoop 2 and later versions.
Hadoop MapReduce and Tez
Hadoop MapReduce and Tez are execution engines that process workloads. They use frameworks that break jobs into smaller pieces that can be distributed across nodes in an EMR cluster. Here is what you should expect from these engines:
- Fault tolerance: MapReduce and Tez are built on the assumption that any machine in a cluster can fail at any time. If a server running a task fails, Hadoop reruns the task on another machine until it completes.
- Compatible with tools: you can write Tez and MapReduce programs in Java, or use Hadoop Streaming to execute custom scripts in parallel. Hive and Pig provide higher-level abstractions over Tez and MapReduce.
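To make the Hadoop Streaming idea concrete, here is a minimal word-count sketch in Python. It assumes the usual streaming convention of tab-separated key/value lines; the function names (`map_lines`, `reduce_records`, `run_locally`) are illustrative only, and the local run simulates the shuffle step with a plain sort rather than invoking Hadoop.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one tab-separated "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_records(sorted_records):
    """Reducer: sum the counts for each word; input must be sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> shuffle/sort -> reduce inside a single process."""
    return list(reduce_records(sorted(map_lines(lines))))


# In a real streaming job the mapper and reducer would each read sys.stdin
# and print their records; here we feed two sample lines directly.
result = run_locally(["Hadoop on AWS", "hadoop on EMR"])
```

On a cluster, the mapper and reducer would typically live in separate scripts, and Hadoop would perform the sort between them across nodes.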
Yet Another Resource Negotiator (YARN)
Hadoop 2 and later versions manage resources with YARN, using it to keep track of all resources across a cluster. It ensures resources are dynamically allocated to accomplish all tasks in a processing job. YARN can manage Tez and MapReduce workloads and other distributed frameworks like Apache Spark.
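The idea of tracking free resources and assigning work to nodes can be illustrated with a toy model. The sketch below is a simple first-fit placement by memory, not YARN's actual scheduler; the `allocate` function, the node names, and the memory figures are all hypothetical.

```python
def allocate(containers_mb, node_free_mb):
    """First-fit placement: put each container on the first node with room."""
    free = dict(node_free_mb)  # copy so the caller's data is not mutated
    placement = {}
    for index, needed in enumerate(containers_mb):
        for node, available in free.items():
            if available >= needed:
                free[node] = available - needed
                placement[index] = node
                break
        else:
            placement[index] = None  # no node can host this container now
    return placement, free


# Three container requests (in MB) against two nodes with free memory (in MB).
placement, remaining = allocate(
    [2048, 4096, 4096],
    {"node-1": 6144, "node-2": 4096},
)
```

A real resource manager also accounts for CPU, queues, priorities, and containers that finish and release their resources.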
Storage on AWS
You can use the EMR File System (EMRFS) on an EMR cluster to incorporate Amazon S3 as a data layer for Hadoop. Amazon S3 offers highly scalable, affordable, and durable storage that can work well for big data processing.
Storing data on S3 lets you decouple the compute layer from the storage layer. You can use it to size an EMR cluster for the required amount of memory and CPU for workloads rather than setting up extra nodes in the cluster to maximize on-cluster storage. It also enables you to terminate an idle EMR cluster to save costs while keeping your data in S3.
Optimized for Hadoop
EMRFS is optimized for Hadoop to read from and write to S3 directly and in parallel. It can also process objects encrypted with S3 client-side and server-side encryption. EMRFS enables you to use S3 as a data lake and set up Hadoop in EMR as an elastic query layer.
The Hadoop distributed file system (HDFS) stores data in large blocks across your cluster’s local disks. It includes a configurable replication factor that increases availability and durability. HDFS can monitor replication and balance data across nodes as some nodes fail and new nodes are added.
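The effect of block size and replication on storage can be worked out with simple arithmetic. The sketch below assumes a 128 MiB block size and a replication factor of 3, which are common stock Hadoop defaults; your cluster may be configured differently, and `hdfs_footprint` is a hypothetical helper name.

```python
def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Estimate block count and raw disk usage for one file stored in HDFS."""
    blocks = -(-file_bytes // block_bytes)  # integer ceiling division
    return {
        "blocks": blocks,                        # logical blocks in the file
        "block_replicas": blocks * replication,  # block copies on local disks
        "raw_bytes": file_bytes * replication,   # total disk space consumed
    }


# A 1 GiB file under the assumed defaults.
footprint = hdfs_footprint(1024**3)
```

Under these assumptions, a 1 GiB file splits into 8 blocks, is stored as 24 block replicas, and consumes 3 GiB of raw disk across the cluster.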
HDFS is installed with Hadoop on an EMR cluster, allowing you to use HDFS with S3 to store input and output data. You can use an EMR security configuration to encrypt HDFS. EMR also configures Hadoop to use HDFS and local disk for intermediate data created during Hadoop MapReduce jobs, even if the input data is located in S3.
AWS EMR Best Practices
1. Run Your Cluster In a VPC
Instead of using EC2-Classic, you can use the EC2-VPC platform to launch and manage AWS EMR clusters. Here are several advantages:
- Improves networking infrastructure—it provides capabilities like network isolation, private IP addresses, and private subnets.
- Flexible control over access security—lets you use network ACLs and security group outbound/egress traffic filtering.
- EC2 instance types—provides access to powerful, newer EC2 instance types such as C4, R4, and M4 for clusters.
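As a rough sketch of how a cluster is placed in a VPC, the snippet below builds the `Instances` argument that the boto3 EMR `run_job_flow` call accepts, with `Ec2SubnetId` selecting the subnet. The subnet ID, instance type, and node count are placeholders, and `build_instances` is a hypothetical helper; nothing here contacts AWS.

```python
def build_instances(subnet_id, instance_type="m4.large", count=3):
    """Build the Instances section of an EMR run_job_flow request."""
    return {
        "MasterInstanceType": instance_type,
        "SlaveInstanceType": instance_type,
        "InstanceCount": count,
        "Ec2SubnetId": subnet_id,  # launches the cluster inside this VPC subnet
        "KeepJobFlowAliveWhenNoSteps": False,
    }


# Placeholder subnet ID; substitute a subnet from your own VPC.
instances = build_instances("subnet-0123456789abcdef0")
```

The resulting dictionary would be passed as `Instances=instances` alongside the other required arguments when creating the cluster.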
2. Monitor EMR Instance Counts
You should monitor, and set limits on, the maximum number of EMR cluster instances provisioned within an AWS account. Doing so helps you manage EMR compute resources more accurately, prevent unexpected expenses, and respond quickly to attacks. You can also restrict the number of instances per user so that a compromised account cannot be used to launch a large-scale attack.
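A limit check of this kind can be sketched in a few lines. The example assumes you have already collected the running instance count for each cluster (for instance, from the EMR API or your monitoring tool); the cluster IDs, the limit, and the `check_instance_limit` helper are hypothetical.

```python
def check_instance_limit(counts_by_cluster, limit):
    """Compare the total number of running EMR instances to an account limit."""
    total = sum(counts_by_cluster.values())
    return {
        "total": total,
        "limit": limit,
        "over_limit": total > limit,
        # Largest clusters first, to show where to look when over the limit.
        "largest": sorted(counts_by_cluster, key=counts_by_cluster.get, reverse=True),
    }


# Hypothetical cluster IDs and counts.
report = check_instance_limit({"j-AAA": 12, "j-BBB": 30, "j-CCC": 5}, limit=40)
```

In practice, a report like this would feed an alert rather than being inspected by hand.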
3. Use EMR Cluster Logging
By default, AWS automatically deletes all EMR log files from clusters after the retention period ends. When you enable cluster logging, Amazon EMR uploads the log files from the cluster's master instance(s) to S3. This ensures you can later use the logging data, including step, Hadoop, and instance state logs, for troubleshooting and compliance.
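One way to enable cluster logging is to set `LogUri` when the cluster is created. The sketch below only assembles the request arguments; the cluster name, release label, and bucket are placeholders, and `with_log_uri` is a hypothetical helper.

```python
def with_log_uri(request, bucket, prefix="emr-logs"):
    """Return a copy of the request with LogUri set to s3://bucket/prefix/."""
    updated = dict(request)
    updated["LogUri"] = f"s3://{bucket}/{prefix.strip('/')}/"
    return updated


# Placeholder name, release label, and bucket; replace with your own values.
request = with_log_uri(
    {"Name": "example-cluster", "ReleaseLabel": "emr-6.15.0"},
    bucket="example-log-bucket",
)
# With boto3 installed and credentials configured, the request (plus the
# Instances, JobFlowRole, and ServiceRole arguments) could be passed as:
#   boto3.client("emr").run_job_flow(**request)
```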
4. Use EMR In-transit and At-rest Encryption
You should always implement encryption when working with production data. It helps protect your data from unauthorized access and satisfies compliance requirements for data-in-transit and data-at-rest encryption. Many regulations, including GDPR and HIPAA, require protection of sensitive data that can potentially identify an individual.
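In EMR, both kinds of encryption are typically declared in a security configuration, which is a JSON document. The sketch below builds one in Python; the field names follow the EMR security configuration format as commonly documented, but verify them against the current AWS documentation. The certificate location and KMS key ARN are placeholders.

```python
import json

# Placeholder values; substitute your own certificate bundle and KMS key.
CERT_BUNDLE = "s3://example-config-bucket/certs/my-certs.zip"
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

security_config = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": True,
        "EnableAtRestEncryption": True,
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": CERT_BUNDLE,
            }
        },
        "AtRestEncryptionConfiguration": {
            # Server-side encryption for data that EMRFS writes to S3.
            "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
            # Encryption for local disks on the cluster nodes.
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": KMS_KEY_ARN,
            },
        },
    }
}

# Serialize to the JSON text that a security configuration expects.
security_config_json = json.dumps(security_config, indent=2)
```

The JSON string would then be supplied when creating a security configuration, and the configuration's name referenced when launching the cluster.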
5. Consider Compressing Mapper Outputs in Memory
EMR lets you compress the output of map tasks, which matters for large jobs that must complete quickly. When you need to map a large amount of data, compressing mapper output can reduce how much of that output is written to disk and moved across the network during the shuffle. You can enable this option in the cluster's Hadoop configuration through the mapreduce.map.output.compress property.
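One way to express this setting is through an EMR configuration object using the `mapred-site` classification and the standard Hadoop properties `mapreduce.map.output.compress` and `mapreduce.map.output.compress.codec`. The sketch below builds that object in Python; the choice of Snappy as the codec is an assumption for illustration, not a recommendation from the text above.

```python
import json

# Configuration list in the shape EMR accepts when a cluster is created.
map_output_compression = [
    {
        "Classification": "mapred-site",
        "Properties": {
            # Turn on compression of intermediate map output.
            "mapreduce.map.output.compress": "true",
            # Assumed codec; other Hadoop compression codecs can be used.
            "mapreduce.map.output.compress.codec":
                "org.apache.hadoop.io.compress.SnappyCodec",
        },
    }
]

# JSON form, suitable for saving to a configurations file.
configurations_json = json.dumps(map_output_compression, indent=2)
```

Property values are written as strings because Hadoop configuration properties are string-valued.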