Hadoop: Ultimate Guide for 2023
What Is Apache Hadoop?
Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications. It operates on a scalable cluster of computer servers.
Hadoop is primarily used for advanced analytics applications like predictive analytics, data mining, and machine learning. Because Hadoop systems can handle many types of structured and unstructured data, they provide more flexibility in data collection, processing, analysis, and management than relational databases and data warehouses.
In this article:
- How Does Hadoop Work?
- Hadoop Architecture and Components
- Hadoop Advantages and Challenges
- Hadoop in the Cloud
- Hadoop Alternatives
- Hadoop Best Practices
How Does Hadoop Work?
Hadoop is a framework for distributing large datasets on commodity hardware clusters. Hadoop processes data concurrently on multiple servers.
Here is a simplified description of the Hadoop workflows:
- The client sends data and programs to Hadoop.
- Hadoop File System (HDFS), a core component of Hadoop, handles metadata and manages the distributed file system.
- Hadoop MapReduce then processes and transforms the input and output data.
- YARN, the resource management component, divides the work between clusters.
Hadoop enables efficient use of commodity hardware, with high availability and built-in point of failure detection. It can provide fast response times, even when querying very large data volumes, due to its distributed architecture.
Hadoop Architecture and Components
The Hadoop framework stores and manages big data using parallel processing and distributed storage. There are three types of nodes in a Hadoop cluster:
Master nodes control data storage and parallel data processing functions. They require the best physical hardware resources.
Worker nodes run computations and store data based on a master node’s instructions (most nodes are worker nodes).
Client nodes (also known as edge or gateway nodes) serve as the interface between Hadoop clusters and external networks. They load data into clusters, describe how it should be processed, and retrieve the output.
Hadoop clusters have three functional layers: a storage layer (HDFS), a resource management layer (YARN), and a processing layer (MapReduce). These layers require master-worker interactions.
Hadoop Distributed File System (HDFS) is the storage layer and the framework’s backbone. It manages and stores data in blocks distributed across multiple computers. The default block size is 128MB, although you can easily change this in the config file.
HDFS uses a “write once, read multiple times” approach— you cannot modify stored files, but you can analyze them repeatedly. It enables fast retrieval of an entire data set instead of individual records.
HDFS blocks are automatically replicated across multiple worker nodes (ideally three) for fault tolerance. A backup is always available if one node fails.
The HDFS master node, NameNode, includes metadata about files and tracks the storage capacity. The worker nodes (DataNodes) contain blocks with large files, sending signals to the master node every three seconds.
Yet Another Resource Negotiator (YARN) is Hadoop’s resource management component. It acts as an operating system responsible for scheduling jobs and managing resources to avoid overloading individual machines.
When client machines make a query or fetch code for analysis, the resource manager allocates and manages the resources. All nodes have node managers that monitor their resource usage. The master node requests container resources from the node managers when job requests come in— the resources then go back to YARN.
MapReduce is Hadoop’s processing unit. The worker nodes perform the processing and send the final results to a master node. MapReduce uses coded data to process all the data— a small piece of code (i.e., several KB) can perform heavy-duty processing tasks.
The “map” phase assigns key-value pairs to the data, which are then sorted. The “reduce” phase aggregates the data to obtain a final output.
Learn more in our detailed guide to Hadoop architecture (coming soon)
Hadoop Advantages and Challenges
Let’s review some of the advantages and challenges users face when using Hadoop.
- MapReduce— Hadoop is based on the powerful Map/Reduce algorithm, which can be accessed through Java or Apache Pig. Hadoop performs parallel processing efficiently at almost any scale.
- Cost-effective— traditionally, it is expensive to store large amounts of data. Hadoop solves this problem by storing all raw data for access when needed.
- Highly available— HDFS can run more than one redundant NameNode (HDFS main node) in the same cluster. This allows for quick failover in the event of a system crash or failure.
- Scalabile— users can easily increase storage and processing power by adding more nodes.
- Flexible— Hadoop can handle both structured and unstructured data. It doesn’t require conversion between data formats.
- Active community— Hadoop has a large user base, so it is easy to find helpful documentation and community help when facing problems.
- Rich ecosystem— Hadoop has numerous companion tools and services that can be easily integrated into the platform.
- Small files— HDFS is designed for large scale and lacks the ability to support small files, smaller than the default HDFS block size of 128 MB.
- No real-time processing— Hadoop does not support real-time data processing. It can be combined with Apache Spark or Apache Flink to speed up data processing.
- Security— Hadoop lacks encryption at the storage and network level, putting data at risk. Spark provides important security features that partially solve this problem.
- Response time— the Map/Reduce programming framework is sometimes slow to respond.
- Learning curve— Hadoop has many different modules and services that take a long time to master.
- Complex interface— the native Hadoop interface is not very intuitive, so it may take some time to get used to the platform. However, many tools, extensions, and managed services are available that make Hadoop easier to use.
Hadoop in the Cloud
Hadoop on AWS
Amazon Elastic Map/Reduce (EMR) is a managed service that allows you to process and analyze large datasets using the latest versions of big data processing frameworks such as Apache Hadoop, Spark, HBase, and Presto, on fully customizable clusters.
Key features include:
- Ability to launch Amazon EMR clusters in minutes, with no need to manage node configuration, cluster setup, Hadoop configuration or cluster tuning.
- Simple and predictable pricing— flat hourly rate for every instance-hour, with the ability to leverage low-cost spot Instances.
- Ability to provision one, hundreds, or thousands of compute instances to process data at any scale.
- Amazon provides the EMR File System (EMRFS) to run clusters on demand based on persistent HDFS data in Amazon S3. When the job is done, users can terminate the cluster and store the data in Amazon S3, paying only for the actual time the cluster was running.
Learn more in our detailed guide to Hadoop on AWS (coming soon)
Hadoop on Azure
Azure HDInsight is a managed, open-source analytics service in the cloud. HDInsight allows users to leverage open-source frameworks such as Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, and more, running them in the Azure cloud environment.
Azure HDInsight is a cloud distribution of Hadoop components. It makes it easy and cost-effective to process massive amounts of data in a customizable environment. HDInsights supports a broad range of scenarios such as extract, transform, and load (ETL), data warehousing, machine learning, and IoT.
Here are notable features of Azure HDInsight:
- Read and write data stored in Azure Blob Storage and configure several Blob Storage accounts.
- Implement the standard Hadoop FileSystem interface for a hierarchical view.
- Choose between block blobs to support common use cases like MapReduce and page blobs for continuous write use cases like HBase write-ahead log.
- Use wasb scheme-based URLs to reference file system paths, with or without SSL encrypted access.
- Set up HDInsight as a data source in a MapReduce job or a sink.
HDInsight was tested at scale and tested on Linux as well as Windows.
Learn more in our detailed guide to Hadoop on Azure (coming soon)
Hadoop on Google Cloud
Google Dataproc is a fully-managed cloud service for running Apache Hadoop and Spark clusters. It provides enterprise-grade security, governance, and support, and can be used for general purpose data processing, analytics, and machine learning.
Dataproc uses Cloud Storage (GCS) data for processing and stores it in GCS, Bigtable, or BigQuery. You can use this data for analysis in your notebook and send logs to Cloud Monitoring and Logging.
Here are notable features of Dataproc:
- Supports open source tools, such as Spark and Hadoop.
- Lets you customize virtual machines (VMs) to can scale up and down to meet changing needs.
- Provides on-demand ephemeral clusters to help you reduce costs.
- Integrates tightly with Google Cloud services.
The Spark framework is the most popular alternative to Hadoop. Apache created it as an attachable batch processing system for Hadoop, but it now works as a standalone. Its main advantage over Hadoop is the support for stream (real-time) processing, a growing focus of software companies, given the rise of AI and deep learning.
Spark supports stream processing by relying on in-memory, not disk-based processing. This approach enables a much larger throughput than Hadoop.
Learn more in our detailed guide to Hadoop vs Spark (coming soon)
Storm is another Apache tool designed for real-time processing. It uses workflow topologies that run continuously until a system shutdown or disruption. Storm reads and writes files to HDFS but cannot run on a Hadoop cluster (relying on Zookeeper instead).
A major difference between Storm and Hadoop is how they process data— Hadoop ingests data and distributes it across nodes for processing before pulling it back to HDFS for further use. Storm lacks this discrete beginning and end for processing data— it transforms and analyzes the data fed to it in a continuous stream, enabling complex event processing (CEP).
This platform stores objects on a single node distributed across the network, making object, file, and block-level storage easier. Its main distinguishing feature from Hadoop is the fully distributed architecture with no single point of failure.
Ceph replicates data and is fault-tolerant, eliminating the need for specialized hardware. It helps reduce administration costs, enabling fast identification and fixing of server cluster errors. You can access Ceph storage from Hadoop without HDFS.
Ceph performs better than Hadoop for handling large file systems. The centralized design of HDFS creates a single point of failure, making it less suitable for organizing data into files and folders.
This distributed processing system can handle many big data tasks better than Hadoop. It supports batch operations and streaming. Hydra stores and processes data in trees across many clusters and can handle clusters with hundreds of nodes.
Hydra has a cluster manager to rebalance and allocate jobs to clusters automatically. It uses data replication and automatic node failure handling to achieve fault tolerance.
Learn more in our detailed guide to Hadoop alternatives (coming soon)
5 Hadoop Best Practices
1. Choose the Right Hardware for a Hadoop Cluster
Many organizations struggle with setting up their Hadoop infrastructure. When choosing the right hardware for a Hadoop cluster, consider the following:
- The amount of data the cluster will process.
- The type of workload handled by the cluster (CPU bound or I/O bound).
- How the data is stored— the data container, and the data compression used (if any).
- A data retention policy that indicates how long to retain data before it is refreshed.
2. Manage Hadoop Clusters for Production Workloads
Key requirements for Hadoop clusters in production include:
- 24/7 high availability of clusters with automated resource provisioning. This should include redundant HDFS NameNodes with load balancing, hot standby, synchronization, and automatic failover.
- Continuous workload management including security, health monitoring, performance optimization, job scheduling, policy management, backup and recovery.
- Policy-based controls to prevent applications from disproportionately sharing resources across a Hadoop cluster.
- Regression tests to manage the distribution of the software layer in your Hadoop cluster. This ensures tasks and data do not clash or become bottlenecks in day-to-day operations.
These capabilities can be difficult to achieve in a typical organization, and are typically provided by Hadoop management tools. Another option is to use managed Hadoop services, which provide many of these capabilities built-in.
Learn more in our detailed guide to Hadoop cluster (coming soon)
3. Use Skewed Joins
When using Pig or Hive, standard joins in transformation can result in erratic performance for Map/Reduce jobs, because it might result in too much data being sent to one reducer. If there is too much data associated with one key, one of the reducers will receive too much data and become a bottleneck.
Skewed joins can solve this problem, by compute histograms that determine which keys dominate, and partitioning the data in such a way that it will be evenly distributed between reducers.
4. Write a Combiner
In addition to data compression techniques, it is beneficial to create “combiners” that can reduce the amount of data transferred within a Hadoop environment. Combiners can help optimize performance, especially if a job performs large shuffles and the map output is at least a few GB per node, or if a job performs aggregate sorts.
Combiners act as an optimizer for Map/Reduce jobs. They work on the output of the Map step to reduce the number of intermediate keys passed to the reducer. This offloads business logic from the Reduce step, improving performance.
5. Securing Hadoop with Knox and Ranger
The Hadoop ecosystem has two important projects that can help improve security: Knox and Ranger.
The Ranger project helps users deploy and standardize security across Hadoop clusters. It provides a centralized framework for managing policies at the resource level, such as files, folders, databases, or specific rows or columns in the database.
Ranger helps administrators enforce access policies based on groups, data types, and more. Ranger has different authorization capabilities for various Hadoop components such as YARN, HBase and Hive.
Know is a REST API gateway developed within the Apache community, which supports monitoring, authorization management, auditing, and policy enforcement of Hadoop clusters. It provides a single point of access for all REST interactions with the cluster.
With Knox, system administrators can manage authentication via LDAP and Active Directory, implement HTTP header-based federated identity management, and audit hardware in a cluster. Knox can integrate with enterprise identity management solutions and is Kerberos compatible, enhancing security in Windows environments.
Learn more in our detailed guides to
- Hadoop tools (coming soon)
- Hadoop security (coming soon)