
Hadoop Cluster: Architecture, Pros/Cons, and a Quick Tutorial

Alon Berger

Product Marketing Manager, Intel Granulate

What Is a Hadoop Cluster? 

A Hadoop cluster is a group of machines used for storing and analyzing unstructured data in a distributed environment. It uses the Hadoop software framework to process large data sets across clusters of computers using simple programming models. A key feature of Hadoop clusters is their ability to scale up from a single server to thousands of machines, each offering local computation and storage.

The architecture of a Hadoop cluster ensures that any failures in individual components (such as hard drives or nodes) do not affect the availability or processing of data. This resilience is achieved through data replication and fault tolerance mechanisms. This makes Hadoop highly reliable for processing and storing large datasets. 


What Is a Hadoop Cluster Used For? 

Hadoop clusters are primarily used for big data analytics, processing vast amounts of unstructured and semi-structured data from various sources such as social media, eCommerce, sensors, and IoT devices. 

They enable organizations to run analytical algorithms on large datasets to uncover patterns, trends, and insights that were previously hidden or too complex to extract. This capability is useful in fields like marketing analysis, financial forecasting, health informatics, and scientific research.

Hadoop clusters also enable cost-effective scaling. As data volume grows, additional nodes can be added to the cluster using commodity hardware. This scalability allows organizations to keep pace with their data processing needs without incurring the high costs associated with traditional relational database systems or data warehouses. 

Hadoop Cluster Architecture

Hadoop clusters include the following components.

Master Nodes 

Master nodes manage cluster operations, including storing data in the Hadoop Distributed File System (HDFS) and executing parallel computations on this data using frameworks like MapReduce. These nodes coordinate the activities of worker nodes, ensuring efficient task allocation and resource management. 

They host critical components such as the NameNode, which manages the filesystem metadata, and the JobTracker, which oversees job scheduling and task distribution across worker nodes. (In Hadoop 2 and later, the JobTracker’s responsibilities are handled by YARN’s ResourceManager.)

To maintain high availability and reliability, master nodes are typically configured on machines with superior hardware specifications compared to worker or client nodes. This ensures that management tasks do not become bottlenecks during intensive data processing activities.  

Worker Nodes 

Worker nodes execute the tasks assigned by the master nodes. Each worker node runs a DataNode and a TaskTracker service (in Hadoop 2 and later, the TaskTracker’s role is filled by YARN’s NodeManager). The DataNode manages data storage within HDFS, while the TaskTracker oversees task execution. This dual role allows worker nodes to handle both data storage and processing, enabling parallel computation.

The capacity of a Hadoop cluster to process vast amounts of data relies on these worker nodes. Their configuration on commodity hardware allows for cost-effective scaling, ensuring that as data processing demands increase, additional nodes can be added without significant investment.  

Client Nodes 

Client nodes interface with the Hadoop cluster, initiating and managing the execution of jobs. They are responsible for loading data into the cluster, submitting MapReduce jobs to master nodes, and retrieving the results upon completion. This interaction is enabled through command-line tools or APIs provided by Hadoop, allowing users to specify job configurations and data processing logic.

Unlike master and worker nodes, client nodes do not store data or execute tasks within the cluster’s distributed computing environment. They serve as a conduit between the user and the cluster, translating job requests into actions executed by worker nodes under the coordination of master nodes.
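
To make the client node’s role concrete, the following is a minimal sketch of the typical command-line flow from a client node. The file paths, jar name, and class name are placeholders for illustration, not part of any specific deployment:

# Load data into HDFS from the client machine
$ hdfs dfs -put local_data.csv /user/analyst/input
# Submit a MapReduce job packaged in an application jar
$ hadoop jar my-analytics-job.jar com.example.MyJob /user/analyst/input /user/analyst/output
# Retrieve the results once the job completes
$ hdfs dfs -get /user/analyst/output ./results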

Learn more in our detailed guide to Hadoop architecture

What Is the Relation Between Hadoop Clusters and HDFS?

HDFS (Hadoop Distributed File System) is the foundational storage system used by Hadoop clusters. It allows large datasets to be stored across multiple nodes in a distributed manner. HDFS divides data into blocks and distributes them across different nodes, providing redundancy and fault tolerance. The architecture of HDFS ensures data availability and reliability by replicating each block across multiple nodes, typically three, which allows the system to continue functioning even if some nodes fail.

Hadoop clusters leverage HDFS to efficiently store and manage massive amounts of unstructured data. The integration of HDFS with the computational power of Hadoop enables the parallel processing of large datasets. This combination of storage and processing capabilities is what makes Hadoop clusters highly effective for big data analytics, as it ensures data is always close to the computation, minimizing data transfer times and improving overall performance.
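
To illustrate how HDFS exposes this block and replication information, the commands below are a quick sketch (the file path is illustrative, and output details vary by Hadoop version):

# Report how a file is split into blocks and where each replica is stored
$ hdfs fsck /user/analyst/input/data.csv -files -blocks -locations
# Show the current replication factor of the file
$ hdfs dfs -stat %r /user/analyst/input/data.csv
# Change the replication factor and wait for re-replication to finish
$ hdfs dfs -setrep -w 3 /user/analyst/input/data.csv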

What Is the Impact of Cluster Size in Hadoop?

Cluster size in Hadoop refers to the number of nodes within a cluster along with the configuration of each node, including CPU cores, memory, and storage capacity. This size directly impacts the cluster’s ability to process and store data. 

Proper sizing is crucial for optimal performance; too few nodes or insufficient resources can lead to bottlenecks, while too many can increase costs unnecessarily. Determining the appropriate cluster size involves understanding the specific data processing needs and workload characteristics. Factors to consider include data volume, complexity of the tasks, expected response time, and future growth. 
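
As a rough, storage-driven illustration of this sizing exercise (all figures below are hypothetical placeholders, not recommendations), the node count is often estimated from raw data volume, the replication factor, working overhead, and usable capacity per node:

# 100 TB of raw data, 3x replication, ~25% overhead for intermediate data,
# and 40 TB of usable HDFS capacity per worker node (all hypothetical)
$ raw_tb=100 replication=3 overhead_pct=25 per_node_tb=40
$ echo $(( (raw_tb * replication * (100 + overhead_pct) / 100 + per_node_tb - 1) / per_node_tb ))
# -> 10, i.e. roughly 10 worker nodes for storage alone; CPU, memory, and growth may push this higher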

Pros and Cons of Hadoop Clusters 

A Hadoop cluster offers the following advantages:

  • Processing speed: Breaks down large computational tasks into smaller, manageable chunks that can be processed in parallel across multiple nodes. Data is stored on the same nodes where computation occurs, minimizing data movement. This distributed computing model reduces processing time compared to traditional single-node processing systems.  
  • Scalability: Can expand or contract according to the data processing demands. This flexibility is achieved through the addition or removal of nodes without disrupting cluster operations. As data volumes grow, new nodes can be integrated into the cluster, enhancing its capacity to store and process larger datasets.
  • High availability: Provides data replication and automatic failover mechanisms. Data stored in HDFS is replicated across multiple nodes, ensuring that in the event of a node failure, there is no loss of information, and data processing can continue uninterrupted. Hadoop’s ecosystem includes components like Zookeeper for managing cluster configuration information and YARN (Yet Another Resource Negotiator) for resource management, which together enable automatic failover. If a master node fails, these systems ensure a new leader is elected promptly from among the available nodes. 
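
As a brief illustration of the failover machinery, on a cluster configured for NameNode and ResourceManager high availability, the active/standby state of each service can be queried from the command line. The service IDs nn1, nn2, rm1, and rm2 below are placeholders defined by the cluster’s dfs.ha.namenodes and yarn.resourcemanager.ha.rm-ids settings:

# Check which NameNode is currently active
$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2
# Check which ResourceManager is currently active
$ yarn rmadmin -getServiceState rm1
$ yarn rmadmin -getServiceState rm2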

Hadoop clusters can also introduce some challenges:

  • Resource management: Involves efficiently allocating system resources like CPU, memory, and storage across all nodes to ensure optimal performance. Balancing the demands of concurrent jobs while maximizing resource utilization, without causing contention or bottlenecks, can be complex.
  • File size limits: HDFS is designed to handle large files, with default block sizes of 128 MB in recent versions (64 MB in older releases), which optimizes the system for streaming large datasets. However, this configuration poses challenges when dealing with a multitude of small files. Each file in HDFS occupies at least one block, and because each block’s metadata is stored in memory by the NameNode, an excessive number of small files can quickly overwhelm the NameNode’s memory capacity (see the rough estimate following this list).
  • Performance limits: Hadoop clusters face performance issues when dealing with complex data processing workloads:
    • The architecture relies on moving large volumes of data across the network for processing, which can lead to significant latency as the cluster size increases.
    • Hadoop relies on disk-based storage, where read/write operations can become bottlenecks in data-intensive applications. 
    • The MapReduce programming model is not optimized for all types of computations. Tasks that require iterative processing or real-time analytics often face inefficiencies due to the overhead of job setup and teardown between each iteration. 
    • The need to write intermediary results to disk between map and reduce phases introduces delays that impact performance, making certain analytical queries and data transformations less responsive than needed for time-sensitive applications.
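
To make the small-file concern concrete, a commonly cited rule of thumb is that each file and each block consumes on the order of 150 bytes of NameNode heap; the exact figure depends on the Hadoop version, path lengths, and replication, so treat the following as a rough sketch:

# 10 million small files, each occupying a single block
# => roughly 2 metadata objects per file (file + block) at ~150 bytes each
$ echo $(( 10000000 * 2 * 150 ))
# -> 3000000000 bytes, i.e. roughly 3 GB of NameNode heap for metadata alone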

Tutorial: Setting Up a Single Node Cluster in Hadoop

Prerequisites

Before setting up a Hadoop single-node cluster, ensure your system meets the necessary prerequisites:

  1. Java must be installed, as Hadoop is a Java-based framework. Verify the Java installation by executing java -version.
  2. If Java is not present, install it using your package manager. For Ubuntu Linux, use the following commands:
$ sudo apt-get update
$ sudo apt-get install default-jdk
  3. SSH is required for managing remote Hadoop daemons, and PDSH is recommended for more efficient management of ssh resources. Check SSH availability with ssh -V.
  4. If either is absent, install SSH and PDSH with $ sudo apt-get install ssh pdsh.

Download Hadoop

To obtain a Hadoop distribution, it’s essential to download a stable release from an Apache Download Mirror. This ensures you’re working with a version that’s been widely tested and is considered reliable for production or development purposes:

  1. Select a suitable version (in this tutorial we will use Hadoop 3.3.6), and proceed with the download process by visiting the official Apache Hadoop website or directly accessing one of the mirror sites.
  2. Once the download is complete, unpack the Hadoop distribution archive on your system. This can typically be done using file extraction tools available on your operating system or through command-line utilities like tar for Linux-based systems.  
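
On a Linux system, the download and extraction might look like the following. The URL below assumes Hadoop 3.3.6 from the main Apache download site; older releases move to the Apache archive, so adjust the URL to the mirror and version you actually selected:

$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
$ tar -xzf hadoop-3.3.6.tar.gz
$ cd hadoop-3.3.6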

Prepare the Hadoop Cluster for Launch

Before initiating the Hadoop cluster, it’s crucial to configure the environment properly: 

  1. Begin by editing the etc/hadoop/hadoop-env.sh file within your Hadoop distribution. This involves setting the Java environment variable to ensure Hadoop uses the correct Java version. 
  2. Add or modify the line specifying JAVA_HOME as follows:
export JAVA_HOME=/usr/java/latest

This line sets JAVA_HOME to the path where Java is installed on your system, which is necessary for Hadoop’s operation.
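
The exact path varies by system. On Ubuntu with the default-jdk package from the prerequisites, one way to discover the installation directory (a sketch; the resulting path is only an example) is:

$ readlink -f "$(which java)" | sed 's:/bin/java::'
# Use the printed directory, e.g. /usr/lib/jvm/java-11-openjdk-amd64, as the
# value of JAVA_HOME in etc/hadoop/hadoop-env.sh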

  3. Next, verify the setup by executing the Hadoop script with no arguments:
$ bin/hadoop

This command should display usage documentation for the hadoop script, confirming that Hadoop can be executed successfully on your system.  

Run Hadoop as a Standalone Operation 

In standalone mode, Hadoop runs as a single Java process without utilizing HDFS, making it suitable for debugging. To demonstrate, consider the example of running a MapReduce job locally:

  1. First, prepare your input by copying configuration files into an input directory:
$ mkdir input
$ cp etc/hadoop/*.xml input

This creates an input directory and copies XML configuration files into it. 

  2. Next, execute a MapReduce job using the grep example provided with Hadoop to search through the copied XML files:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar grep input output 'dfs[a-z.]+'

This command runs the grep example from the MapReduce examples jar, searching for occurrences of the regular expression 'dfs[a-z.]+' within the files located in input, and writing the results to an output directory.

  3. After completion, view the results with:
$ cat output/*

Run Hadoop as a Pseudo-Distributed Operation 

In pseudo-distributed mode, each Hadoop daemon operates in a separate Java process, simulating a distributed environment on a single node. For this mode:

  1. Configure etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml to specify the filesystem’s URI and set the replication factor to 1, as shown below:
<!-- etc/hadoop/core-site.xml -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

These configurations define the default filesystem URI and reduce the HDFS replication factor to one, suitable for a single-node setup.

  2. To enable seamless operation without requiring password entry for SSH connections to localhost (necessary for starting and stopping Hadoop daemons), generate an SSH key pair and authorize it:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

This setup facilitates passphraseless SSH access to localhost, streamlining daemon management commands.

  3. Next, initialize the HDFS file system by formatting it:
$ bin/hdfs namenode -format
  4. Then start the NameNode and DataNode daemons with:
$ sbin/start-dfs.sh

This command launches HDFS services, preparing the environment for data storage and processing. 
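
One quick way to confirm that the daemons actually started is the jps utility that ships with the JDK, which lists running Java processes (process IDs will differ on your machine):

$ jps
# Expect entries for NameNode, DataNode, and SecondaryNameNode,
# alongside the Jps process itself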

  5. To verify successful startup, access the NameNode web interface at http://localhost:9870/, which provides insights into cluster health and activity.
  6. Prepare for MapReduce execution by creating the necessary directories in HDFS and copying input files:
$ bin/hdfs dfs -mkdir -p /user/<username>/input
$ bin/hdfs dfs -put etc/hadoop/*.xml /user/<username>/input
  7. Run a MapReduce job with:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar grep /user/<username>/input /user/<username>/output 'dfs[a-z.]+'

This command searches through the XML configuration files in HDFS for matches to the regular expression 'dfs[a-z.]+', outputting results to /user/<username>/output.

  8. To review job outputs, fetch them from HDFS or directly view them using:
$ bin/hdfs dfs -cat /user/<username>/output/*
  9. Finally, shut down the HDFS daemons with $ sbin/stop-dfs.sh, concluding operations in pseudo-distributed mode.

Hadoop Optimization with Intel® Tiber™ App-Level Optimization

Optimizing Hadoop workloads with App-Level Optimization can significantly reduce costs through continuous and autonomous performance enhancements. By implementing such optimization solutions, Hadoop processing times are reduced, and pipeline throughput is increased, enabling organizations to meet their Service Level Agreements (SLAs) efficiently. This ensures that data operations within the Hadoop ecosystem are both cost-effective and capable of handling large volumes of data swiftly and reliably.
