This project demonstrates the implementation of a Hadoop cluster with High Availability (HA). Hadoop is a powerful framework for storing and processing large datasets, but its traditional architecture has a single point of failure: the NameNode. High Availability ensures continuous operation by enabling automatic failover between Active and Standby NameNodes, providing fault tolerance and improving reliability. This README outlines the steps taken to set up the HA cluster, its architecture, and the configuration details for both ZooKeeper and Hadoop.
Our Hadoop cluster is designed with HA in mind and consists of the following components:
- ZooKeeper Cluster (3 Nodes): A distributed coordination service managing leader election and failover for the NameNode.
- Active NameNode: The primary NameNode, responsible for managing the HDFS namespace and client interactions.
- Standby NameNode: A secondary NameNode that takes over if the Active NameNode fails.
- DataNodes (2 Nodes): Nodes that store data blocks. Replication ensures data availability in case of node failure.
The cluster is distributed across two physical machines to improve redundancy and fault tolerance:
- Machine 1: Hosts ZooKeeper and the Standby NameNode.
- Machine 2: Hosts the Active NameNode and the two DataNodes.
This design provides several benefits:
- Minimized Downtime: HA ensures the system remains operational during failures by automatically switching to the Standby NameNode.
- Increased Fault Tolerance: The cluster can withstand hardware/software failures without data loss.
- Improved Scalability: The architecture allows for easy addition of more DataNodes without affecting availability.
ZooKeeper plays a central role in Hadoop's HA architecture. Below is a summary of the ZooKeeper setup:
- Installation:
  - Install ZooKeeper on each of the three ZooKeeper nodes.
  - Create a dedicated directory for each instance (e.g., `Zookeeper_node1`).
- Configuration (`zoo.cfg`):
  - Set parameters such as `tickTime`, `dataDir`, `clientPort`, and `maxClientCnxns`, and define the server details for each node (a minimal example follows this list).
- Unique Server ID:
  - Each machine has a unique ID stored in `/tmp/zookeeper/myid`.
- Starting ZooKeeper:
  - Use `./zkServer.sh start` to launch ZooKeeper on each machine.
- Verification:
  - Use `echo stat | nc zk_ip 2181` to check ZooKeeper status.
  - Connect to the cluster using `./zkCli.sh`.
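As a reference, here is a minimal `zoo.cfg` sketch for one node. The IP addresses, the `initLimit`/`syncLimit` values, and the data paths are illustrative placeholders, not the exact values used in this repository:

```properties
# zoo.cfg (example values; adjust paths and IPs for your cluster)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper
clientPort=2181
maxClientCnxns=60
# One entry per ZooKeeper node: server.<id>=<host>:<peer_port>:<election_port>
server.1=zk1_ip:2888:3888
server.2=zk2_ip:2888:3888
server.3=zk3_ip:2888:3888
```

Each node's ID in `/tmp/zookeeper/myid` must match its `server.<id>` entry, e.g. on node 1: `echo 1 > /tmp/zookeeper/myid`.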
To enable HA in Hadoop, several configuration files (`hdfs-site.xml`, `core-site.xml`, and `yarn-site.xml`) were updated; a sketch of the key properties follows this list:
- Replication Factor: Set to 2 for block replication.
- NameNode Directories: Define paths for metadata and edit logs.
- Nameservices: Logical name for the HA setup.
- Failover Configuration: Automatic failover between Active and Standby NameNodes.
- Fencing Methods: Prevent split-brain scenarios by ensuring only one NameNode is active.
- Default FileSystem (fs.defaultFS): Points to the logical name of the HA HDFS cluster.
- JournalNode Directories: Directory paths for storing JournalNode logs.
- ZooKeeper Quorum: IP addresses and ports of ZooKeeper servers for managing failover.
- ResourceManager HA: Enables HA for YARN's ResourceManager.
- Cluster ID and ResourceManager IDs: Identify the YARN cluster and its active and standby ResourceManagers.
- ZooKeeper Quorum: Manages failover and state recovery for YARN using ZooKeeper.
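The snippet below is a minimal sketch of these properties. The nameservice ID `mycluster`, the NameNode IDs `nn1`/`nn2`, the host names, and the local directories are placeholders; refer to the configuration files in this repository for the actual values.

```xml
<!-- hdfs-site.xml (illustrative values) -->
<property><name>dfs.replication</name><value>2</value></property>
<property><name>dfs.nameservices</name><value>mycluster</value></property>
<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>namenode1_ip:8020</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>namenode2_ip:8020</value></property>
<property><name>dfs.namenode.shared.edits.dir</name><value>qjournal://jn1_ip:8485;jn2_ip:8485;jn3_ip:8485/mycluster</value></property>
<property><name>dfs.journalnode.edits.dir</name><value>/data/journalnode</value></property>
<property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>
<!-- sshfence also requires dfs.ha.fencing.ssh.private-key-files to be set -->
<property><name>dfs.ha.fencing.methods</name><value>sshfence</value></property>
<property><name>dfs.client.failover.proxy.provider.mycluster</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>

<!-- core-site.xml (illustrative values) -->
<property><name>fs.defaultFS</name><value>hdfs://mycluster</value></property>
<property><name>ha.zookeeper.quorum</name><value>zk1_ip:2181,zk2_ip:2181,zk3_ip:2181</value></property>

<!-- yarn-site.xml (illustrative values) -->
<property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
<property><name>yarn.resourcemanager.cluster-id</name><value>yarn-cluster</value></property>
<property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
<!-- on older Hadoop releases this key is yarn.resourcemanager.zk-address -->
<property><name>hadoop.zk.address</name><value>zk1_ip:2181,zk2_ip:2181,zk3_ip:2181</value></property>
```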
This GitHub repository contains 7 branches, each representing a different stage or feature of the Hadoop HA setup. You can explore these branches to understand different configurations and aspects of the cluster.
To see how the failover process works when the Active NameNode goes down and the Standby NameNode takes over, check out the following video:
- Start ZooKeeper Cluster: on each ZooKeeper node, navigate to the `bin/` directory and run `./zkServer.sh start`.
- Start Hadoop Cluster: once ZooKeeper is running, start the Active and Standby NameNodes, JournalNodes, and DataNodes by running the Hadoop start scripts `start-dfs.sh` and `start-yarn.sh`.
- Monitor Cluster: use the Hadoop web UI or commands like `hdfs dfsadmin -report` to monitor the health and status of the cluster.
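To confirm which NameNode currently holds the active role (for example, before and after simulating a failover), the `hdfs haadmin` command can be used; `nn1` and `nn2` below stand for whatever NameNode IDs are defined in `hdfs-site.xml`:

```bash
# Show the current HA state (active/standby) of each NameNode
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Overall HDFS health: live DataNodes, capacity, under-replicated blocks
hdfs dfsadmin -report
```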
By implementing High Availability in our Hadoop cluster, we significantly enhanced the system's fault tolerance, reliability, and scalability. The ZooKeeper-managed failover mechanism ensures seamless transitions between Active and Standby NameNodes, minimizing downtime during failures.
For detailed configurations and further steps, refer to the configuration files located in this repository.
For any questions or feedback, feel free to reach out to us:
- Ibrahim Ghali: [[email protected]]
- Hamza Zarai: [[email protected]]