Multi-node Hadoop 2 Cluster with ZooKeeper and an NFS Mount Point

In this part of the blog we will learn how to set up a fully distributed, automatic-failover HA cluster in the cloud (AWS). We will install Hadoop 2.6.5 on all nodes, which are running CentOS 6.5, and the setup demonstrates how to build a fully distributed Hadoop cluster with automatic failover using ZooKeeper.

Note: if you have not yet installed ZooKeeper, follow this link first.
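If ZooKeeper is already installed under /apps/zookeeper (as assumed later in this post), a quick way to confirm the quorum is up:
# Run on ha-nn01, ha-nn02 and ha-nn03; one node should report Mode: leader, the others Mode: follower.
/apps/zookeeper/bin/zkServer.sh status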

We are going to set up a cluster with 3 Namenode machines and 2 Datanodes (only two of them, nn01 and nn02, act as HA Namenodes; the third mainly completes the ZooKeeper quorum). Below is the list of machines in my cluster; you may of course have different IPs and hostnames. For more details on Namenodes and Datanodes, check out the Hadoop/HDFS documentation.

Note:  Before moving directly towards the installation and configuration part, read the pre-checks listed below.

1. Disable the firewall on all nodes.
2. Disable SELinux on all nodes.
3. Add the hostnames and their respective IP addresses for all nodes to the /etc/hosts file on every node (the commands for these pre-checks are sketched after the host list below).
namenode1 ha-nn01 192.168.10.101
namenode2 ha-nn02 192.168.10.102
namenode3 ha-nn03 192.168.10.103
datanode1 ha-dn01 192.168.10.104
datanode2 ha-dn02 192.168.10.105
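A minimal sketch of these pre-checks on CentOS 6.5 (run as root on every node; these commands are my own addition, not from the original post, so adapt them to your environment):
service iptables stop && chkconfig iptables off                 # 1. stop the firewall now and on boot
setenforce 0                                                    # 2. set SELinux to permissive for the running session
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config    #    ...and disable it permanently after a reboot
cat >> /etc/hosts <<'EOF'                                       # 3. hostname/IP mapping on every node
192.168.10.101  ha-nn01
192.168.10.102  ha-nn02
192.168.10.103  ha-nn03
192.168.10.104  ha-dn01
192.168.10.105  ha-dn02
EOF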

➤  Step 1 is to have Java 7 or Java 8 installed on all nodes.
Let’s install Java 8 by adding a custom repository from here.
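For reference, here is one hedged way to place a JDK 8 under /apps/java8 so it matches the paths used later in this post; the tarball name and extracted directory (jdk-8uXX / jdk1.8.0_XX) are placeholders for whatever build you download:
sudo mkdir -p /apps
sudo tar -xzf jdk-8uXX-linux-x64.tar.gz -C /apps       # extract the JDK tarball you downloaded
sudo ln -s /apps/jdk1.8.0_XX /apps/java8               # symlink so JAVA_HOME=/apps/java8 stays stable across updates
/apps/java8/bin/java -version                          # sanity check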

➤  Step 2 is to add a system group and user for hadoop on all nodes (CentOS uses groupadd/useradd):
sudo groupadd hadoop
sudo useradd -m -g hadoop hduser
sudo passwd hduser
sudo usermod -aG wheel hduser

➤  Step 3 is to make sure hduser can ssh to its own account without a password on all nodes.
Hadoop requires SSH access to manage its nodes (including the local node).
[hduser@ha-nn01]$ sudo su - hduser
[hduser@ha-nn01]$ ssh-keygen -t rsa -P "" -f /home/hduser/.ssh/id_rsa
[hduser@ha-nn01]$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
[hduser@ha-nn01]$ ssh localhost

➤  Step 4 is to copy the public key of hduser to all nodes so it can ssh to them without a password.
[hduser@ha-nn01]$ ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@192.168.10.102
[hduser@ha-nn01]$ ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@192.168.10.103
[hduser@ha-nn01]$ ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@192.168.10.104
[hduser@ha-nn01]$ ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@192.168.10.105
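A quick, optional sanity check (my own addition) to confirm passwordless SSH from ha-nn01 to every node in the cluster:
for host in ha-nn01 ha-nn02 ha-nn03 ha-dn01 ha-dn02; do
  ssh -o BatchMode=yes hduser@"$host" hostname || echo "passwordless SSH to $host failed"
done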

➤  Step 5 is to download Apache Hadoop 2.6.5 from here.
Download the package and place it on all nodes.
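The original download link is not preserved here; as of writing, the 2.6.5 tarball can also be fetched from the Apache archive (verify the checksum/signature yourself):
sudo mkdir -p /apps && cd /apps
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz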

➤  Step 6 is to copy the Hadoop distribution to the other nodes.
Once the download and extraction are done on ha-nn01, let’s copy the distribution to the other nodes.
[hduser@ha-nn01]$ cd /apps
[hduser@ha-nn01]$ wget
[hduser@ha-nn01]$ tar -xzvf hadoop-2.6.5.tar.gz
[hduser@ha-nn01]$ mv hadoop-2.6.5 hadoop
[hduser@ha-nn01]$ sudo chown -R hduser:hadoop hadoop/
scp /apps/hadoop-2.6.5.tar.gz hduser@192.168.10.102:/home/hduser/
scp /apps/hadoop-2.6.5.tar.gz hduser@192.168.10.103:/home/hduser/
scp /apps/hadoop-2.6.5.tar.gz hduser@192.168.10.104:/home/hduser/
scp /apps/hadoop-2.6.5.tar.gz hduser@192.168.10.105:/home/hduser/

➤  Step 7 is to unpack the Hadoop distribution on all nodes (repeat the tar/mv/chown commands from Step 6 on each node).

➤  Step 8 is to configure environment variables on all nodes.
Just create the file /etc/profile.d/hadoop.sh (vim /etc/profile.d/hadoop.sh) and paste in the lines given below…
#JAVA Path Setting
export JAVA_HOME=/apps/java8
export JRE_HOME=/apps/java8/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
#Zookeeper Path Setting
export ZOOKEEPER_HOME=/apps/zookeeper
export PATH=$PATH:/apps/zookeeper/bin
#Hadoop Path Setting
export HADOOP_PREFIX=/apps/hadoop
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
chmod 755 /etc/profile.d/hadoop.sh
source /etc/profile.d/hadoop.sh
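A quick way to confirm the variables are picked up in a new shell (my own addition):
source /etc/profile.d/hadoop.sh
echo "$JAVA_HOME" "$ZOOKEEPER_HOME" "$HADOOP_HOME" "$HADOOP_CONF_DIR"
which hadoop                     # should resolve to /apps/hadoop/bin/hadoop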

➤  Step 9 is to verify the Hadoop setup on all nodes.
Just type  hadoop version  in your console; you should get output similar to the following:
hduser:~$ hadoop version
Hadoop 2.6.5
Subversion
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /apps/hadoop/share/hadoop/common/hadoop-common-2.6.5.jar
Note: The Hadoop environment has to be set up on all nodes. Seeing this output means you have successfully installed the Hadoop instances.

➤  Step 10 is to configure the Namenode.
This step includes the part of the configuration which is specific only to the Namenodes.
Let’s create the namenode data and log directories.
mkdir -pv $HADOOP_PREFIX/namenode/data
mkdir -pv $HADOOP_PREFIX/logs
chown -R hduser:hadoop /apps/hadoop

➤ Hadoop Configuration for Automatic Failover
There are a couple of files that need to be configured to bring the Hadoop cluster with automatic failover up and running. All our configuration files reside in the /apps/hadoop/etc/hadoop/ directory.
Additionally, change the JAVA_HOME variable value in $HADOOP_CONF_DIR/hadoop-env.sh, and afterwards source the /etc/profile.d/hadoop.sh file.
export JAVA_HOME=/apps/java8
export HADOOP_LOG_DIR=/apps/hadoop/logs
We also need to tell the Namenode which Datanodes belong to the cluster (the hdfs-site.xml configuration itself is covered in Step 12 below). Open the  $HADOOP_CONF_DIR/slaves  file and replace its content with the following (make sure to use your hostnames instead of the ones below):
ha-dn01
ha-dn02
➤  Step 11 is to configure the Datanodes.
This step includes the part of the configuration which is specific only to the Datanodes.
Let’s create the datanode and log directories.
mkdir -pv $HADOOP_PREFIX/datanode/data
mkdir -pv $HADOOP_PREFIX/logs
➤  Step 12 is to apply the remaining configuration on both Namenodes and Datanodes.
This step includes configurations which should be applied on every node.
We need to edit the  hdfs-site.xml  file (vim $HADOOP_CONF_DIR/hdfs-site.xml) and replace its configuration content with the following (make sure to use your hostnames instead of the ones below):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///apps/hadoop/namenode/data</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///apps/hadoop/datanode/data</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.nameservices</name>
    <value>auto-ha</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.auto-ha</name>
    <value>nn01,nn02</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.auto-ha.nn01</name>
    <value>ha-nn01:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.auto-ha.nn01</name>
    <value>ha-nn01:50070</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.auto-ha.nn02</name>
    <value>ha-nn02:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.auto-ha.nn02</name>
    <value>ha-nn02:50070</value>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>file:///mnt/hadoop_share/</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hduser/.ssh/id_rsa</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled.auto-ha</name>
    <value>true</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>ha-nn01:2181,ha-nn02:2181,ha-nn03:2181</value>
  </property>
</configuration>
Open the  $HADOOP_CONF_DIR/core-site.xml  file and replace its configuration content with the following (make sure to use your nameservice ID instead of the one below):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://auto-ha</value>
  </property>
</configuration>
Open the  $HADOOP_CONF_DIR/mapred-site.xml  file and replace its configuration content with the following:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Open the  $HADOOP_CONF_DIR/yarn-site.xml  file and replace its configuration content with the following:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>auto-ha:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>auto-ha:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>auto-ha:8050</value>
  </property>
</configuration>
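Since Step 12 applies to every node, here is one hedged way to push the finished configuration files from ha-nn01 to the rest of the cluster (hostnames as listed at the top of this post):
for host in ha-nn02 ha-nn03 ha-dn01 ha-dn02; do
  scp $HADOOP_CONF_DIR/core-site.xml $HADOOP_CONF_DIR/hdfs-site.xml \
      $HADOOP_CONF_DIR/mapred-site.xml $HADOOP_CONF_DIR/yarn-site.xml \
      $HADOOP_CONF_DIR/slaves $HADOOP_CONF_DIR/hadoop-env.sh \
      hduser@"$host":/apps/hadoop/etc/hadoop/
done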
That’s it for the configuration of your Hadoop/YARN cluster!
NFS shared mount point: the shared edits directory is a permanently mounted NFS share on the Namenodes only (a hedged mount sketch follows after this paragraph).
We explicitly enable automatic failover for the nameservice ID ‘auto-ha’ by setting the property ‘dfs.ha.automatic-failover.enabled.auto-ha’ to ‘true’.
Finally, after completing the configuration part, we will initialize and start our automatic-failover Hadoop cluster.
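For the NFS shared edits directory mentioned above, a minimal sketch of the mount, assuming an NFS server already exports a share to the Namenodes (the host ha-nfs01 and export path below are hypothetical, not from the original post); run this on the Namenodes only:
sudo mkdir -p /mnt/hadoop_share
sudo mount -t nfs ha-nfs01:/exports/hadoop_share /mnt/hadoop_share
echo "ha-nfs01:/exports/hadoop_share /mnt/hadoop_share nfs defaults 0 0" | sudo tee -a /etc/fstab   # survive reboots
sudo chown hduser:hadoop /mnt/hadoop_share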

➤  Step 13: Initializing the HA state in ZooKeeper
ZooKeeper needs to initialize the required state; run the command below from any one of the Namenodes.
hduser@ha-nn01:~$ hdfs zkfc -formatZK
Formatting & Starting Namenodes
Both the namenodes need to be formatted to start HDFS filesystem.
Server: ha-nn01
[email protected]: ~$ hadoop namenode -format
[email protected]: ~$ hadoop-daemon. sh start namenode
hduser @ ha – nn01: ~ $ hadoop namenode – format
hduser @ ha – nn01: ~ $ hadoop – daemon. sh start namenode
Server: ha-nn02
[email protected]: ~$ hadoop namenode -bootstrapStandby
[email protected]: ~$ hadoop-daemon. sh start namenode
hduser @ ha – nn02: ~ $ hadoop namenode – bootstrapStandby
hduser @ ha – nn02: ~ $ hadoop – daemon. sh start namenode
Note: By default both Namenodes will be in the ‘standby’ state.
Starting the ZKFC Services
The ZooKeeper Failover Controller (ZKFC) service needs to be started in order to make one of the Namenodes ‘active’. Run the command below on both Namenodes.
hduser@ha-nn01:~$ hadoop-daemon.sh start zkfc
hduser@ha-nn02:~$ hadoop-daemon.sh start zkfc
Note: As soon as the ZKFC service is started, you can see that one of the Namenodes is in the ‘active’ state by using the commands below from any one of the Namenodes.
hduser@ha-nn01:~$ hdfs haadmin -getServiceState nn01
hduser@ha-nn01:~$ hdfs haadmin -getServiceState nn02
Starting the Datanodes
To start the Datanodes, run the command below from any one of the Namenodes.
hduser@ha-nn01:~$ hadoop-daemons.sh start datanode
Verifying Automatic Failover
To verify the automatic failover, we need to locate the active Namenode, either from the command line or by visiting the Namenode web interfaces.
Using the command line:
hduser@ha-nn01:~$ hdfs haadmin -getServiceState nn01
hduser@ha-nn01:~$ hdfs haadmin -getServiceState nn02
After locating the active Namenode, we can cause a failure on that node to trigger a failover automatically. You can fail the active Namenode by running the ‘jps’ command, determining the Namenode daemon’s pid, and killing it (see the sketch below). Within a few seconds the other Namenode will automatically become active.
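A hedged example of forcing a failover, assuming nn01 is currently active (the pid is a placeholder you read from the jps output):
hduser@ha-nn01:~$ jps | grep NameNode                   # note the NameNode pid
hduser@ha-nn01:~$ kill -9 <namenode-pid>                # simulate a crash of the active Namenode
hduser@ha-nn01:~$ hdfs haadmin -getServiceState nn02    # should report 'active' after a few seconds
hduser@ha-nn01:~$ hadoop-daemon.sh start namenode       # bring the failed Namenode back as standby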
Wait! What about running some MapReduce jobs?

➤  And finally, Step 14 is to run the MapReduce examples.
The default Hadoop installation comes with sample jobs that we can run. Let’s get the text file of  The Adventures of Sherlock Holmes, by Arthur Conan Doyle,  and run the  word count MapReduce job.
#### Word Count Example
wget
hdfs dfs -mkdir -p /samples/input
hdfs dfs -put pg1661.txt /samples/input
hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar wordcount /samples/input /samples/output
hdfs dfs -cat /samples/output/part* | less
#### PI calculation example with 16 maps and 100000 samples, run the following command:
hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar pi 16 100000
#### To see all examples run the following command:
hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar
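The original download link for the sample text is not preserved; any plain-text file will do, for example (Project Gutenberg URL assumed, not from the original post):
wget http://www.gutenberg.org/cache/epub/1661/pg1661.txt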
Almost forgot to mention: you can use the following web pages to view the state of your cluster in a browser.
Using the web interface:
http://public-ip-ha-nn01:50070
http://public-ip-ha-nn02:50070
Hope you all liked this blog; will soon share some more…
