Manual installation of Hadoop HDFS on CentOS 7 Cluster

Following the previous post where we prepared six virtual machines for a Hadoop installation, we will now install HDFS manually, using the latest Cloudera Repository and yum install.

The first step is to install the HDFS NameNode and DataNode services and to create the required folders with the right permissions. The NameNode folders store the filesystem metadata (the fsimage checkpoints and edit logs), while the DataNode folders store the actual user data.

On CNN1 we will install the primary NameNode service:

#yum install hadoop-hdfs-namenode -y
#mkdir /mnt/nn
#mkdir /mnt/nn/hadoop-hdfs
#chown -R hdfs:hadoop /mnt/nn/hadoop-hdfs

The installation of the hadoop-hdfs-namenode service creates the yarn, hdfs and mapred users, all belonging to the hadoop group. The same users will be created automatically on the other nodes as well.
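
To double-check that these accounts and the group are in place, something like the standard user tools should be enough:

#id hdfs
#id yarn
#id mapred
#getent group hadoop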

On SNN2 we will install the SecondaryNameNode service. In this standard setup, the SecondaryNameNode only maintains the fsimage file by performing periodic checkpoints; it will not act as a fail-over NameNode.

#yum install hadoop-hdfs-secondarynamenode -y
#mkdir /mnt/nn
#mkdir /mnt/nn/hadoop-hdfs
#chown -R hdfs:hadoop /mnt/nn/hadoop-hdfs

On DCN3-6 we will install the DataNode services. Note that /mnt/dn/ is where the extra disk is mounted, which provides the required HDFS data space:

#yum install hadoop-hdfs-datanode -y
#mkdir /mnt/dn/data
#chown -R hdfs:hadoop /mnt/dn/data
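
Before moving on, it is worth confirming that the data disk is really mounted under /mnt/dn and that the new folder is owned by hdfs:hadoop - a quick sanity check could look like this:

#df -h /mnt/dn
#ls -ld /mnt/dn/data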

The second step is to create and distribute the minimum HDFS configuration.

core-site.xml is used by all nodes to discover and contact the primary NameNode:

<configuration>
  <property>
   <name>fs.default.name</name>
   <value>hdfs://cnn1:8020</value>
  </property>
</configuration>

hdfs-site.xml defines the folder locations for the NameNode and the DataNodes, matching the directories we created above:

<configuration>
  <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:///mnt/nn/hadoop-hdfs/cache/hdfs/dfs/name</value>
  </property>
  <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:///mnt/dn/data</value>
  </property>
</configuration>

Both files are exactly the same on every node, so the following commands need to be executed on all of them. I'm using my own hosting to share the files with the servers - you can use the same files if needed.

#cd /etc/hadoop/conf
#wget -N http://datasourcing.eu/files/core-site.xml
#wget -N http://datasourcing.eu/files/hdfs-site.xml
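
A quick listing should confirm that both files landed in the configuration directory:

#ls -l /etc/hadoop/conf/*-site.xml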

The third step is to export JAVA_HOME in /etc/default/bigtop-utils, as it is used by the service scripts. You may skip this step if bigtop-utils has already detected the JDK location. This should be done on all nodes:

#echo "export JAVA_HOME=/usr/jdk1.8.0_51/" >> /etc/default/bigtop-utils
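
To verify that the service scripts will pick up the right JDK (the /usr/jdk1.8.0_51/ path is the one from my setup - adjust it to yours), you can source the file and call the java binary directly:

#source /etc/default/bigtop-utils
#$JAVA_HOME/bin/java -version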

The fourth step is to format HDFS. We do that as the hdfs user, since it is the HDFS superuser. This should be done on CNN1 only:

#sudo -u hdfs hdfs namenode -format
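
If the format went through, the NameNode directory we configured in hdfs-site.xml should now contain an initial fsimage and a VERSION file - something along these lines should show them:

#ls -l /mnt/nn/hadoop-hdfs/cache/hdfs/dfs/name/current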

The last step is to start all services using the /etc/init.d/ scripts and to check the logs:

On CNN1

#service hadoop-hdfs-namenode start
#less /var/log/hadoop-hdfs/hadoop-hdfs-namenode-cnn1.log

On SNN2

#service hadoop-hdfs-secondarynamenode start
#less /var/log/hadoop-hdfs/hadoop-hdfs-secondarynamenode-snn2.log

On DCN3-6

#service hadoop-hdfs-datanode start
#less /var/log/hadoop-hdfs/hadoop-hdfs-datanode-dcn3.log
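
Once everything is started, an easy way to see whether the DataNodes have registered with the NameNode is the dfsadmin report, run from any node as the hdfs user:

#sudo -u hdfs hdfs dfsadmin -report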

Once the primary NameNode is up and running, it exposes a web interface on port 50070. It shows general information about your HDFS cluster, the available space, the DataNodes and even a basic HDFS browser. I'm accessing it from my local browser, and since my PC does not know the cluster host names, I'm using the IP address. Make sure the server and the port are reachable from your machine; otherwise try a browser on the node itself.
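
From inside the cluster, where the host names resolve, a simple curl against the NameNode should already return the status page (assuming the default port 50070):

#curl -s -o /dev/null -w "%{http_code}\n" http://cnn1:50070/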

I could see my DataNodes starting up and becoming available to the NameNode via this web interface.

HDFS is also available via the local hadoop client on any of the machines - I can create a new folder, upload a file, check its size and see the overall space usage:

#sudo -u hdfs hadoop fs -mkdir /test
#sudo -u hdfs hadoop fs -put testfile.txt /test/
#sudo -u hdfs hadoop fs -du -h /test/
#sudo -u hdfs hadoop fs -df -h /
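
To read the data back and confirm that the upload really made it into HDFS, listing the folder and printing the file should be enough:

#sudo -u hdfs hadoop fs -ls /test/
#sudo -u hdfs hadoop fs -cat /test/testfile.txt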

It works!

HDFS is up and running, and no extra Hadoop services are required - it's just the base storage, with no MapReduce or YARN running. Nothing really fancy so far, but in the next post we will experiment with NameNode High Availability and address the only weak spot of this otherwise amazingly robust system!

Stay tuned for our next posts to learn more!