Preparations to install Hadoop on CentOS 7

In the next few posts we will go through the detailed setup of a Hadoop environment. To really understand the setup, we will configure the different Cloudera Hadoop services one by one and examine their configuration and options.

Here we will focus on the initial installation and preparation of the Hadoop nodes. I'm a bit limited running this in my Proxmox laboratory, so we will use some under-spec virtual machines. They will be more than sufficient for the demonstration, but make sure to use properly sized machines and a properly sized cluster if you are building a production environment. You can find some good cluster sizing guides here and here.

Here are our future cluster specs:

Host Name          Role                                       CPU Count  RAM    Main Disk  Data Disk  IP Address
CentOSBLGTemplate  Template only, not used after the install  2          20 GB  50 GB      n/a
CNN1               Control & Name Node                        2          20 GB  50 GB      n/a
SNN2               Secondary Name Node                        2          20 GB  50 GB      n/a
DCN3               Data & Compute Node                        4          16 GB  50 GB      500 GB
DCN4               Data & Compute Node                        4          16 GB  50 GB      500 GB
DCN5               Data & Compute Node                        4          16 GB  50 GB      500 GB
DCN6               Data & Compute Node                        4          16 GB  50 GB      500 GB

It probably looks a bit strange to have fewer cores on the Name Nodes, but they will be dedicated to that role only and not used for computations. One of the main principles of Hadoop is to pair data with computation power - that is why the Data Nodes have more cores as well as more data space.

So I start by creating one VM in Proxmox. You can do this in any virtualization environment, or even on a single machine using VMware Player, VirtualBox or whatever you fancy. I will use that first VM as a template for all the others, to speed up the process. Once the machine is up and running we will do a Minimal Install of CentOS 7, configure all options and then clone it to create the whole cluster.

This is really a default CentOS 7 installation, using this iso from the CentOS mirrors list. I took all the default settings, let the system partition the only available disk and made a static IPv4 network configuration. Note that I configured the network during the install, so I have ssh access to the machine right away.

Once the template is up and running, we can ssh to it and start doing the basic configuration. I'm using this template to set all the common configuration, so I don't have to do it separately on each node. Once that's done we will just clone it off to each node.

List of actions to be done on the template

Define the domain and DNS servers:

#nano /etc/resolv.conf

search datasourcing.eu
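For reference, a complete resolv.conf with a name server entry might look like the sketch below; the nameserver address is a hypothetical placeholder, so use the DNS server of your own network:

```
search datasourcing.eu
nameserver 192.168.10.1
```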

Set hostname

#hostnamectl set-hostname CentOSBLGTemplate

Disable SELinux by setting SELINUX=disabled in the config file:

#nano /etc/selinux/config

Disable Firewall

#systemctl disable firewalld
#systemctl stop firewalld
#systemctl status firewalld

Install Perl & SSH clients

#yum -y install perl openssh-clients

Install and configure JAVA

#cd ~
#wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u51-b16/jdk-8u51-linux-x64.tar.gz

#tar -zxvf jdk-8u51-linux-x64.tar.gz
#mv jdk1.8.0_51 /usr/

#/usr/sbin/alternatives --install /usr/bin/java java /usr/jdk1.8.0_51/bin/java 2
#/usr/sbin/alternatives --config java

#echo 'export JAVA_HOME=/usr/jdk1.8.0_51/' >> /etc/profile
#echo 'export JRE_HOME=/usr/jdk1.8.0_51/jre/' >> /etc/profile
#echo 'export PATH=$JAVA_HOME/bin:$PATH' >> /etc/profile
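When appending these exports, quoting matters: double quotes would expand $JAVA_HOME and $PATH at echo time (often to an empty string), while single quotes write the literal text so every login shell evaluates it later. A small sketch of this against a temp file instead of the real /etc/profile:

```shell
# Use a temp file as a stand-in for /etc/profile in this sketch
profile=$(mktemp)

# Single quotes keep the variable references literal in the file:
echo 'export JAVA_HOME=/usr/jdk1.8.0_51/' >> "$profile"
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> "$profile"

# A fresh shell sourcing the file resolves the variables correctly:
resolved=$(bash -c "source '$profile'; echo \$JAVA_HOME")
echo "$resolved"    # → /usr/jdk1.8.0_51/
```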

This step is optional if you plan to use the Cloudera Manager. It is also not sufficient on its own for the manual installation: we will additionally have to add JAVA_HOME to /etc/default/bigtop-utils, which will be installed later.

List all host names in /etc/hosts

#nano /etc/hosts

CentOSBLGTemplate CentOSBLGTemplate.datasourcing.eu
CNN1 CNN1.datasourcing.eu
SNN2 SNN2.datasourcing.eu
DCN3 DCN3.datasourcing.eu
DCN4 DCN4.datasourcing.eu
DCN5 DCN5.datasourcing.eu
DCN6 DCN6.datasourcing.eu
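If you prefer to script this, the entries can be generated in a loop. The 192.168.10.x addresses below are hypothetical placeholders - substitute the real IPs from your own network plan. The sketch writes to a temp file; on the template you would append to /etc/hosts instead:

```shell
# Stand-in for /etc/hosts in this sketch
hostsfile=$(mktemp)

# One line per node: <hypothetical IP>  <short name>  <FQDN>
i=11
for h in CNN1 SNN2 DCN3 DCN4 DCN5 DCN6; do
  printf '192.168.10.%d  %s  %s.datasourcing.eu\n' "$i" "$h" "$h" >> "$hostsfile"
  i=$((i + 1))
done

cat "$hostsfile"
```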

Generate SSH keys for passwordless login between the machines. This step is not mandatory, but it makes your life a bit easier. With the VM template, the commands below are enough, since the key pair gets cloned to every node. If you are using real machines, you will have to copy the id_rsa.pub file to every node manually.

# ssh-keygen

Disable host key checking - this is relatively bad practice; you can simply skip it, and ssh will ask you once per machine whether you trust the host when you first connect.

#nano /etc/ssh/ssh_config
StrictHostKeyChecking no

An important step is to setup time synchronization:

#yum install -y ntp
#systemctl start ntpd
#systemctl enable ntpd
#ntpq -p

Update all packages on the system

#yum -y update

Set kernel parameters to minimize swapping and tune network behavior:

#echo 'vm.swappiness = 0' >> /etc/sysctl.conf
#echo 'net.ipv4.tcp_retries2 = 2' >> /etc/sysctl.conf
#echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf
#echo 'net.core.somaxconn = 4096' >> /etc/sysctl.conf
#sysctl -p
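A quick sketch for reading the four parameters back, to confirm that sysctl -p applied them (it prints "unreadable" where a value cannot be queried, e.g. inside a container):

```shell
# The four parameters we just appended to /etc/sysctl.conf
params="vm.swappiness net.ipv4.tcp_retries2 vm.overcommit_memory net.core.somaxconn"

# Query each one; fall back to a marker if sysctl is unavailable
report=$(for p in $params; do
  printf '%s = %s\n' "$p" "$(sysctl -n "$p" 2>/dev/null || echo unreadable)"
done)

echo "$report"
```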

The last step is to reboot and make sure all settings persist. Check each of the above once the machine comes back online.


Individual machine configuration

Once we have verified that all options are correct, we can start cloning the machine to create all the nodes. Please note that at first they will all have the same hostname and IP address, so we will need to change that. I will also change the hardware setup of the Data Nodes to give them more CPUs and extra disk space, and to reduce the RAM.

Update the IP address on each machine. nmcli tells us which interface is in use. Then we update the network script for it according to the host table above. We also update the hostname as we did before:

#nmcli dev status
#nano /etc/sysconfig/network-scripts/ifcfg-eth0

#hostnamectl set-hostname CNN1

We need to do this on each machine according to the host table above. In some cases, when you clone the machine, the old interface is lost and a new network interface is added that has to be enabled first.
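For reference, a minimal static configuration in ifcfg-eth0 for one node might look like the sketch below. All addresses are hypothetical placeholders - use the values from your own network plan, and the IP assigned to that particular node:

```
TYPE=Ethernet
BOOTPROTO=static
NAME=eth0
DEVICE=eth0
ONBOOT=yes
IPADDR=192.168.10.11
PREFIX=24
GATEWAY=192.168.10.1
DNS1=192.168.10.1
```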

We also need to initialize the new hard drives of the Data Nodes:

#fdisk -l
#fdisk /dev/vdb
#mkfs.ext4 /dev/vdb1
#mkdir /mnt/dn
#chmod -R 777 /mnt/dn
#echo '/dev/vdb1 /mnt/dn ext4 defaults 0 0' >> /etc/fstab
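Before rebooting, it is worth sanity-checking the fstab entry, since a typo there can prevent the machine from coming back up cleanly. A small sketch against a temp copy instead of the real /etc/fstab:

```shell
# Temp file as a stand-in for /etc/fstab in this sketch
fstab=$(mktemp)
echo '/dev/vdb1 /mnt/dn ext4 defaults 0 0' >> "$fstab"

# Pull the device and filesystem type back out of the entry for /mnt/dn:
entry=$(awk '$2 == "/mnt/dn" {print $1, $3}' "$fstab")
echo "$entry"    # → /dev/vdb1 ext4
```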

Check the disk space and mount information after the reboot. Note the last line of /etc/fstab - this is where our Data Node will store HDFS data. CNN1 and SNN2 don't have this folder and mount.

As an optional step I will configure the Cloudera repo, to be able to yum install Cloudera packages:

#yum install -y wget
#cd /etc/yum.repos.d/
#wget http://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/cloudera-cdh5.repo


Before we wrap it up, we need to make sure that all machines can ping each other by hostname, that all hardware resources are properly mapped, and that all packages are updated to their latest versions. Once this is done, we can say our cluster is ready for the next steps.
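The connectivity check can be sketched as a small loop - run it from each machine, with the node names as listed in /etc/hosts earlier:

```shell
# Ping every node once to confirm name resolution and connectivity
results=$(for h in CNN1 SNN2 DCN3 DCN4 DCN5 DCN6; do
  if ping -c 1 -W 2 "$h" >/dev/null 2>&1; then
    echo "$h reachable"
  else
    echo "$h UNREACHABLE"
  fi
done)

echo "$results"
```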

From here we can continue with the standard installation using the Cloudera Manager or just install different Hadoop Services manually.

Stay tuned for our next posts to learn more!