In the next few posts we will go through the detailed setup of a Hadoop environment. To really understand the setup, we will configure the different Cloudera Hadoop services one by one and examine their configuration and options.
Here we will focus on the initial installation and preparation of the Hadoop nodes. I'm a bit limited running this in my ProxMox laboratory, so we will use some under-spec virtual machines. They will be more than sufficient for the demonstration, but make sure to use properly sized machines and a properly sized cluster if you are building a production environment. You can find some good cluster sizing guides here and here.
Here are our future cluster specs:
| Host Name | Role | CPU Count | RAM | Main Disk | Data Disk | IP Address |
|---|---|---|---|---|---|---|
| CentOSBLGTemplate | Template only, not used after the install | 2 | 20 GB | 50 GB | n/a | 192.168.60.60 |
| CNN1 | Control & Name Node | 2 | 20 GB | 50 GB | n/a | 192.168.60.61 |
| SNN2 | Secondary Name Node | 2 | 20 GB | 50 GB | n/a | 192.168.60.62 |
| DCN3 | Data & Compute Node | 4 | 16 GB | 50 GB | 500 GB | 192.168.60.63 |
| DCN4 | Data & Compute Node | 4 | 16 GB | 50 GB | 500 GB | 192.168.60.64 |
| DCN5 | Data & Compute Node | 4 | 16 GB | 50 GB | 500 GB | 192.168.60.65 |
| DCN6 | Data & Compute Node | 4 | 16 GB | 50 GB | 500 GB | 192.168.60.66 |
It probably looks a bit strange to have fewer cores on the Name Nodes, but they will be dedicated to that role only and not used for computations. One of the main principles of Hadoop is to pair data with computation power - that is why the Data Nodes have more cores as well as more data space.
So I start by creating one VM in ProxMox. You can do this in any virtualization environment, or even on a single machine using VMware Player, VirtualBox or whatever you fancy. I will use that first VM as a template for all the others, to speed up the process. Once the machine is up and running, we will do a Minimal Install of CentOS 7, configure all options and then clone it to create the whole cluster.
This is really a default CentOS 7 installation, using this iso from the CentOS mirrors list. I took all the default settings, let the system partition the only available disk and made a static IPv4 network configuration. Note that I configured the network during the install, so I have SSH access to the machine right away.
Once the template is up and running, we can SSH to it and start doing the basic configuration. I'm using this template to set all the common configuration, so I don't have to do it separately on each node. When it's done, we will just clone it off to each node.
List of actions to be done on the template
Define the domain and DNS servers:
```
#nano /etc/resolv.conf
search datasourcing.eu
nameserver 18.104.22.168
```

Set the host name:

```
#hostnamectl set-hostname CentOSBLGTemplate
```
Disable SELinux

```
#nano /etc/selinux/config
SELINUX=disabled
```

Disable the firewall

```
#systemctl disable firewalld
#systemctl stop firewalld
#systemctl status firewalld
```
Install Perl & SSH clients
```
#yum -y install perl openssh-clients
```
Install and configure JAVA
```
#cd ~
#wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u51-b16/jdk-8u51-linux-x64.tar.gz
#tar -zxvf jdk-8u51-linux-x64.tar.gz
#mv jdk1.8.0_51 /usr/
#/usr/sbin/alternatives --install /usr/bin/java java /usr/jdk1.8.0_51/bin/java 2
#/usr/sbin/alternatives --config java
#echo "export JAVA_HOME=/usr/jdk1.8.0_51/" >> /etc/profile
#echo "export JRE_HOME=/usr/jdk1.8.0_51/jre/" >> /etc/profile
#echo 'export PATH=$JAVA_HOME/bin:$PATH' >> /etc/profile
```

Note the single quotes on the last line: they keep `$JAVA_HOME` and `$PATH` from being expanded at echo time, so the variables are resolved when /etc/profile is sourced.
This step is optional and mainly relevant if you plan to use the Cloudera Manager. It is also not sufficient for the manual installation, as we will have to add JAVA_HOME to /etc/default/bigtop-utils, which will be installed later.
List all host names in /etc/hosts
```
#nano /etc/hosts
192.168.60.60 CentOSBLGTemplate CentOSBLGTemplate.datasourcing.eu
192.168.60.61 CNN1 CNN1.datasourcing.eu
192.168.60.62 SNN2 SNN2.datasourcing.eu
192.168.60.63 DCN3 DCN3.datasourcing.eu
192.168.60.64 DCN4 DCN4.datasourcing.eu
192.168.60.65 DCN5 DCN5.datasourcing.eu
192.168.60.66 DCN6 DCN6.datasourcing.eu
```
Generate SSH keys for passwordless login between the machines. This step is not mandatory, but it makes your life a bit easier. With the VM template approach, generating one key pair on the template and authorizing it locally is enough, because every clone inherits it. If you are using real machines, you will have to copy the id_rsa.pub file across all nodes manually.
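A minimal sketch of that key generation, assuming the root account is used on all nodes as in the rest of this post:

```shell
# Generate a passwordless RSA key pair and authorize it for the local account.
# Every node cloned from this template inherits both the key and the
# authorized_keys entry, so the clones trust each other out of the box.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

On real (non-cloned) machines you would instead run `ssh-copy-id` from each node to every other node.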
Disable host key checking - a relatively bad practice; you can just skip it, in which case SSH will ask once per machine whether you trust the host when you first try to connect.

```
#nano /etc/ssh/ssh_config
StrictHostKeyChecking no
```
An important step is to setup time synchronization:
```
#yum install -y ntp
#systemctl start ntpd
#systemctl enable ntpd
#ntpq -p
```
Update all packages on the system
```
#yum -y update
```
Set kernel parameters to minimize swapping and tune networking:
```
#echo 'vm.swappiness = 0' >> /etc/sysctl.conf
#echo 'net.ipv4.tcp_retries2 = 2' >> /etc/sysctl.conf
#echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf
#echo 'net.core.somaxconn = 4096' >> /etc/sysctl.conf
#sysctl -p
```
The last step is to reboot and make sure all settings persist. Check each of the above once the machine comes back online.
Individual machine configuration
Once we have verified that all options are correct, we can start cloning the machine to create all the nodes. Please note that they will all now have the same hostname and IP address, so we will need to change that. I will also change the hardware setup of the Data Nodes to give them more CPUs and an extra data disk, and to reduce the RAM.
Update the IP address on each machine. `nmcli` tells us which interface is in use. Then we update the network script for that interface according to the host table above, and update the host name as we did before:
```
#nmcli dev status
#nano /etc/sysconfig/network-scripts/ifcfg-eth0
IPADDR="192.168.60.61"
#hostnamectl set-hostname CNN1
```
We need to do this on each machine according to the host table above. In some cases, when you clone the machine, the old interface is lost and a new network interface is added that has to be enabled first.
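The per-node edit can also be scripted. The `set_node_ip` helper below is hypothetical (not part of the original setup) and assumes the config file reported by `nmcli` is `ifcfg-eth0`; substitute each node's name and IP from the host table:

```shell
# Rewrite the IPADDR line of a network script in place.
set_node_ip() {
  sed -i "s|^IPADDR=.*|IPADDR=\"$2\"|" "$1"
}

IFCFG=/etc/sysconfig/network-scripts/ifcfg-eth0
if [ -f "$IFCFG" ]; then
  set_node_ip "$IFCFG" 192.168.60.61    # IP from the host table, per node
  hostnamectl set-hostname CNN1         # name from the host table, per node
fi
```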
We also need to initialize the new hard drives of the Data Nodes:
```
#fdisk -l
#fdisk /dev/vdb
#mkfs.ext4 /dev/vdb1
#mkdir /mnt/dn
#chmod -R 777 /mnt/dn
#echo '/dev/vdb1 /mnt/dn ext4 defaults 0 0' >> /etc/fstab
#reboot
```
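Note that `fdisk /dev/vdb` opens an interactive session: create one new primary partition spanning the disk (`n`, `p`, `1`, accept both defaults, then `w`). A non-interactive sketch of the same preparation, assuming `/dev/vdb` really is the empty 500 GB data disk, could be:

```shell
# Guard against running this on a machine without the data disk (CNN1/SNN2).
if [ -b /dev/vdb ]; then
  # Feed fdisk the same keystrokes as the interactive session:
  # n (new), p (primary), 1 (partition number), two defaults, w (write).
  printf 'n\np\n1\n\n\nw\n' | fdisk /dev/vdb
  mkfs.ext4 /dev/vdb1
  mkdir -p /mnt/dn
  chmod -R 777 /mnt/dn
  echo '/dev/vdb1 /mnt/dn ext4 defaults 0 0' >> /etc/fstab
fi
```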
After the reboot, check the disk space and mount information (e.g. with `df -h`). Note the /mnt/dn mount - this is where our Data Nodes will store HDFS data. CNN1 and SNN2 don't have this folder and mount.
As an optional step I will configure the Cloudera repo, to be able to yum install Cloudera packages:

```
#yum install -y wget
#cd /etc/yum.repos.d/
#wget http://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/cloudera-cdh5.repo
```

Note that the repo path has to match your OS major version - redhat/7 for CentOS 7.
Before we wrap up, we need to make sure that all machines can ping each other by hostname, that all hardware resources are properly mapped and that all packages are up to date. Once this is done, our cluster is ready for the next steps.
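As a quick, hypothetical sanity check for the name resolution part (it only inspects /etc/hosts; a full check would also ping every node from every other node):

```shell
# Report which cluster hostnames are present in a hosts file.
check_hosts() {
  file=$1
  for h in CNN1 SNN2 DCN3 DCN4 DCN5 DCN6; do
    if grep -qw "$h" "$file"; then
      echo "$h: ok"
    else
      echo "$h: missing"
    fi
  done
}

check_hosts /etc/hosts
```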
From here we can continue with the standard installation using the Cloudera Manager or just install different Hadoop Services manually.