CONFIGURING HADOOP CLUSTER USING ANSIBLE

Sarvjeet Jain
7 min read · Mar 19, 2021

WELCOME

Welcome, everyone! In this blog I will cover how we can “CONFIGURE A HADOOP CLUSTER USING ANSIBLE”.

Before starting the practical part, let’s first discuss what Ansible and Hadoop are.

ANSIBLE:-

Ansible is an open-source software provisioning, configuration management, and application-deployment tool enabling infrastructure as code. It runs on many Unix-like systems, and can configure both Unix-like systems and Microsoft Windows. It includes its own declarative language to describe system configuration. Ansible was written by Michael DeHaan and acquired by Red Hat in 2015. Ansible is agentless, temporarily connecting remotely via SSH or Windows Remote Management (which allows remote PowerShell execution) to do its tasks.

In this practical we are going to write an Ansible playbook, in YAML format, to configure Hadoop.

HADOOP:-

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

It works on a Master-Slave architecture: it includes one NAME NODE (the master node of the cluster) and several DATA NODES (the slave nodes of the cluster). In this practical we are going to configure one Name Node and one Data Node.

So, let’s start our practical part:-

NOTE:- I’M USING RED HAT 8 AS MY ANSIBLE CONTROLLER NODE AND ALSO FOR THE TARGET NODES (NAME NODE AND DATA NODE).

STEP-1:- First we have to install Ansible on our Controller Node. Ansible is built on top of Python, so we install it with “pip”. Command for installation:-

pip3 install ansible

We also have to install the “sshpass” package. sshpass is needed so that Ansible can respond to the password prompt for SSH connections. Commands:-

dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm

First install the EPEL release package if you are using the same OS (Red Hat 8). Then run:-

yum install sshpass

Ansible is now installed. How can we check? Run this command for confirmation:-

ansible --version

STEP-2:- Now we are going to configure Ansible. First we create the Ansible configuration directory “/etc/ansible”, and inside it the configuration file “ansible.cfg”. In this file we tell Ansible where the inventory file we will create on our workstation is located. Commands:-

mkdir /etc/ansible

vim /etc/ansible/ansible.cfg
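A minimal sketch of what ansible.cfg could contain, assuming the inventory file will be /etc/ansible/ip.txt (created in the next step); the path is an assumption, so adjust it to wherever you actually keep the inventory. host_key_checking = False just avoids the interactive host-key prompt:

[defaults]
inventory = /etc/ansible/ip.txt
host_key_checking = False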

STEP-3:- Now we have to create the Ansible inventory file in the same directory that we mentioned in the configuration file. Command:-

vim ip.txt

Inside this file, we put the IP of the Ansible target node and the user credentials that Ansible will use to log in to it. Here I’m first adding the IP of my Hadoop Name Node.
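For illustration, an inventory entry could look like the sketch below, assuming password-based SSH as root; the group name, IP address, and password are placeholders, not the values from my setup:

[namenode]
192.168.1.10 ansible_user=root ansible_ssh_pass=redhat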

Finally, Ansible is configured. We can check the connectivity between the Controller Node and the Target Nodes using this command:-

ansible all -m ping

Here you can see that both of my nodes have connectivity.

STEP-4:- Now we have to create an Ansible playbook for configuring the Name Node of my Hadoop cluster. Command:-

vim namenode.yml

Let’s discuss all the modules that I have used in my playbook (a sketch of the full playbook follows this list):-

1- vars_prompt:- If you want your playbook to prompt the user for certain input, add a ‘vars_prompt’ section. Here I have declared three variables, named “ip”, “port”, and “directory”.

##ip:- Takes the Name Node IP as input.

##port:- The port number on which the Hadoop cluster will run.

##directory:- The Name Node directory.

2- copy:- This module copies the Hadoop and Java JDK rpm files, already present on my Controller Node, to the target node.

3- shell:- Runs shell commands inside the Name Node; these commands install Hadoop and Java on the Name Node.

4- file:- This module creates a directory inside the Name Node that will act as the Hadoop Name Node directory. Here I have used the variable “directory”.

5- blockinfile:- This module configures my Hadoop Name Node by updating both Hadoop configuration files, “core-site.xml” and “hdfs-site.xml”. It inserts the properties I have mentioned after the word <configuration>.

6- shell (again):- Used for formatting my Name Node directory and starting the Name Node service.
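A minimal sketch of what namenode.yml could look like, based on the modules above. The host group, rpm file names, the /etc/hadoop config paths, and the exact property names are assumptions typical of a Hadoop 1.x rpm install, not a copy of my actual playbook; adapt them to your own versions and paths:

- hosts: namenode
  vars_prompt:
    - name: ip
      prompt: "Enter the Name Node IP"
      private: no
    - name: port
      prompt: "Enter the port for the Hadoop cluster (e.g. 9001)"
      private: no
    - name: directory
      prompt: "Enter the Name Node directory (e.g. /nn)"
      private: no
  tasks:
    # copy the Hadoop and JDK rpm files already present on the Controller Node
    - copy:
        src: /root/hadoop-1.2.1-1.x86_64.rpm     # assumed file name and path
        dest: /root/
    - copy:
        src: /root/jdk-8u171-linux-x64.rpm       # assumed file name and path
        dest: /root/
    # install Java first, then Hadoop (--force skips the JDK dependency check)
    - shell: rpm -ivh /root/jdk-8u171-linux-x64.rpm
      ignore_errors: yes
    - shell: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force
      ignore_errors: yes
    # create the Name Node directory
    - file:
        path: "{{ directory }}"
        state: directory
    # insert the Name Node properties right after the <configuration> tag;
    # the XML-comment marker keeps the files valid XML
    - blockinfile:
        path: /etc/hadoop/hdfs-site.xml
        insertafter: "<configuration>"
        marker: "<!-- {mark} ANSIBLE MANAGED BLOCK -->"
        block: |
          <property>
            <name>dfs.name.dir</name>
            <value>{{ directory }}</value>
          </property>
    - blockinfile:
        path: /etc/hadoop/core-site.xml
        insertafter: "<configuration>"
        marker: "<!-- {mark} ANSIBLE MANAGED BLOCK -->"
        block: |
          <property>
            <name>fs.default.name</name>
            <value>hdfs://{{ ip }}:{{ port }}</value>
          </property>
    # format the Name Node directory and start the Name Node service
    - shell: echo Y | hadoop namenode -format
    - shell: hadoop-daemon.sh start namenode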

So finally, the Playbook is created.

STEP-5:- Now we have to run our Ansible playbook. Command to run the playbook:-

ansible-playbook namenode.yml

After running it, the playbook first asked me for the IP of my Name Node, then the port number, and finally the directory name.

If your playbook does not have any errors, it will give output like the image above. In my case ‘changed’ is only 4 because I had already tested the playbook before. If you are running it for the first time, you will see all the task outputs in yellow; if any output comes in red, your playbook has an error.

So, let’s check on my Name Node whether all the changes were made:-

1- Both software packages were successfully copied and installed:-

2- Both Hadoop configuration files were successfully configured:-

3- The Name Node service has started:-

So finally, my Hadoop Name Node is configured successfully.

STEP-6:- Now we have to create a similar playbook for the Data Node configuration, and also update the inventory file with my Data Node IP. Command to create the playbook:-

vim datanode.yml

Here I have used the same Ansible modules. A few changes were made to the hdfs-site.xml configuration, and since we don’t have to format the Data Node directory, I removed that task. A sketch of this playbook follows below.
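Under the same assumptions as the Name Node sketch (and assuming the Data Node IP was added to the inventory under a [datanode] group), datanode.yml could look like this; hdfs-site.xml gets dfs.data.dir instead of dfs.name.dir, and there is no format step:

- hosts: datanode
  vars_prompt:
    - name: ip
      prompt: "Enter the Name Node IP"
      private: no
    - name: port
      prompt: "Enter the port of the Hadoop cluster (e.g. 9001)"
      private: no
    - name: directory
      prompt: "Enter the Data Node directory (e.g. /dn)"
      private: no
  tasks:
    - copy:
        src: /root/hadoop-1.2.1-1.x86_64.rpm     # assumed file name and path
        dest: /root/
    - copy:
        src: /root/jdk-8u171-linux-x64.rpm       # assumed file name and path
        dest: /root/
    - shell: rpm -ivh /root/jdk-8u171-linux-x64.rpm
      ignore_errors: yes
    - shell: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force
      ignore_errors: yes
    - file:
        path: "{{ directory }}"
        state: directory
    - blockinfile:
        path: /etc/hadoop/hdfs-site.xml
        insertafter: "<configuration>"
        marker: "<!-- {mark} ANSIBLE MANAGED BLOCK -->"
        block: |
          <property>
            <name>dfs.data.dir</name>
            <value>{{ directory }}</value>
          </property>
    - blockinfile:
        path: /etc/hadoop/core-site.xml
        insertafter: "<configuration>"
        marker: "<!-- {mark} ANSIBLE MANAGED BLOCK -->"
        block: |
          <property>
            <name>fs.default.name</name>
            <value>hdfs://{{ ip }}:{{ port }}</value>
          </property>
    # start the Data Node service (no formatting needed on a Data Node)
    - shell: hadoop-daemon.sh start datanode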

STEP-7:- Run this playbook. Command:-

ansible-playbook datanode.yml

As you can see, there are no errors in my playbook, which means everything was changed successfully.

Let’s see inside the Data Node:-

1- Both files were copied and installed:-

2- Both configuration files were updated successfully:-

3- The service has started:-

NOTE:- YOU MIGHT FACE AN ISSUE WHILE STARTING THE DATA NODE SERVICE. TO SOLVE IT, EITHER STOP FIREWALLD ON BOTH NODES OR OPEN THE “9001/TCP” PORT IN FIREWALLD.
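For example, on each node you could either open the port (assuming 9001 is the port you chose for the cluster):

firewall-cmd --add-port=9001/tcp --permanent
firewall-cmd --reload

or simply stop firewalld altogether:

systemctl stop firewalld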

Finally, my Data Node is configured successfully. Now we can see on the Name Node that one Data Node is connected. Command to check:-

hadoop dfsadmin -report
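If you are on Hadoop 2.x or later, the same report is available through the hdfs front end:

hdfs dfsadmin -report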

<<HURRAY FINALLY WE COMPLETED OUR TASK>>

For Playbooks:-

THANK YOU SO MUCH FOR READING THIS ARTICLE. HOPE IT HELPS YOU.

FOR MORE SUCH TYPE OF ARTICLES, STAY CONNECTED..

KEEP DOING AND KEEP SHARING
