
Tune Hadoop Cluster to get Maximum Performance (Part 2)

In the previous part we saw how to tune the operating system to get maximum performance for Hadoop. In this article I will focus on how to tune the Hadoop cluster itself to get a performance boost at the Hadoop level :-)





Before I actually start explaining the tuning parameters, let me cover some basic terms that are required to understand Hadoop-level tuning.


What is YARN?

YARN – Yet Another Resource Negotiator. It is the resource-management layer of MapReduce version 2, with many new features such as dynamic memory allocation for mappers and reducers rather than fixed slots.


What is Container?

A container represents allocated resources (CPU, RAM, etc.) on a single node. It is a JVM process; in YARN, the ApplicationMaster, mappers and reducers all run inside containers.


Let’s get into the game now:


1. The Resource Manager (RM) is responsible for allocating resources to MapReduce jobs.

2. On a brand-new Hadoop cluster (without any tuning), each node offers only 8192 MB of memory to the Resource Manager for containers (“yarn.nodemanager.resource.memory-mb”).

3. The RM can allocate up to 8192 MB (“yarn.scheduler.maximum-allocation-mb”) to a single container, including the Application Master container.

4. The default minimum allocation is 1024 MB (“yarn.scheduler.minimum-allocation-mb”).

5. The AM can only negotiate resources from the Resource Manager in increments of (“yarn.scheduler.minimum-allocation-mb”), and it cannot exceed (“yarn.scheduler.maximum-allocation-mb”).

6. The Application Master rounds (“mapreduce.map.memory.mb“) and (“mapreduce.reduce.memory.mb“) up to a value divisible by (“yarn.scheduler.minimum-allocation-mb“). For example, with the default minimum allocation of 1024 MB, a request for 1536 MB is rounded up to 2048 MB.


What are these properties? What can we tune?



yarn.scheduler.minimum-allocation-mb

Default value is 1024 MB.

Sets the minimum container size that YARN will allocate for running MapReduce jobs.




yarn.scheduler.maximum-allocation-mb

Default value is 8192 MB.

The largest container size that YARN will allow for running MapReduce jobs.




yarn.nodemanager.resource.memory-mb

Default value is 8 GB.
Total amount of physical memory (RAM) available for containers on a worker node.

Set this property = Total RAM – (RAM for OS + Hadoop daemons + other services). For example, on a 64 GB worker node that reserves roughly 16 GB for the OS, Hadoop daemons and other services, you would set it to about 48 GB (49152 MB).




yarn.nodemanager.vmem-pmem-ratio

Default value is 2.1.

The amount of virtual memory that each container is allowed.

This can be calculated as: containerMemoryRequest * vmem-pmem-ratio. For example, a container requesting 2048 MB of physical memory may use up to 2048 * 2.1 ≈ 4301 MB of virtual memory.




mapreduce.map.memory.mb / mapreduce.reduce.memory.mb

Default value – 1 GB.

These are the hard limits enforced by Hadoop on each mapper or reducer task, i.e. the maximum memory that can be assigned to a mapper’s or reducer’s container.




mapreduce.map.java.opts / mapreduce.reduce.java.opts

The JVM heap size (-Xmx) for the mapper or reducer task.

This value should always be lower than mapreduce.[map|reduce].memory.mb.

The recommended value is 80% of mapreduce.map.memory.mb / mapreduce.reduce.memory.mb.
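
As an illustration of the 80% rule, here is a hedged mapred-site.xml sketch; the 4096 MB container size is an assumed example value, not a recommendation for every cluster:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3276m</value> <!-- ~80% of the 4096 MB container -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value> <!-- ~80% of the 4096 MB container -->
</property>

With these values each map or reduce JVM gets a 3276 MB heap, and the remaining ~820 MB of the container is left for off-heap overhead.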




yarn.app.mapreduce.am.resource.mb

The amount of memory for the ApplicationMaster container.




yarn.app.mapreduce.am.command-opts

The JVM heap size (-Xmx) for the ApplicationMaster.




yarn.nodemanager.resource.cpu-vcores

The number of cores that a NodeManager can allocate to containers is controlled by this property. It should be set to the total number of cores on the machine, minus one core for each daemon process running on the machine (DataNode, NodeManager, and any other long-running processes).




mapreduce.task.io.sort.mb

Default value – 100 MB.

This is a very important property to tune. While a map task is in progress, it writes its output into a circular in-memory buffer. The size of this buffer is fixed and determined by this property.

When the buffer fills up to the spill threshold (mapreduce.map.sort.spill.percent, 80% by default), spilling to disk starts in parallel on a separate thread. Note that if the spilling thread is too slow and the buffer becomes 100% full, the map task cannot proceed and has to wait.




io.file.buffer.size

Hadoop uses a buffer size of 4 KB by default for its I/O operations. We can increase it to 128 KB for better performance by setting io.file.buffer.size=131072 (value in bytes) in core-site.xml.
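
The corresponding core-site.xml entry, exactly as described above, would be:

<property>
  <name>io.file.buffer.size</name>
  <value>131072</value> <!-- 128 KB, in bytes -->
</property>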




dfs.client.read.shortcircuit

Short-circuit reads – when reading a file from HDFS, the client normally contacts the DataNode and the data is sent over a TCP connection. If the block being read is on the same node as the client, it is more efficient for the client to bypass the network and read the block data directly from disk.

We can enable short-circuit reads by setting this property to “true”.
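
A minimal hdfs-site.xml sketch for enabling this; note that short-circuit reads also need a UNIX domain socket shared by the DataNode and the client, and the socket path below is the commonly documented example, so treat it as an assumption:

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value> <!-- assumed path; the DataNode must be able to create it -->
</property>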




mapreduce.task.io.sort.factor

Default value is 10.

Now imagine a running map task: each time the in-memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there can be several spill files. Before the task finishes, the spill files are merged into a single partitioned and sorted output file.

The configuration property mapreduce.task.io.sort.factor controls the maximum number of streams to merge at once.




mapreduce.reduce.shuffle.parallelcopies

Default value is 5.

The map output file sits on the local disk of the machine that ran the map task.

Map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each one completes.

The reduce task uses a small number of copier threads so that it can fetch map outputs in parallel.

The default is five threads, but this number can be changed by setting the mapreduce.reduce.shuffle.parallelcopies property.
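
Putting the shuffle-related settings together, here is a hedged mapred-site.xml sketch; the values are illustrative assumptions for a node with spare memory, not universal recommendations:

<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>256</value> <!-- larger sort buffer to reduce spills; assumed value -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value> <!-- default spill threshold -->
</property>
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>64</value> <!-- merge more spill files per pass; assumed value -->
</property>
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>10</value> <!-- more copier threads for larger clusters; assumed value -->
</property>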




I tried my best to cover as much as I can; there are plenty of other things you can tune! I hope this article was helpful. What I recommend is to tune the above properties based on the total available memory, the total number of cores, etc., and then run benchmarking tools such as TeraGen and TeraSort to measure the results. Keep tuning until you get the best out of your cluster!! :-)



Install multinode cloudera hadoop cluster cdh5.4.0 manually

This document will guide you through installing a multinode Cloudera Hadoop cluster (CDH 5.4.0) without Cloudera Manager.


In this tutorial I have used two CentOS 6.6 virtual machines, viz. master.hadoop.com & slave.hadoop.com.




CentOS 6.X

JDK 1.7.X is needed in order to get CDH working. If you have a lower version of the JDK, please uninstall it and install JDK 1.7.X.


Master machine – master.hadoop.com

Daemons that we are going to install on the master are:

ZooKeeper Server
Namenode
MapReduce JobHistory Server
YARN ProxyServer
Slave machine – slave.hadoop.com

Daemons that we are going to install on the slave are:

Resource Manager (Yarn)

Node Manager

Secondary Namenode

Datanode

Important configuration before proceeding further: please add the hostname and IP information of both hosts to the /etc/hosts file on each host (in the listings below, replace <master-ip> and <slave-ip> with the actual IP addresses of your machines).


[root@master ~]# cat /etc/hosts
127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4
::1          localhost localhost.localdomain localhost6 localhost6.localdomain6
<master-ip>  master.hadoop.com
<slave-ip>   slave.hadoop.com



[root@slave ~]# cat /etc/hosts
127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4
::1          localhost localhost.localdomain localhost6 localhost6.localdomain6
<master-ip>  master.hadoop.com
<slave-ip>   slave.hadoop.com


Please verify that both hosts can ping each other :-)




Please stop the firewall and disable SELinux.


To stop the firewall on CentOS:


service iptables stop && chkconfig iptables off


To disable SELinux:


vim /etc/selinux/config


Once the file is open, make sure “SELINUX=disabled” is set and save it (a reboot is required for the change to take effect).



1. Dates should be in sync


Please make sure that the master and slave machines’ dates are in sync; if not, please configure NTP.


2. Passwordless SSH must be set up from master –> slave


To set up passwordless SSH, follow the procedure below:


2a. Generate an RSA key pair using the ssh-keygen command


[root@master conf]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
/root/.ssh/id_rsa already exists.
Overwrite (y/n)?


2b. Copy generated public key to slave.hadoop.com


[root@master conf]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave.hadoop.com


Now try logging into the machine, with “ssh ‘root@slave.hadoop.com'”, and check in:


.ssh/authorized_keys


to make sure we haven’t added extra keys that you weren’t expecting.


2c. Now try connecting to slave.hadoop.com using ssh


[root@master conf]# ssh root@slave.hadoop.com
Last login: Fri Apr 24 14:20:43 2015 from master.hadoop.com
[root@slave ~]# logout
Connection to slave.hadoop.com closed.
[root@master conf]#


That’s it! You have successfully configured passwordless ssh between master and slave node.


3. Internet connection


Please make sure that you have a working internet connection, as we are going to download CDH packages in the next steps.


4. Install cdh repo


4a. Download the CDH repo RPM


[root@master ~]# wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm



4b. Install the CDH repo downloaded in the above step


[root@master ~]# yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
Loaded plugins: fastestmirror, refresh-packagekit, security
setting up Local Package Process



4c. Do the same steps on the slave node



[root@slave ~]# wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm



[root@slave ~]# yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
Loaded plugins: fastestmirror, refresh-packagekit, security
Setting up Local Package Process



5. Install and deploy ZooKeeper.



[root@master ~]# yum -y install zookeeper-server
Loaded plugins: fastestmirror, refresh-packagekit, security
Setting up Install Process



5a. Create the ZooKeeper data directory and apply permissions



[root@master ~]# mkdir -p /var/lib/zookeeper
[root@master ~]# chown -R zookeeper /var/lib/zookeeper/



5b. Initialize ZooKeeper and start the service


[root@master ~]# service zookeeper-server init
No myid provided, be sure to specify it in /var/lib/zookeeper/myid if using non-standalone



[root@master ~]# service zookeeper-server start
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Starting zookeeper ... STARTED



6. Install namenode on master machine


yum -y install hadoop-hdfs-namenode


7. Install secondary namenode on slave machine


yum -y install hadoop-hdfs-secondarynamenode


8. Install resource manager on slave machine


yum -y install hadoop-yarn-resourcemanager


9. Install nodemanager, datanode & mapreduce on slave node


yum -y install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce


10. Install history server and yarn proxyserver on master machine


yum -y install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver


11. On both machines you can install the hadoop-client package


yum -y install hadoop-client


Now that we are done with the installation, it’s time to deploy HDFS!


1. On each node, execute the commands below:


[root@master ~]# cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster
[root@master ~]# alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
[root@master ~]# alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster



[root@slave ~]# cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster
[root@slave ~]# alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
[root@slave ~]# alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster



2. Let’s configure the HDFS properties now:


Go to the /etc/hadoop/conf/ directory on the master node and edit the property files below:


2a. vim /etc/hadoop/conf/core-site.xml


Add the lines below to it under the <configuration> tag (see the sketch that follows).
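
At a minimum, core-site.xml has to tell every client where the NameNode runs. A sketch of the lines to add, assuming the NameNode on master.hadoop.com and the default HDFS port 8020:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master.hadoop.com:8020</value> <!-- assumes the NameNode on master.hadoop.com, default port -->
</property>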





2b. vim /etc/hadoop/conf/hdfs-site.xml


The key property to add here sets the address and the base port on which the HDFS NameNode web UI will listen (see the sketch below).
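
A hedged hdfs-site.xml sketch of the properties to add under the <configuration> tag; the port, superuser group and replication factor are assumptions, while the directory paths match the ones created in step 4 below:

<property>
  <name>dfs.namenode.http-address</name>
  <value>master.hadoop.com:50070</value> <!-- NameNode web UI; 50070 is the default port -->
</property>
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>hadoop</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/1/dfs/nn,file:///nfsmount/dfs/nn</value> <!-- matches the directories created in step 4 -->
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///data/1/dfs/dn,file:///data/2/dfs/dn,file:///data/3/dfs/dn,file:///data/4/dfs/dn</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value> <!-- assumed: only one DataNode in this two-node setup -->
</property>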



3. scp core-site.xml and hdfs-site.xml to slave.hadoop.com at /etc/hadoop/conf/


[root@master conf]# scp core-site.xml hdfs-site.xml slave.hadoop.com:/etc/hadoop/conf/
core-site.xml                                                                                                                                                 100% 1001     1.0KB/s   00:00
hdfs-site.xml                                                                                                                                                100% 1669     1.6KB/s   00:00
[root@master conf]#



4. Create local directories:


On the master host, run the commands below:


mkdir -p /data/1/dfs/nn /nfsmount/dfs/nn
chown -R hdfs:hdfs /data/1/dfs/nn /nfsmount/dfs/nn
chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn
chmod go-rx /data/1/dfs/nn /nfsmount/dfs/nn



On the slave host, run the commands below:


mkdir -p /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn
chown -R hdfs:hdfs /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn



5. Format the namenode :


sudo -u hdfs hdfs namenode -format



6. Start hdfs services


Run the commands below on both the master and the slave:


for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do service $x start ; done


7. Create hdfs tmp dir


Run on any of the Hadoop nodes:


[root@slave ~]# sudo -u hdfs hadoop fs -mkdir /tmp
[root@slave ~]# sudo -u hdfs hadoop fs -chmod -R 1777 /tmp



Congratulations! You have deployed hdfs successfully :-)



Deploy Yarn


1. Prepare yarn configuration properties


Replace your /etc/hadoop/conf/mapred-site.xml with the contents below on the master host (a sketch follows the listing).


[root@master conf]# cat mapred-site.xml
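
The essential entry in mapred-site.xml tells MapReduce to run on YARN; a minimal sketch of what the file should contain between the <configuration> tags:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>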




2. Replace your /etc/hadoop/conf/yarn-site.xml with the contents below on the master host (a sketch follows the listing).


[root@master conf]# cat yarn-site.xml

<description>List of directories to store localized files in.</description>
<description>Where to store container logs.</description>
<description>Where to aggregate logs to.</description>
<description>Classpath for typical applications.</description>
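
Based on the <description> entries above and the directories created in step 4 below, a hedged yarn-site.xml sketch; the ResourceManager hostname, log-aggregation settings and classpath value are assumptions to adjust for your cluster:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>slave.hadoop.com</value> <!-- the ResourceManager runs on the slave in this tutorial -->
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value> <!-- required for the MapReduce shuffle -->
</property>
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file:///data/1/yarn/local,file:///data/2/yarn/local,file:///data/3/yarn/local,file:///data/4/yarn/local</value>
  <description>List of directories to store localized files in.</description>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>file:///data/1/yarn/logs,file:///data/2/yarn/logs,file:///data/3/yarn/logs,file:///data/4/yarn/logs</value>
  <description>Where to store container logs.</description>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>hdfs://master.hadoop.com:8020/var/log/hadoop-yarn/apps</value>
  <description>Where to aggregate logs to.</description>
</property>
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
  <description>Classpath for typical applications.</description>
</property>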



3. Copy modified files to slave machine.



[root@master conf]# scp mapred-site.xml yarn-site.xml slave.hadoop.com:/etc/hadoop/conf/
mapred-site.xml                                                                                                                                              100% 1086     1.1KB/s   00:00
yarn-site.xml                                                                                                                                                 100% 2787     2.7KB/s   00:00
[root@master conf]#




4. Configure local directories for yarn


To be done on the YARN machine, i.e. slave.hadoop.com in our case:


[root@slave ~]# mkdir -p /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
[root@slave ~]# mkdir -p /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs
[root@slave ~]# chown -R yarn:yarn /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
[root@slave ~]# chown -R yarn:yarn /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs



5. Configure the history server.


Add the properties below to mapred-site.xml (a sketch follows).
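
Assuming the JobHistory server runs on master.hadoop.com with the default ports (10020 for the RPC address and 19888 for the web UI), the entries look roughly like this:

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>master.hadoop.com:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>master.hadoop.com:19888</value>
</property>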





6. Configure proxy settings for the history server


Add the properties below to /etc/hadoop/conf/core-site.xml (a sketch follows).
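
A sketch of the usual proxy-user entries for the mapred user; allowing all hosts and groups (“*”) is an assumption for a test cluster and should be restricted in production:

<property>
  <name>hadoop.proxyuser.mapred.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.mapred.hosts</name>
  <value>*</value>
</property>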





7. Copy modified files to slave.hadoop.com


[root@master conf]# scp mapred-site.xml core-site.xml slave.hadoop.com:/etc/hadoop/conf/
mapred-site.xml                                                                                                                                               100% 1299     1.3KB/s   00:00
core-site.xml                                                                                                                                                 100% 1174     1.2KB/s   00:00
[root@master conf]#



8. Create history directories and set permissions


[root@master conf]# sudo -u hdfs hadoop fs -mkdir -p /user/history
[root@master conf]# sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
[root@master conf]# sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history


9. Create log directories and set permissions


[root@master conf]# sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
[root@master conf]# sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn


10. Verify hdfs file structure


[root@master conf]# sudo -u hdfs hadoop fs -ls -R /
drwxrwxrwt   - hdfs hadoop         0 2015-04-25 01:16 /tmp
drwxr-xr-x   - hdfs hadoop         0 2015-04-25 02:52 /user
drwxrwxrwt   - mapred hadoop         0 2015-04-25 02:52 /user/history
drwxr-xr-x   - hdfs   hadoop         0 2015-04-25 02:53 /var
drwxr-xr-x   - hdfs   hadoop         0 2015-04-25 02:53 /var/log
drwxr-xr-x   - yarn   mapred         0 2015-04-25 02:53 /var/log/hadoop-yarn
[root@master conf]#



11. Start YARN and the JobHistory server


On slave.hadoop.com


[root@slave ~]# sudo service hadoop-yarn-resourcemanager start
starting resourcemanager, logging to /var/log/hadoop-yarn/yarn-yarn-resourcemanager-slave.hadoop.com.out
Started Hadoop resourcemanager:                           [ OK ]
[root@slave ~]#


[root@slave ~]# sudo service hadoop-yarn-nodemanager start
starting nodemanager, logging to /var/log/hadoop-yarn/yarn-yarn-nodemanager-slave.hadoop.com.out
Started Hadoop nodemanager:                               [ OK ]
[root@slave ~]#



On master.hadoop.com


[root@master conf]# sudo service hadoop-mapreduce-historyserver start
starting historyserver, logging to /var/log/hadoop-mapreduce/mapred-mapred-historyserver-master.hadoop.com.out
15/04/25 02:56:01 INFO hs.JobHistoryServer: STARTUP_MSG:
STARTUP_MSG: Starting JobHistoryServer
STARTUP_MSG:   host = master.hadoop.com/
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.6.0-cdh5.4.0
STARTUP_MSG:   classpath =
STARTUP_MSG:   build = http://github.com/cloudera/hadoop -r c788a14a5de9ecd968d1e2666e8765c5f018c271; compiled by 'jenkins' on 2015-04-21T19:18Z
STARTUP_MSG:   java = 1.7.0_79
Started Hadoop historyserver:                             [ OK ]
[root@master conf]#



12. Create an HDFS home directory for the user who will run MapReduce jobs


[root@master conf]# sudo -u hdfs hadoop fs -mkdir /user/kuldeep
[root@master conf]# sudo -u hdfs hadoop fs -chown kuldeep /user/kuldeep




13. Important: Don’t forget to set the core Hadoop services to start automatically when the OS boots up.


On master.hadoop.com


[root@master conf]# sudo chkconfig hadoop-hdfs-namenode on
[root@master conf]# sudo chkconfig hadoop-mapreduce-historyserver on


On slave.hadoop.com


[root@slave ~]# sudo chkconfig hadoop-yarn-resourcemanager on
[root@slave ~]# sudo chkconfig hadoop-hdfs-secondarynamenode on
[root@slave ~]# sudo chkconfig hadoop-yarn-nodemanager on
[root@slave ~]# sudo chkconfig hadoop-hdfs-datanode on



Final step: check the UIs :-)


Namenode UI





Yarn UI






Job History Server UI






Secondary Namenode UI
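
Assuming the default ports for this release, the UIs should be reachable at addresses along these lines (adjust the hostnames to your own setup):

Namenode UI – http://master.hadoop.com:50070
Yarn (Resource Manager) UI – http://slave.hadoop.com:8088
Job History Server UI – http://master.hadoop.com:19888
Secondary Namenode UI – http://slave.hadoop.com:50090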



