Tag : yarn-deployment

Tune Hadoop Cluster to get Maximum Performance (Part 2)

In previous part we have seen that how can we tune our operating system to get maximum performance for Hadoop, in this article I will be focusing on how to tune hadoop cluster to get performance boost on hadoop level :-)

 

tune

Tune Hadoop Cluster to get Maximum Performance (Part 2) – http://crazyadmins.com

 

Before I actually start explaining tuning parameters let me cover some basic terms that are required to understand Hadoop level tuning.

 

What is YARN?

YARN – Yet another resource negotiator, this is Map-reduce version 2 with many new features such as dynamic memory assignment for mappers and reducers rather than having fixed slots etc.

 

What is Container?

Container represents allocated Resources like CPU, RAM etc. It’s a JVM process, in YARN AppMaster, Mapper and Reducer runs inside the Container.

 

Let’s get into the game now:

 

1. Resource Manager (RM) is responsible for allocating resources to mapreduce jobs.

2. For brand new Hadoop cluster (without any tuning) resource manager will get 8192MB (“yarn.nodemanager.resource.memory-mb”) memory per node only.

3. RM can allocate up to 8192 MB (“yarn.scheduler.maximum-allocation-mb”) to the Application Master container.

4. Default minimum allocation is 1024 MB (“yarn.scheduler.minimum-allocation-mb”).

5. The AM can only negotiate resources from Resource Manager that are in increments of (“yarn.scheduler.minimum-allocation-mb”) & it cannot exceed (“yarn.scheduler.maximum-allocation-mb”).

6. Application Master Rounds off (“mapreduce.map.memory.mb“) & (“mapreduce.reduce.memory.mb“) to a value devisable by (“yarn.scheduler.minimum-allocation-mb“).

 

What are these properties ? What can we tune ?

 

yarn.scheduler.minimum-allocation-mb

Default value is 1024m

Sets the minimum size of container that YARN will allow for running mapreduce jobs.

 

 

yarn.scheduler.maximum-allocation-mb

Default value is 8192m

The largest size of container that YARN will allow us to run the Mapreduce jobs.

 

 

yarn.nodemanager.resource.memory-mb

Default value is 8GB
Total amount of physical memory (RAM) for Containers on worker node.

Set this property= Total RAM – (RAM for OS + Hadoop Daemons + Other services)

 

 

yarn.nodemanager.vmem-pmem-ratio

Default value is 2.1

The amount of virtual memory that each Container is allowed

This can be calculated with: containerMemoryRequest*vmem-pmem-ratio

 

 

mapreduce.map.memory.mb 
mapreduce.reduce.memory.mb

These are the hard limits enforced by Hadoop on each mapper or reducer task. (Maximum memory that can be assigned to mapper or reducer’s container)

Default value – 1GB

 

 

mapreduce.map.java.opts
mapreduce.reduce.java.opts

The heapsize of the jvm –Xmx for the mapper or reducer task.

This value should always be lower than mapreduce.[map|reduce].memory.mb.

Recommended value is 80% of mapreduce.map.memory.mb/ mapreduce.reduce.memory.mb

 

 

yarn.app.mapreduce.am.resource.mb

The amount of memory for ApplicationMaster

 

 

yarn.app.mapreduce.am.command-opts

heapsize for application Master

 

 

yarn.nodemanager.resource.cpu-vcores

The number of cores that a node manager can allocate to containers is controlled by the yarn.nodemanager.resource.cpu-vcores property. It should be set to the total number of cores on the machine, minus a core for each daemon process running on the machine (datanode, node manager, and any other long-running processes).

 

 

mapreduce.task.io.sort.mb

Default value – 100MB

This is very important property to tune, when map task is in progress it writes output into a circular in-memory buffer. The size of this buffer is fixed and determined by io.sort.mb property

When this circular in-memory buffer gets filled (mapreduce.map. sort.spill.percent: 80% by default), the SPILLING to disk will start (in parallel using a separate thread). Notice that if the splilling thread is too slow and the buffer is 100% full, then the map cannot be executed and thus it has to wait.

 

 

io.file.buffer.size

Hadoop uses buffer size of 4KB by default for its I/O operations, we can increase it to 128K in order to get good performance and this value can be increased by setting io.file.buffer.size= 131072 (value in bytes) in core-site.xml

 

 

dfs.client.read.shortcircuit

Short-circuit reads – When reading a file from HDFS, the client contacts the datanode and the data is sent to the client via a TCP connection. If the block being read is on the same node as the client, then it is more efficient for the client to bypass the network and read the block data directly from the disk.

We can enable short-circuit reads by setting this property to “true”

 

 

mapreduce.task.io.sort.factor

Default value is 10.

Now imagine the situation where map task is running, each time the memory buffer reaches the spill threshold, a new spill file is created, after the map task has written its last output record, there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file.

The configuration property mapreduce.task.io.sort.factor controls the maximum number of streams to merge at once.

 

 

mapreduce.reduce.shuffle.parallelcopies

Default value is 5

The map output file is sitting on the local disk of the machine that ran the map task

The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes

The reduce task has a small number of copier threads so that it can fetch map outputs in parallel.

The default is five threads, but this number can be changed by setting the mapreduce.reduce.shuffle.parallelcopies property

 

 

 

I tried my best to cover as much as I can, there are plenty of things you can do for tuning! I hope this article was helpful to you. What I recommend you guys is try tuning above properties by considering total available memory capacity, total number of cores etc. and run the Teragen, Terasort etc. benchmarking tool to get the results, try tuning until you get best out of it!! :-)

 

facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Install multinode cloudera hadoop cluster cdh5.4.0 manually

This document will guide you regarding how to install multinode cloudera hadoop cluster cdh5.4.0 without Cloudera manager.

 

In this tutorial I have used 2 Centos 6.6 virtual machines viz. master.hadoop.com & slave.hadoop.com.

 

Prerequisites:

 

CentOS 6.X

jdk1.7.X is needed in order to get CDH working. If you have lower version of jdk, please uninstall it and install jdk1.7.X

 

Master machine – master.hadoop.com (192.168.111.130)

Daemons that we are going to install on master are :

Namenode

HistoryServer

 

Slave machine – slave.hadoop.com (192.168.111.131)

Daemons that we are going to install on master are :

Resource Manager (Yarn)

Node-manager

Secondary Namenode

Datanode

 

Important configuration before proceeding further: please add both the hostname and ip information to /etc/hosts file on each host.

 

[root@master ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.111.130         master.hadoop.com
192.168.111.131         slave.hadoop.com

 

 

[root@slave ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.111.130         master.hadoop.com
192.168.111.131        slave.hadoop.com

 

Please verify that both the hosts are ping’able from each other :-)

 

Also,

 

Please stop the firewall and disable the selinux.

 

To stop firewall in centos :

 

service iptables stop && chkconfig iptables off

 

To disable selinux :

 

vim /etc/selinux/config

 

once file is opened, please verify that “SELINUX=disabled” is set.

 

 

1.Date should be in sync

 

Please make sure that master and slave machine’s date is in sync, if not please do it so by configuring NTP.

 

2.Passwordless ssh must be setup from master –> slave

 

To setup passwordless ssh follow below procedure :

 

2a. Generate rsa key pair using ssh-keygen command

 

[root@master conf]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
/root/.ssh/id_rsa already exists.
Overwrite (y/n)?

 

2b. Copy generated public key to slave.hadoop.com

 

[root@master conf]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave.hadoop.com

 

Now try logging into the machine, with “ssh ‘root@slave.hadoop.com'”, and check in:

 

.ssh/authorized_keys

 

to make sure we haven’t added extra keys that you weren’t expecting.

 

2c. Now try connecting to slave.hadoop.com using ssh

 

[root@master conf]# ssh root@slave.hadoop.com
Last login: Fri Apr 24 14:20:43 2015 from master.hadoop.com
[root@slave ~]# logout
Connection to slave.hadoop.com closed.
[root@master conf]#

 

That’s it! You have successfully configured passwordless ssh between master and slave node.

 

3. Internet connection

 

Please make sure that you have working internet connection, as we are going to download CDH packages in next steps.

 

4. Install cdh repo

 

4a. download cdh repo rpm

 

[root@master ~]# wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm

 

 

4b. install cdh repo downloaded in above step

 

[root@master ~]# yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
Loaded plugins: fastestmirror, refresh-packagekit, security
setting up Local Package Process
....
Complete!

 

 

4c. do the same steps on slave node

 

 

[root@slave ~]# wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm

 

 

[root@slave ~]# yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
Loaded plugins: fastestmirror, refresh-packagekit, security
Setting up Local Package Process
……
Complete!

 

 

5. Install and deploy ZooKeeper.

 

 

[root@master ~]# yum -y install zookeeper-server
Loaded plugins: fastestmirror, refresh-packagekit, security
Setting up Install Process
…..
Complete!

 

 

5a. create zookeeper dir and apply permissions

 

 

[root@master ~]# mkdir -p /var/lib/zookeeper
[root@master ~]# chown -R zookeeper /var/lib/zookeeper/

 

 

5b. Init zookeeper and start the service

 

[root@master ~]# service zookeeper-server init
No myid provided, be sure to specify it in /var/lib/zookeeper/myid if using non-standalone

 

 

[root@master ~]# service zookeeper-server start
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Starting zookeeper ... STARTED

 

 

6. Install namenode on master machine

 

yum -y install hadoop-hdfs-namenode

 

7. Install secondary namenode on slave machine

 

yum -y install hadoop-hdfs-secondarynamenode

 

8. Install resource manager on slave machine

 

yum -y install hadoop-yarn-resourcemanager

 

9. Install nodemanager, datanode & mapreduce on slave node

 

yum -y install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce

 

10. Install history server and yarn proxyserver on master machine

 

yum -y install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver

 

11. On both the machine you can install hadoop-client package

 

yum -y install hadoop-client

 

Now we are done with the installation, it’s time to deploy HDFS!

 

1. On each node, execute below commands :

 

[root@master ~]# cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster
[root@master ~]# alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
[root@master ~]# alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster

 

 

[root@slave ~]# cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster
[root@slave ~]# alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
[root@slave ~]# alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster

 

 

2. Let’s configure hdfs properties now :

 

Goto /etc/hadoop/conf/ dir on master node and edit below property files:

 

2a. vim /etc/hadoop/conf/core-site.xml

 

Add below lines in it under <configuration> tag

 

<property>
<name>fs.defaultFS</name>
<value>hdfs://master.hadoop.com:8020</value>
</property>

 

 

2b. vim /etc/hadoop/conf/hdfs-site.xml

 

<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/1/dfs/nn,file:///nfsmount/dfs/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/1/dfs/dn,file:///data/2/dfs/dn,file:///data/3/dfs/dn,file:///data/4/dfs/dn</value></property>
<property>
<name>dfs.namenode.http-address</name>
<value>192.168.111.130:50070</value>
<description>
The address and the base port on which the dfs NameNode Web UI will listen.
</description>
</property>

 

 

3. scp core-site.xml and hdfs-site.xml to slave.hadoop.com at /etc/hadoop/conf/

 

[root@master conf]# scp core-site.xml hdfs-site.xml slave.hadoop.com:/etc/hadoop/conf/
core-site.xml                                                                                                                                                 100% 1001     1.0KB/s   00:00
hdfs-site.xml                                                                                                                                                100% 1669     1.6KB/s   00:00
[root@master conf]#

 

 

4. Create local directories:

 

On master host run below commands:

 

mkdir -p /data/1/dfs/nn /nfsmount/dfs/nn
chown -R hdfs:hdfs /data/1/dfs/nn /nfsmount/dfs/nn
chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn
chmod go-rx /data/1/dfs/nn /nfsmount/dfs/nn

 

 

On slave host run below commands:

 

mkdir -p /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn
chown -R hdfs:hdfs /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn

 

 

5. Format the namenode :

 

sudo -u hdfs hdfs namenode -format

 

 

6. Start hdfs services

 

Run below commands on master and slave

 

for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do service $x start ; done

 

7. Create hdfs tmp dir

 

Run on any of the hadoop node

 

[root@slave ~]# sudo -u hdfs hadoop fs -mkdir /tmp
[root@slave ~]# sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

 

 

Congratulations! You have deployed hdfs successfully :-)

 

 

Deploy Yarn

 

1. Prepare yarn configuration properties

 

replace your /etc/hadoop/conf/mapred-site.xml with below contents on master host

 

[root@master conf]# cat mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
</configuration>

 

 

2. Replace your /etc/hadoop/conf/yarn-site.xml with below contents on master host

 

[root@master conf]# cat yarn-site.xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<description>List of directories to store localized files in.</description>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///var/lib/hadoop-yarn/cache/${user.name}/nm-local-dir</value>
</property>
<property>
<description>Where to store container logs.</description>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///var/log/hadoop-yarn/containers</value>
</property>
<property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://master.hadoop.com:8020/var/log/hadoop-yarn/apps</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>slave.hadoop.com</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///data/1/yarn/local,file:///data/2/yarn/local,file:///data/3/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///data/1/yarn/logs,file:///data/2/yarn/logs,file:///data/3/yarn/logs</value>
</property>
</configuration>

 

 

3. Copy modified files to slave machine.

 

 

[root@master conf]# scp mapred-site.xml yarn-site.xml slave.hadoop.com:/etc/hadoop/conf/
mapred-site.xml                                                                                                                                              100% 1086     1.1KB/s   00:00
yarn-site.xml                                                                                                                                                 100% 2787     2.7KB/s   00:00
[root@master conf]#

 

 

 

4. Configure local directories for yarn

 

To be done on yarn machine i.e. slave.hadoop.com in our case

 

[root@slave ~]# mkdir -p /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
[root@slave ~]# mkdir -p /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs
[root@slave ~]# chown -R yarn:yarn /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
[root@slave ~]# chown -R yarn:yarn /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs

 

 

5. Configure the history server.

 

Add below properties in mapred-site.xml

 

<property>
<name>mapreduce.jobhistory.address</name>
<value>master.hadoop.com:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master.hadoop.com:19888</value>
</property>

 

 

6. Configure proxy settings for history server

 

Add below properties in /etc/hadoop/conf/core-site.xml

 

<property>
<name>hadoop.proxyuser.mapred.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.hosts</name>
<value>*</value>
</property>

 

 

7. Copy modified files to slave.hadoop.com

 

[root@master conf]# scp mapred-site.xml core-site.xml slave.hadoop.com:/etc/hadoop/conf/
mapred-site.xml                                                                                                                                               100% 1299     1.3KB/s   00:00
core-site.xml                                                                                                                                                 100% 1174     1.2KB/s   00:00
[root@master conf]#

 

 

8. Create history directories and set permissions

 

[root@master conf]# sudo -u hdfs hadoop fs -mkdir -p /user/history
[root@master conf]# sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
[root@master conf]# sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history

 

9. Create log directories and set permissions

 

[root@master conf]# sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
[root@master conf]# sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn

 

10. Verify hdfs file structure

 

[root@master conf]# sudo -u hdfs hadoop fs -ls -R /
drwxrwxrwt   - hdfs hadoop         0 2015-04-25 01:16 /tmp
drwxr-xr-x   - hdfs hadoop         0 2015-04-25 02:52 /user
drwxrwxrwt   - mapred hadoop         0 2015-04-25 02:52 /user/history
drwxr-xr-x   - hdfs   hadoop         0 2015-04-25 02:53 /var
drwxr-xr-x   - hdfs   hadoop         0 2015-04-25 02:53 /var/log
drwxr-xr-x   - yarn   mapred         0 2015-04-25 02:53 /var/log/hadoop-yarn
[root@master conf]#

 

 

11. Start yarn and Jobhistory server

 

On slave.hadoop.com

 

[root@slave ~]# sudo service hadoop-yarn-resourcemanager start
starting resourcemanager, logging to /var/log/hadoop-yarn/yarn-yarn-resourcemanager-slave.hadoop.com.out
Started Hadoop resourcemanager:                           [ OK ]
[root@slave ~]#

 

[root@slave ~]# sudo service hadoop-yarn-nodemanager start
starting nodemanager, logging to /var/log/hadoop-yarn/yarn-yarn-nodemanager-slave.hadoop.com.out
Started Hadoop nodemanager:                               [ OK ]
[root@slave ~]#

 

 

On master.hadoop.com

 

[root@master conf]# sudo service hadoop-mapreduce-historyserver start
starting historyserver, logging to /var/log/hadoop-mapreduce/mapred-mapred-historyserver-master.hadoop.com.out
15/04/25 02:56:01 INFO hs.JobHistoryServer: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting JobHistoryServer
STARTUP_MSG:   host = master.hadoop.com/192.168.111.130
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.6.0-cdh5.4.0
STARTUP_MSG:   classpath =
STARTUP_MSG:   build = http://github.com/cloudera/hadoop -r c788a14a5de9ecd968d1e2666e8765c5f018c271; compiled by 'jenkins' on 2015-04-21T19:18Z
STARTUP_MSG:   java = 1.7.0_79
-
-
-
************************************************************/
Started Hadoop historyserver:                             [ OK ]
[root@master conf]#

 

 

12. Create user for running mapreduce jobs

 

[root@master conf]# sudo -u hdfs hadoop fs -mkdir /user/kuldeep
[root@master conf]# sudo -u hdfs hadoop fs -chown kuldeep /user/kuldeep

 

 

 

13. Important: Don’t forget to set core hadoop services to auto start when OS boot ups.

 

On master.hadoop.com

 

[root@master conf]# sudo chkconfig hadoop-hdfs-namenode on
[root@master conf]# sudo chkconfig hadoop-mapreduce-historyserver on

 

On slave.hadoop.com

 

[root@slave ~]# sudo chkconfig hadoop-yarn-resourcemanager on
[root@slave ~]# sudo chkconfig hadoop-hdfs-secondarynamenode on
[root@slave ~]# sudo chkconfig hadoop-yarn-nodemanager on
[root@slave ~]# sudo chkconfig hadoop-hdfs-datanode on

 

 

Final step : check UIs :-)

 

Namenode UI

 

namenode

 

 

Yarn UI

 

 

yarn

 

 

Job History Server UI

 

 

jobhistory

 

 

Secondary Namenode UI

 

 

sec_nn

facebooktwittergoogle_plusredditpinterestlinkedinmailby feather