Archive for: April, 2015

Setting up Hortonworks Hadoop cluster in AWS

In this article we will discuss how to set up Hortonworks Hadoop cluster in AWS (Amazon Web Services).

Assuming you have a valid AWS login let us get started with:

  1. Launching an Amazon instance
  2. Pre-requisites for setting up Hadoop cluster in AWS
  3. Hadoop cluster Installation (via Ambari)

 

1. Launching an Amazon instance

 

a) Select EC2 from your Amazon Management Console.

b) The next step is to create the instance. Click on "Launch Instance".

c) We are going to use CentOS 6.5 from the AWS Marketplace.

 


 

d) Select the instance type. For this exercise we will use m3.xlarge. (Please select an appropriate instance type as per your requirements and budget; see http://aws.amazon.com/ec2/pricing/.)

e) Now configure the instance details. We chose to launch 2 instances and kept all other settings at their default values, as shown below.

f) Add storage. This will be used for your / volume.

 


g) Tag your instance

h) Now configure the security group. For Hadoop we need the TCP and ICMP ports, plus the HTTP ports for the various UIs, to be open. Thus, for this exercise we open all TCP and ICMP ports, but this can be restricted to only the required ports.

 


 

i) The ICMP rule was added after launching the instance. Yes, this is doable! You can edit the security group settings later as well.

 


 

j) Click on “Review and Launch” now.

k) It will ask you to create a new key pair and download it (a .pem file) to your local machine. This is used to set up passwordless SSH in the Hadoop cluster and is also required by the management UI.

l) Once your instances are launched, your EC2 Dashboard will show the details as below. You can rename your instances here for your reference, but these names won't be the hostnames of your instances :)

m) Please note down the Public DNS, Public IP, and Private IP of your instances, since these will be required later.

 


 

With this we complete the first part of Launching an Amazon instance successfully! :)

 

2. Pre-requisites for setting up Hadoop cluster in AWS

 

The items listed below are required before we move ahead with setting up the Hadoop cluster.

 

Generate a .ppk (private key) from the previously downloaded .pem file

1. Use PuTTYgen to do this. Import the .pem file, generate the keys, and save the private key (.ppk).

2. Now open PuTTY and use the Public DNS/Public IP to connect to the master host. Don't forget to load the .ppk (private key) in PuTTY.

3. Log in as "root".
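If you prefer a command line over the PuTTYgen GUI, the puttygen utility from the Linux putty-tools package can do the same conversion; a minimal sketch, where mycluster.pem/.ppk are example file names:

puttygen mycluster.pem -O private -o mycluster.ppk    # mycluster.pem/.ppk are example names; use your own key file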

 

Password-less SSH access among servers in cluster:

 

1. Use the WinSCP utility to copy the .pem file to your master host. This is used for passwordless SSH from the master to the slave nodes for starting services remotely.

2. While connecting from WinSCP to the master host, provide the Public DNS/Public IP and the username "root". Instead of a password, supply the .ppk for the connection via the Advanced settings.

3. The public key for this has already been placed in ~/.ssh/authorized_keys by AWS while launching your instances.

4. chmod 644 ~/.ssh/authorized_keys

5. chmod 400 ~/.ssh/<your .pem file>

6. Once you are logged in, run the below two commands on the master:

6.1 eval `ssh-agent` (these are backticks, not single quotes!)

6.2 ssh-add ~/.ssh/<your .pem file>

7. Now check whether you can SSH to your slave node from your master without a password. It should work.

8. Please remember this will be lost upon shell exit, and you will have to repeat the ssh-agent and ssh-add commands.
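For reference, the whole sequence on the master looks like the sketch below; mycluster.pem is an example file name and the slave address is a placeholder, so substitute your own values:

chmod 400 ~/.ssh/mycluster.pem                      # mycluster.pem is an example name for your key file
eval `ssh-agent`                                    # note: backticks
ssh-add ~/.ssh/mycluster.pem
ssh root@<slave Private IP or hostname> hostname    # should print the slave's hostname without prompting for a password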

 

Change hostname

 

1. Here you can change the hostname to the Public DNS or to something you like that is easy to remember. We will use something easy to remember, because once you stop your EC2 instance its Public DNS and Public IP change. Using the Public DNS would mean the extra work of updating the hosts file every time, and would also disrupt your cluster state on every startup.

2. If you wish to grant public access to your host, then choose the Public DNS/IP as the hostname.

3. Issue the command: hostname <your chosen hostname>

4. Repeat above step on all hosts!
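As a sketch, on CentOS 6 you can also make the new hostname survive a reboot by updating /etc/sysconfig/network (master1.hadoop.cluster is just an example name):

hostname master1.hadoop.cluster                                                    # takes effect immediately
sed -i 's/^HOSTNAME=.*/HOSTNAME=master1.hadoop.cluster/' /etc/sysconfig/network    # persists across reboots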

 

Update /etc/hosts file

 

1. Edit /etc/hosts using "vi" or any other editor and add the mapping of the Private IP to the hostname set above. (The Private IP is obtained from ifconfig, or from the instance details on the AWS EC2 console as noted earlier.)

2. Update this file to have the IP and hostname of every server in the cluster.

3. Repeat above two steps on all hosts!
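For example, with two instances the file could end up looking like the sketch below (the Private IPs and hostnames are placeholders; use the values you noted from the EC2 console):

172.31.10.11   master1.hadoop.cluster
172.31.10.12   slave1.hadoop.cluster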

 

Date/Time should be in sync

 

Check that your master and slave nodes' time and date are in sync; if not, please configure NTP to do so.
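On CentOS 6 a minimal way to do that, assuming the hosts can reach the default public NTP pool, is:

yum -y install ntp
service ntpd start
chkconfig ntpd on    # keep NTP running across reboots

Run the same on all hosts in the cluster.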

 

Install Java on master

 

yum install java-1.7.0-openjdk

 

Disable selinux

 

1. Ensure that in /etc/selinux/config file, “SELINUX=disabled”

Note : Please Repeat above step on all hosts!
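If you prefer to do this from the shell, a quick sketch (run on every host):

sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config    # persists across reboots
setenforce 0                                                    # switch to permissive mode right away; the config change applies fully after a reboot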

 

Firewall should NOT be running

 

1. service iptables stop

2. chkconfig iptables off (This is to ensure that it does not start again on reboot)

3. Repeat above two steps on all hosts!

 

Install the HDP and Ambari repositories

 

1. wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/GA/2.2.0.0/hdp.repo -O /etc/yum.repos.d/HDP.repo

2. wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.7.0/ambari.repo -O /etc/yum.repos.d/ambari.repo

3. Repeat above two steps on all hosts!

 

 

3. Hadoop cluster Installation (via Ambari)

 

We are going to set up the Ambari server and then proceed to install Hadoop and other components as required, using the Ambari UI.

 

1. yum install ambari-server

2. ambari-server setup (use defaults whenever asked for input during this setup)

3. ambari-server start     (starting the ambari server)

4. Now you can access the Ambari UI in your browser using the public DNS/Public IP and port 8080

5. Login with username: admin and password: admin (This is default login. You can change this via UI)

6. Launch the cluster and proceed with the installation as per your requirements.

7. Remember below things:

7.1 Register hosts with the hostname you have set (Public DNS/IP or any other name you provided).

7.2 While registering hosts in the UI, import/upload the .pem file and not the .ppk, otherwise the registration of hosts will fail.

7.3 If it fails with an error regarding openssl, then please update the openssl libraries on all your hosts.

8. After successful deployment, you can now check the various UIs and run some test jobs on your cluster! Good luck :)

And yes, stop the instances once the required work is done to avoid unnecessary billing! Enjoy Hadooping! :)

 

 

 

 


Install multinode cloudera hadoop cluster cdh5.4.0 manually

This document will guide you through installing a multinode Cloudera Hadoop cluster (CDH 5.4.0) without Cloudera Manager.

 

In this tutorial I have used two CentOS 6.6 virtual machines, viz. master.hadoop.com and slave.hadoop.com.

 

Prerequisites:

 

CentOS 6.X

JDK 1.7.x is needed in order to get CDH working. If you have a lower version of the JDK, please uninstall it and install JDK 1.7.x.

 

Master machine – master.hadoop.com (192.168.111.130)

Daemons that we are going to install on the master are:

Namenode

HistoryServer

 

Slave machine – slave.hadoop.com (192.168.111.131)

Daemons that we are going to install on the slave are:

Resource Manager (Yarn)

Node-manager

Secondary Namenode

Datanode

 

Important configuration before proceeding further: please add the hostname and IP information of both hosts to the /etc/hosts file on each host.

 

[root@master ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.111.130         master.hadoop.com
192.168.111.131         slave.hadoop.com

 

 

[root@slave ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.111.130         master.hadoop.com
192.168.111.131        slave.hadoop.com

 

Please verify that both hosts can ping each other :-)

 

Also,

 

Please stop the firewall and disable SELinux.

 

To stop the firewall in CentOS:

 

service iptables stop && chkconfig iptables off

 

To disable SELinux:

 

vim /etc/selinux/config

 

Once the file is opened, please verify that "SELINUX=disabled" is set.

 

 

1. Date should be in sync

 

Please make sure that the master and slave machines' dates are in sync; if not, please configure NTP to do so.

 

2. Passwordless SSH must be set up from master -> slave

 

To set up passwordless SSH, follow the procedure below:

 

2a. Generate an RSA key pair using the ssh-keygen command

 

[root@master conf]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
/root/.ssh/id_rsa already exists.
Overwrite (y/n)?

 

2b. Copy generated public key to slave.hadoop.com

 

[root@master conf]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave.hadoop.com

 

Now try logging into the machine, with "ssh 'root@slave.hadoop.com'", and check in:

 

.ssh/authorized_keys

 

to make sure we haven’t added extra keys that you weren’t expecting.

 

2c. Now try connecting to slave.hadoop.com using ssh

 

[root@master conf]# ssh root@slave.hadoop.com
Last login: Fri Apr 24 14:20:43 2015 from master.hadoop.com
[root@slave ~]# logout
Connection to slave.hadoop.com closed.
[root@master conf]#

 

That’s it! You have successfully configured passwordless ssh between master and slave node.

 

3. Internet connection

 

Please make sure that you have a working internet connection, as we are going to download the CDH packages in the next steps.

 

4. Install cdh repo

 

4a. download cdh repo rpm

 

[root@master ~]# wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm

 

 

4b. install cdh repo downloaded in above step

 

[root@master ~]# yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
Loaded plugins: fastestmirror, refresh-packagekit, security
setting up Local Package Process
....
Complete!

 

 

4c. do the same steps on slave node

 

 

[root@slave ~]# wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm

 

 

[root@slave ~]# yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
Loaded plugins: fastestmirror, refresh-packagekit, security
Setting up Local Package Process
……
Complete!

 

 

5. Install and deploy ZooKeeper.

 

 

[root@master ~]# yum -y install zookeeper-server
Loaded plugins: fastestmirror, refresh-packagekit, security
Setting up Install Process
…..
Complete!

 

 

5a. create zookeeper dir and apply permissions

 

 

[root@master ~]# mkdir -p /var/lib/zookeeper
[root@master ~]# chown -R zookeeper /var/lib/zookeeper/

 

 

5b. Init zookeeper and start the service

 

[root@master ~]# service zookeeper-server init
No myid provided, be sure to specify it in /var/lib/zookeeper/myid if using non-standalone
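The "No myid provided" warning above is harmless for a standalone (single-server) ZooKeeper like ours. If you later grow this into a multi-server ensemble, Cloudera's init script can take the server id up front; a sketch, assuming this node would be server 1:

service zookeeper-server init --myid=1    # only needed for a multi-server ensemble; the id must be unique per server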

 

 

[root@master ~]# service zookeeper-server start
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Starting zookeeper ... STARTED

 

 

6. Install namenode on master machine

 

yum -y install hadoop-hdfs-namenode

 

7. Install secondary namenode on slave machine

 

yum -y install hadoop-hdfs-secondarynamenode

 

8. Install resource manager on slave machine

 

yum -y install hadoop-yarn-resourcemanager

 

9. Install nodemanager, datanode & mapreduce on slave node

 

yum -y install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce

 

10. Install history server and yarn proxyserver on master machine

 

yum -y install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver

 

11. On both machines you can install the hadoop-client package

 

yum -y install hadoop-client

 

Now we are done with the installation, it’s time to deploy HDFS!

 

1. On each node, execute below commands :

 

[root@master ~]# cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster
[root@master ~]# alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
[root@master ~]# alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster

 

 

[root@slave ~]# cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster
[root@slave ~]# alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
[root@slave ~]# alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster

 

 

2. Let’s configure hdfs properties now :

 

Go to the /etc/hadoop/conf/ directory on the master node and edit the below property files:

 

2a. vim /etc/hadoop/conf/core-site.xml

 

Add the below lines in it under the <configuration> tag

 

<property>
<name>fs.defaultFS</name>
<value>hdfs://master.hadoop.com:8020</value>
</property>

 

 

2b. vim /etc/hadoop/conf/hdfs-site.xml

 

<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/1/dfs/nn,file:///nfsmount/dfs/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/1/dfs/dn,file:///data/2/dfs/dn,file:///data/3/dfs/dn,file:///data/4/dfs/dn</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>192.168.111.130:50070</value>
<description>
The address and the base port on which the dfs NameNode Web UI will listen.
</description>
</property>

 

 

3. scp core-site.xml and hdfs-site.xml to slave.hadoop.com at /etc/hadoop/conf/

 

[root@master conf]# scp core-site.xml hdfs-site.xml slave.hadoop.com:/etc/hadoop/conf/
core-site.xml                                                                                                                                                 100% 1001     1.0KB/s   00:00
hdfs-site.xml                                                                                                                                                100% 1669     1.6KB/s   00:00
[root@master conf]#

 

 

4. Create local directories:

 

On master host run below commands:

 

mkdir -p /data/1/dfs/nn /nfsmount/dfs/nn
chown -R hdfs:hdfs /data/1/dfs/nn /nfsmount/dfs/nn
chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn
chmod go-rx /data/1/dfs/nn /nfsmount/dfs/nn

 

 

On slave host run below commands:

 

mkdir -p /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn
chown -R hdfs:hdfs /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn

 

 

5. Format the namenode :

 

sudo -u hdfs hdfs namenode -format

 

 

6. Start hdfs services

 

Run the below command on both the master and the slave

 

for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do service $x start ; done

 

7. Create hdfs tmp dir

 

Run on any of the Hadoop nodes

 

[root@slave ~]# sudo -u hdfs hadoop fs -mkdir /tmp
[root@slave ~]# sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

 

 

Congratulations! You have deployed hdfs successfully :-)

 

 

Deploy Yarn

 

1. Prepare yarn configuration properties

 

Replace your /etc/hadoop/conf/mapred-site.xml with the below contents on the master host

 

[root@master conf]# cat mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
</configuration>

 

 

2. Replace your /etc/hadoop/conf/yarn-site.xml with the below contents on the master host

 

[root@master conf]# cat yarn-site.xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<description>List of directories to store localized files in.</description>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///var/lib/hadoop-yarn/cache/${user.name}/nm-local-dir</value>
</property>
<property>
<description>Where to store container logs.</description>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///var/log/hadoop-yarn/containers</value>
</property>
<property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://master.hadoop.com:8020/var/log/hadoop-yarn/apps</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>slave.hadoop.com</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///data/1/yarn/local,file:///data/2/yarn/local,file:///data/3/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///data/1/yarn/logs,file:///data/2/yarn/logs,file:///data/3/yarn/logs</value>
</property>
</configuration>

 

 

3. Copy modified files to slave machine.

 

 

[root@master conf]# scp mapred-site.xml yarn-site.xml slave.hadoop.com:/etc/hadoop/conf/
mapred-site.xml                                                                                                                                              100% 1086     1.1KB/s   00:00
yarn-site.xml                                                                                                                                                 100% 2787     2.7KB/s   00:00
[root@master conf]#

 

 

 

4. Configure local directories for yarn

 

To be done on the YARN node, i.e. slave.hadoop.com in our case

 

[root@slave ~]# mkdir -p /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
[root@slave ~]# mkdir -p /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs
[root@slave ~]# chown -R yarn:yarn /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
[root@slave ~]# chown -R yarn:yarn /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs

 

 

5. Configure the history server.

 

Add below properties in mapred-site.xml

 

<property>
<name>mapreduce.jobhistory.address</name>
<value>master.hadoop.com:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master.hadoop.com:19888</value>
</property>

 

 

6. Configure proxy settings for history server

 

Add below properties in /etc/hadoop/conf/core-site.xml

 

<property>
<name>hadoop.proxyuser.mapred.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.hosts</name>
<value>*</value>
</property>

 

 

7. Copy modified files to slave.hadoop.com

 

[root@master conf]# scp mapred-site.xml core-site.xml slave.hadoop.com:/etc/hadoop/conf/
mapred-site.xml                                                                                                                                               100% 1299     1.3KB/s   00:00
core-site.xml                                                                                                                                                 100% 1174     1.2KB/s   00:00
[root@master conf]#

 

 

8. Create history directories and set permissions

 

[root@master conf]# sudo -u hdfs hadoop fs -mkdir -p /user/history
[root@master conf]# sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
[root@master conf]# sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history

 

9. Create log directories and set permissions

 

[root@master conf]# sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
[root@master conf]# sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn

 

10. Verify hdfs file structure

 

[root@master conf]# sudo -u hdfs hadoop fs -ls -R /
drwxrwxrwt   - hdfs hadoop         0 2015-04-25 01:16 /tmp
drwxr-xr-x   - hdfs hadoop         0 2015-04-25 02:52 /user
drwxrwxrwt   - mapred hadoop         0 2015-04-25 02:52 /user/history
drwxr-xr-x   - hdfs   hadoop         0 2015-04-25 02:53 /var
drwxr-xr-x   - hdfs   hadoop         0 2015-04-25 02:53 /var/log
drwxr-xr-x   - yarn   mapred         0 2015-04-25 02:53 /var/log/hadoop-yarn
[root@master conf]#

 

 

11. Start yarn and Jobhistory server

 

On slave.hadoop.com

 

[root@slave ~]# sudo service hadoop-yarn-resourcemanager start
starting resourcemanager, logging to /var/log/hadoop-yarn/yarn-yarn-resourcemanager-slave.hadoop.com.out
Started Hadoop resourcemanager:                           [ OK ]
[root@slave ~]#

 

[root@slave ~]# sudo service hadoop-yarn-nodemanager start
starting nodemanager, logging to /var/log/hadoop-yarn/yarn-yarn-nodemanager-slave.hadoop.com.out
Started Hadoop nodemanager:                               [ OK ]
[root@slave ~]#

 

 

On master.hadoop.com

 

[root@master conf]# sudo service hadoop-mapreduce-historyserver start
starting historyserver, logging to /var/log/hadoop-mapreduce/mapred-mapred-historyserver-master.hadoop.com.out
15/04/25 02:56:01 INFO hs.JobHistoryServer: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting JobHistoryServer
STARTUP_MSG:   host = master.hadoop.com/192.168.111.130
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.6.0-cdh5.4.0
STARTUP_MSG:   classpath =
STARTUP_MSG:   build = http://github.com/cloudera/hadoop -r c788a14a5de9ecd968d1e2666e8765c5f018c271; compiled by 'jenkins' on 2015-04-21T19:18Z
STARTUP_MSG:   java = 1.7.0_79
-
-
-
************************************************************/
Started Hadoop historyserver:                             [ OK ]
[root@master conf]#

 

 

12. Create user for running mapreduce jobs

 

[root@master conf]# sudo -u hdfs hadoop fs -mkdir /user/kuldeep
[root@master conf]# sudo -u hdfs hadoop fs -chown kuldeep /user/kuldeep

 

 

 

13. Important: Don't forget to set the core Hadoop services to auto-start when the OS boots up.

 

On master.hadoop.com

 

[root@master conf]# sudo chkconfig hadoop-hdfs-namenode on
[root@master conf]# sudo chkconfig hadoop-mapreduce-historyserver on

 

On slave.hadoop.com

 

[root@slave ~]# sudo chkconfig hadoop-yarn-resourcemanager on
[root@slave ~]# sudo chkconfig hadoop-hdfs-secondarynamenode on
[root@slave ~]# sudo chkconfig hadoop-yarn-nodemanager on
[root@slave ~]# sudo chkconfig hadoop-hdfs-datanode on

 

 

Final step: check the UIs :-)
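With the configuration above, the web UIs should be reachable at the addresses below. Ports 50070 and 19888 were set explicitly in our config files; 8088 and 50090 are the usual Hadoop 2.x defaults, so adjust them if you changed the defaults.

NameNode UI              : http://master.hadoop.com:50070
YARN ResourceManager UI  : http://slave.hadoop.com:8088
Job History Server UI    : http://master.hadoop.com:19888
Secondary NameNode UI    : http://slave.hadoop.com:50090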

 

Namenode UI

 


 

 

Yarn UI

 

 


 

 

Job History Server UI

 

 


 

 

Secondary Namenode UI

 

 



Apache Ranger installation and Configuration in HDP2.2


 

In this tutorial I am going to cover how to install and configure Ranger on Hortonworks Data Platform (HDP) 2.2.

 

What is Ranger?

 

It provides central security policy administration in a Hadoop environment. It covers 3 aspects:

 

Authentication : by the Apache Knox Gateway via the HTTP/REST API

Authorization : Fine-grained access control provides flexibility in defining policies on:

  1. folder and file level, via HDFS
  2. database, table and column level, via Hive
  3. table, column family and column level, via HBase

 

Audit : Tracks access to the system via extensive user access auditing in HDFS, Hive and HBase

 

Installation and Configuration:

 

Let us first see what Ranger packages are available (optional).

Note: the plugin packages listed below are those currently available for Ranger.

[root@hdpcm ~]# yum search ranger

Loaded plugins: fastestmirror, priorities, security

Loading mirror speeds from cached hostfile

* base: centos.bytenet.in

* extras: centos.bytenet.in

* updates: centos.bytenet.in

================================================================= N/S Matched: ranger =================================================================

ranger.noarch : ranger HDP virtual package

ranger-admin.noarch : ranger-admin HDP virtual package

ranger-debuginfo.noarch : ranger-debuginfo HDP virtual package

ranger-hbase-plugin.noarch : ranger-hbase-plugin HDP virtual package

ranger-hdfs-plugin.noarch : ranger-hdfs-plugin HDP virtual package

ranger-hive-plugin.noarch : ranger-hive-plugin HDP virtual package

ranger-knox-plugin.noarch : ranger-knox-plugin HDP virtual package

ranger-storm-plugin.noarch : ranger-storm-plugin HDP virtual package

ranger-usersync.noarch : ranger-usersync HDP virtual package

ranger_2_2_0_0_2041-admin.x86_64 : Web Interface for Ranger

ranger_2_2_0_0_2041-debuginfo.x86_64 : Debug information for package ranger_2_2_0_0_2041

ranger_2_2_0_0_2041-hbase-plugin.x86_64 : ranger plugin for hbase

ranger_2_2_0_0_2041-hdfs-plugin.x86_64 : ranger plugin for hdfs

ranger_2_2_0_0_2041-hive-plugin.x86_64 : ranger plugin for hive

ranger_2_2_0_0_2041-knox-plugin.x86_64 : ranger plugin for knox

ranger_2_2_0_0_2041-storm-plugin.x86_64 : ranger plugin for storm

ranger_2_2_0_0_2041-usersync.x86_64 : Synchronize User/Group information from Corporate LD/AD or Unix

 

Name and summary matches only, use “search all” for everything.

 

Now let us start –

Step 1: Go ahead and install Ranger

  1. yum install ranger-admin
  2. yum install ranger-usersync
  3. yum install ranger-hdfs-plugin
  4. yum install ranger-hive-plugin
  5. set JAVA_HOME

 

export JAVA_HOME=/usr/jdk64/jdk1.7.0_67 (substitute this with jdk path on your system)

echo "export JAVA_HOME=/usr/jdk64/jdk1.7.0_67" >> ~/.bashrc

 

Step 2: Set up the Ranger Admin UI

 

We need to run the setup script present at the "/usr/hdp/current/ranger-admin" location. It will:

 

  1. add ranger user and group.
  2. set up ranger DB (Please ensure you know your MySQL root password since it will ask for it while setting up the ranger DB)
  3. create rangeradmin and rangerlogger MySQL users with appropriate grants.

 

Besides the MySQL root password, whenever it prompts for a password for setting up the Ranger and audit DBs, please enter 'hortonworks' or anything else you wish. Just remember it for future use.

 

[root@hdpcm ranger-admin]# pwd

/usr/hdp/current/ranger-admin

 

[root@hdpcm ranger-admin]# ./setup.sh

[2015/03/31 15:58:41]:   ——— Running XASecure PolicyManager Web Application Install Script ———

[2015/03/31 15:58:41]: [I] uname=Linux

[2015/03/31 15:58:41]: [I] hostname=hdpcm.dm.com

[2015/03/31 15:58:41]: [I] DB_FLAVOR=MYSQL

~

~

~

Installation of XASecure PolicyManager Web Application is completed.

 

Step 3: Start ranger-admin service

 

[root@hdpcm ews]# pwd

/usr/hdp/current/ranger-admin/ews

 

[root@hdpcm ews]# sh start-ranger-admin.sh

Apache Ranger Admin has started

[root@hdpcm ews]#

 

Logs available at : /usr/hdp/current/ranger-admin/ews/logs

 

Step 4: Set up ranger-usersync

By default it will sync UNIX users to the Ranger UI. You can also sync it with LDAP. This article syncs UNIX users.

 

  1. Edit /usr/hdp/current/ranger-usersync/install.properties file.
  2. Update “POLICY_MGR_URL” to point to your ranger host:

POLICY_MGR_URL = http://<IP of your Ranger host>:6080
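If you prefer to script this edit, a minimal sketch (192.168.1.10 is a placeholder for your Ranger host's IP):

sed -i 's|^POLICY_MGR_URL.*|POLICY_MGR_URL = http://192.168.1.10:6080|' /usr/hdp/current/ranger-usersync/install.properties
grep POLICY_MGR_URL /usr/hdp/current/ranger-usersync/install.properties    # verify the change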

 

Now run /usr/hdp/current/ranger-usersync/setup.sh

 

Step 5: Start the ranger-usersync service

 

[root@hdpcm ranger-usersync]# pwd

/usr/hdp/current/ranger-usersync

 

[root@hdpcm ranger-usersync]# sh start.sh

Starting UnixAuthenticationService

UnixAuthenticationService has started successfully.

 

Congratulations!! You have installed and configured Ranger successfully :)

 

Now log in to the Ranger Web UI by hitting the below URL:

http://<ranger-host>:6080

 

The default password for the admin user is "admin". Once you log in, you can change this admin password via the profile settings.

 


 

Once you log in successfully, you will see below page:

 


 

In the next article, I will discuss setting up policies for HDFS/Hive etc. via Ranger. Stay tuned for more updates! :-)

 

Please feel free to comment or email me if you have any questions or doubts.


Install and Configure Transparent Data Encryption in hadoop HDP 2.2

Hey guys, hope you are all doing well :-) Today I'm going to explain how to install and configure Transparent Data Encryption in Hadoop – HDP 2.2.

 

Why do we need Transparent Encryption in HDFS?

 

  • Sometimes we want to encrypt only selected files/directories in HDFS, to save on overhead and protect performance – this is now possible with HDFS Transparent Data Encryption (TDE).

 

  • HDFS TDE allows users to take advantage of HDFS native data encryption without any application code changes.

 

  • Once an HDFS admin sets up encryption, HDFS takes care of the actual encryption/decryption without the end-user having to manually encrypt/decrypt a file.

 

 

Building blocks of TDE (Transparent Data encryption)

 

  1. Encryption Zone (EZ): An HDFS admin creates an encryption zone and links it to an empty HDFS directory and an encryption key. Any files put in the directory are automatically encrypted by HDFS.

 

  2. Key Management Server (KMS): KMS is responsible for storing the encryption keys. KMS provides a REST API and access control on the keys stored in the KMS.

 

  3. Key Provider API: The Key Provider API is the glue used by the HDFS NameNode and client to connect with the Key Management Server.

 

Accessing data within an encryption zone

 

EZ Key – Encryption Zone key

DEK – Data Encryption Key

EDEK – Encrypted DEK

 

  • When creating a new file in an encryption zone, the NameNode asks the KMS to generate a new EDEK encrypted with the encryption zone’s key. The EDEK is then stored persistently as part of the file’s metadata on the NameNode.

 

  • When reading a file within an encryption zone, the NameNode provides the client with the file’s EDEK and the encryption zone key version used to encrypt the EDEK. The client then asks the KMS to decrypt the EDEK, which involves checking that the client has permission to access the encryption zone key version. Assuming that is successful, the client uses the DEK to decrypt the file’s contents.

 

  • All of the above steps for the read and write path happen automatically through interactions between the DFSClient, the NameNode, and the KMS.

 

  • Access to encrypted file data and metadata is controlled by normal HDFS filesystem permissions. This means that if HDFS is compromised (for example, by gaining unauthorized access to an HDFS superuser account), a malicious user only gains access to ciphertext and encrypted keys. However, since access to encryption zone keys is controlled by a separate set of permissions on the KMS and key store, this does not pose a security threat.

 

 

Use Cases of TDE

 

  • Data encryption is required by a number of different government, financial, and regulatory entities. For example, the health-care industry has HIPAA regulations, the card payment industry has PCI DSS regulations, and the US government has FISMA regulations. Having transparent encryption built into HDFS makes it easier for organizations to comply with these regulations.

 

  • Encryption can also be performed at the application-level, but by integrating it into HDFS, existing applications can operate on encrypted data without changes. This integrated architecture implies stronger encrypted file semantics and better coordination with other HDFS functions.

 

Installation steps:

 

Step 1: Configure the Key Management Service (KMS)

 

Extract the Key Management Server bits from the package included in Apache Hadoop

 

# mkdir -p /usr/kms-demo
# cp /usr/hdp/current/hadoop-client/mapreduce.tar.gz /usr/kms-demo/
# export KMS_ROOT=/usr/kms-demo

Where KMS_ROOT refers to the directory where mapreduce.tar.gz has been extracted (/usr/kms-demo)

# cd $KMS_ROOT
# tar -xvf mapreduce.tar.gz

 

Start the Key Management Server

Appendix A covers advanced configuration of the Key Management Server. The following basic scenario uses the default configurations:

# cd $KMS_ROOT/hadoop/sbin/
# ./kms.sh run

You’ll see the following console output on a successful start: 

Jan 10, 2015 11:07:33 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 1764 ms

 

Step 2: Configure Hadoop to use the KMS as the key provider

Hadoop configuration can be managed through either Ambari or through manipulating the XML configuration files. Both options are shown here.

Configure Hadoop to use KMS using Ambari

You can use Ambari to configure this in the HDFS configuration section.

Log in to Ambari through your web browser (admin/admin):


 

On the Ambari Dashboard, click HDFS service and then the “Configs” tab.

 


 

Add the following custom properties for HADOOP key management and HDFS encryption zone feature to find the right KMS key provider:

 

  •       Custom core-site

Add the property "hadoop.security.key.provider.path" with the value "kms://http@localhost:16000/kms"

 


 

Make sure to match the host of the node where you started KMS to the value in kms://http@localhost:16000/kms
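If you manage the configuration files by hand instead of through Ambari, the same setting goes into core-site.xml on the relevant hosts; a sketch using the value from above (restart HDFS afterwards, as in the next step):

<property>
<name>hadoop.security.key.provider.path</name>
<value>kms://http@localhost:16000/kms</value>
</property>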

Step 3: Save the configuration and restart HDFS after setting these properties.

 

Create Encryption Keys

Log into the Sandbox as the hdfs superuser. Run the following commands to create a key named "key1" with a length of 256 and show the result:

# su - hdfs

# hadoop key create key1 -size 256

# hadoop key list -metadata

As an Admin, Create an Encryption Zone in HDFS

 

Run the following commands to create an encryption zone under /secureweblogs with zone key named “key1” and show the results:

# hdfs dfs -mkdir /secureweblogs

# hdfs crypto -createZone -keyName key1 -path /secureweblogs

# hdfs crypto -listZones

Note: Crypto command requires the HDFS superuser privilege

Please refer to Implementation and testing section to see how to test our setup.

 

 

Implementation & Testing

 

Below are the results of the implementation & testing of TDE on the Hortonworks Sandbox!

Details – installed and configured KMS, and created 2 encryption zones on HDFS:

 

  1. /secureweblogs with EZ key key1
  2. /Huezone with EZ key hueKey

 

Output:

[hdfs@sandbox conf]$ hdfs crypto -listZones

/secureweblogs key1

/Huezone       hueKey

 

 

As the ‘hive’ user, you can transparently write data to that directory.

 

Output :

[hive@sandbox ~]# hdfs dfs -copyFromLocal web.log /secureweblogs

[hive@sandbox ~]# hdfs dfs -ls /secureweblogs

Found 1 items

-rw-r--r--   1 hive hive       1310 2015-01-11 23:28 /secureweblogs/web.log

 

As the ‘hive’ user, you can transparently read data from that directory, and verify that the exact file that was loaded into HDFS is readable in its unencrypted form.

 

 

[hive@sandbox ~]# hdfs dfs -copyToLocal /secureweblogs/web.log read.log

[hive@sandbox ~]# diff web.log read.log

[hive@sandbox conf]$ hadoop fs -cat /secureweblogs/web.log

this is web log

[hive@sandbox conf]$

 

Other users will not be able to write data or read from the encrypted zone:

 

[root@sandbox conf]# hadoop fs -cat /secureweblogs/web.log

cat: Permission denied: user=root, access=EXECUTE, inode="/secureweblogs":hive:hive:drwxr-x---

 

 

The hdfs superuser has access to the raw namespace; however, they can only see the encrypted contents, not the actual data.

 

[hdfs@sandbox conf]$ whoami

hdfs

[hdfs@sandbox conf]$ hadoop fs -cat /.reserved/raw/secureweblogs/web.log

▒5     ▒0▒7s▒9▒]i▒▒▒

[hdfs@sandbox conf]$

 

 

Pros & Cons

 

Pros:

  1. No need to encrypt the entire disk
  2. No application code changes required
  3. Encryption at rest

 

Cons:

  1. No “in-transit” encryption in HDP yet.
  2. The key management is rudimentary and not scalable for production.

 

Group discussion on this

https://www.linkedin.com/grp/post/3638279-5978515661268279299

http://hortonworks.com/community/forums/topic/hdfs-transparent-data-encryption/
