In this article we will discuss how to set up a Hortonworks Hadoop cluster on AWS (Amazon Web Services).
Assuming you have a valid AWS login, let us get started with:
Launching an Amazon instance
Pre-requisites for setting up Hadoop cluster in AWS
Hadoop cluster Installation (via Ambari)
1. Launching an Amazon instance
a) Select EC2 from your AWS Management Console.
b) The next step is to create the instance. Click on “Launch Instance”.
c) We are going to use CentOS 6.5 from the AWS Marketplace.
d) Select Instance type. For this exercise we will use m3.xlarge (Please select appropriate instance as per your requirement and budget from http://aws.amazon.com/ec2/pricing/)
e) Now configure the instance details. We chose to launch 2 instances and left all other settings at their default values.
f) Add storage. This will be used for your root (/) volume.
g) Tag your instance.
h) Now configure the security group. Hadoop needs TCP and ICMP traffic to be allowed, including the HTTP ports used by the various web UIs. For this exercise we open all TCP ports and all ICMP traffic, but this can be restricted to only the required ports.
i) The ICMP rule was added after launching the instance. Yes, this is doable! You can edit the security group settings later as well.
j) Click on “Review and Launch” now.
k) You will be asked to create a new key pair and download it (.pem file) to your local machine. This key is used to set up passwordless SSH in the Hadoop cluster and is also required by the management UI.
l) Once your instances are launched, you will see their details on your EC2 Dashboard. You can rename your instances here for your reference, but these names won’t be the hostnames of your instances :)
m) Please note down the Public DNS, Public IP, Private IP of your instances since these will be required later.
With this we complete the first part of “Launching an Amazon instance” successfully!
2. Pre-requisites for setting up Hadoop cluster in AWS
The items listed below must be in place before we move ahead with setting up the Hadoop cluster.
Generate .ppk (private key) from the previously downloaded .pem file
1. Use PuTTYgen to do this. Import the .pem file and save the private key as a .ppk file.
2. Now open PuTTY and use the public DNS/public IP to connect to the master host. Don’t forget to load the .ppk (private key) in PuTTY.
3. Login as “root”
Password-less SSH access among servers in cluster:
1. Use the WinSCP utility to copy the .pem file to your master host. This is used for passwordless SSH from the master to the slave nodes for starting services remotely.
2. While connecting from WinSCP to the master host, provide the Public DNS/Public IP and the username “root”. Instead of a password, supply the .ppk by clicking on the Advanced button.
3. The public key for this is already placed in ~/.ssh/authorized_keys by AWS when your instances are launched.
4. chmod 644 ~/.ssh/authorized_keys
5. chmod 400 ~/.ssh/<your .pem file>
6. Once you are logged in run the below 2 commands on master:
6.1 eval `ssh-agent` (these are backticks, not single quotes! The backtick shares a key with the tilde sign)
6.2 ssh-add ~/.ssh/<your .pem file>
7. Now check whether you can ssh to your slave node from your master without a password. It should work.
8. Please remember that this will be lost when the shell exits, and you will have to repeat the ssh-agent and ssh-add commands.
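Since the key permission, ssh-agent and ssh-add steps above have to be repeated after every new login, it may help to collect them in a small helper. The following is only a sketch: the function names secure_key and load_cluster_key and the key path in the usage comment are placeholders, not names from this setup.

```shell
# Sketch only: secure_key, load_cluster_key and the example key path
# are hypothetical placeholders.

# Make the private key readable by its owner only (same as step 5).
secure_key() {
    chmod 400 "$1"
}

# Tighten permissions, start an agent if this shell has none, load the key.
load_cluster_key() {
    secure_key "$1" || return 1

    # SSH_AUTH_SOCK is set when an agent is already available to this shell.
    if [ -z "${SSH_AUTH_SOCK:-}" ]; then
        eval "$(ssh-agent -s)" >/dev/null
    fi

    ssh-add "$1"
}

# Usage (on the master, after each new login):
#   load_cluster_key ~/.ssh/mykey.pem
#   ssh <slave hostname> hostname    # should not ask for a password
```

Checking SSH_AUTH_SOCK first avoids starting a second agent when one is already running in the current shell.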
Change the hostname
1. Here you can change the hostname to the Public DNS or to something you like that is easy to remember. We will use something easy to remember because once you stop your EC2 instance, its Public DNS and Public IP change. That would mean the extra work of updating the hosts file every time, and it would also disrupt your cluster state on every startup.
2. If you wish to grant public access to your host, then choose to set the hostname to the Public DNS/IP.
3. Issue the command: hostname <your chosen hostname>
4. Repeat above step on all hosts!
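Note that the hostname command changes the name only until the next reboot. On CentOS 6 the persistent name is read from the HOSTNAME= line in /etc/sysconfig/network, so a minimal sketch for keeping the two in step could look like this. The function name set_persistent_hostname is made up for illustration, and master.hadoop.com is just an example name.

```shell
# Sketch only: set_persistent_hostname is a hypothetical helper.
# Usage: set_persistent_hostname NAME [FILE]
# FILE defaults to /etc/sysconfig/network (CentOS 6).
set_persistent_hostname() {
    name="$1"
    file="${2:-/etc/sysconfig/network}"

    if grep -q '^HOSTNAME=' "$file" 2>/dev/null; then
        # Replace the existing HOSTNAME= line in place.
        sed -i "s/^HOSTNAME=.*/HOSTNAME=$name/" "$file"
    else
        # No HOSTNAME= line yet, so append one.
        echo "HOSTNAME=$name" >> "$file"
    fi
}

# Usage (as root, on every host; the name is only an example):
#   hostname master.hadoop.com
#   set_persistent_hostname master.hadoop.com
```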
Update /etc/hosts file
1. Edit /etc/hosts using “vi” or any other editor and add the mapping of the Private IP to the hostname chosen above. (The Private IP can be obtained from ifconfig or from the instance details on the AWS EC2 console, as noted earlier.)
2. Update this file to have IP and hostname of all your servers in cluster
3. Repeat above two steps on all hosts!
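If you prefer not to edit the file by hand on every host, the /etc/hosts update above could be scripted roughly as follows. This is a sketch under assumptions: add_host_entry is a made-up helper name, and the IP addresses and hostnames in the usage comment are placeholders for your own Private IPs and chosen names.

```shell
# Sketch only: add_host_entry is a hypothetical helper.
# Usage: add_host_entry IP HOSTNAME [FILE]
# FILE defaults to /etc/hosts.
add_host_entry() {
    ip="$1"
    host="$2"
    file="${3:-/etc/hosts}"

    # Do nothing if the hostname is already mapped on a non-comment line.
    if awk -v h="$host" '
            $1 ~ /^#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }' "$file"; then
        return 0
    fi

    printf '%s\t%s\n' "$ip" "$host" >> "$file"
}

# Usage (as root, on every host; values are placeholders):
#   add_host_entry 10.0.0.11 master.hadoop.com
#   add_host_entry 10.0.0.12 slave.hadoop.com
```

Because existing names are skipped, the same commands can be run again on all hosts without creating duplicate lines.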
Date/Time should be in sync
Check that the date and time on your master and slave nodes are in sync. If they are not, configure NTP to keep them in sync.
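One quick way to check the clocks is to compare the epoch seconds reported by the two hosts. The sketch below assumes passwordless SSH from master to slave is already working; clock_skew_seconds is a made-up helper name, and the NTP commands in the comments are the usual CentOS 6 ones, shown as a suggestion rather than as part of the original steps.

```shell
# Sketch only: clock_skew_seconds is a hypothetical helper.
# Prints the absolute difference between two epoch timestamps.
clock_skew_seconds() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( -d ))
    fi
    echo "$d"
}

# Usage (on the master):
#   clock_skew_seconds "$(date +%s)" "$(ssh <slave hostname> date +%s)"
#
# If the difference is more than a second or two, NTP can be set up
# on each host (CentOS 6):
#   yum install ntp
#   service ntpd start
#   chkconfig ntpd on
```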
Install Java on master
yum install java-1.7.0-openjdk
SELinux should be disabled
1. Ensure that the /etc/selinux/config file contains “SELINUX=disabled”
Note: Please repeat the above step on all hosts!
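The SELinux setting above can also be applied with a one-line sed instead of an editor. This is a minimal sketch; disable_selinux is a made-up function name, and the file argument exists only so the helper can be tried on a copy first.

```shell
# Sketch only: disable_selinux is a hypothetical helper.
# Usage: disable_selinux [FILE]
# FILE defaults to /etc/selinux/config.
disable_selinux() {
    file="${1:-/etc/selinux/config}"
    # Only the SELINUX= line is rewritten; SELINUXTYPE= is left untouched.
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$file"
}

# Usage (as root, on every host):
#   disable_selinux
#   setenforce 0    # permissive for the running session
```

The config file is read at boot, so the disabled state takes effect after a reboot; setenforce 0 only switches the running system to permissive mode until then.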
Firewall should NOT be running
1. service iptables stop
2. chkconfig iptables off (This is to ensure that it does not start again on reboot)
The address and the base port on which the dfs NameNode Web UI will listen.
3. scp core-site.xml and hdfs-site.xml to slave.hadoop.com at /etc/hadoop/conf/
[root@slave ~]# sudo service hadoop-yarn-resourcemanager start
starting resourcemanager, logging to /var/log/hadoop-yarn/yarn-yarn-resourcemanager-slave.hadoop.com.out
Started Hadoop resourcemanager: [ OK ]
[root@slave ~]# sudo service hadoop-yarn-nodemanager start
starting nodemanager, logging to /var/log/hadoop-yarn/yarn-yarn-nodemanager-slave.hadoop.com.out
Started Hadoop nodemanager: [ OK ]
[root@master conf]# sudo service hadoop-mapreduce-historyserver start
starting historyserver, logging to /var/log/hadoop-mapreduce/mapred-mapred-historyserver-master.hadoop.com.out
15/04/25 02:56:01 INFO hs.JobHistoryServer: STARTUP_MSG:
STARTUP_MSG: Starting JobHistoryServer
STARTUP_MSG: host = master.hadoop.com/192.168.111.130
STARTUP_MSG: args = 
STARTUP_MSG: version = 2.6.0-cdh5.4.0
STARTUP_MSG: classpath =
STARTUP_MSG: build = http://github.com/cloudera/hadoop -r c788a14a5de9ecd968d1e2666e8765c5f018c271; compiled by 'jenkins' on 2015-04-21T19:18Z
STARTUP_MSG: java = 1.7.0_79
Started Hadoop historyserver: [ OK ]
Hey guys, hope you are all doing well. Today I’m going to explain how to install and configure Transparent Data Encryption in Hadoop (HDP 2.2).
Why do we need Transparent Encryption in HDFS?
Often we want to encrypt only selected files/directories in HDFS, to save on overhead and protect performance. This is now possible with HDFS Transparent Data Encryption (TDE).
HDFS TDE allows users to take advantage of HDFS native data encryption without any application code changes.
Once an HDFS admin sets up encryption, HDFS takes care of the actual encryption/decryption without the end-user having to manually encrypt/decrypt a file.
Building blocks of TDE (Transparent Data Encryption)
Encryption Zone (EZ): An HDFS admin creates an encryption zone and links it to an empty HDFS directory and an encryption key. Any files put in the directory are automatically encrypted by HDFS.
Key Management Server (KMS): The KMS is responsible for storing encryption keys. It provides a REST API and access control on the keys it stores.
Key Provider API: The Key Provider API is the glue used by the HDFS NameNode and client to connect with the Key Management Server.
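To make the encryption zone idea concrete, here is a rough sketch of how a key and a zone are typically created from the command line once the KMS is configured. The key name mykey and the directory /secure are placeholders, and the run wrapper is a made-up helper that only prints the commands by default, so the sketch can be read and tried without a cluster.

```shell
# Sketch only: run and create_encryption_zone are hypothetical helpers;
# mykey and /secure are placeholder names.

# Print the command in dry-run mode (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: create_encryption_zone KEY_NAME HDFS_DIR
create_encryption_zone() {
    key="$1"
    dir="$2"
    run hadoop key create "$key"          # encryption zone key, stored in the KMS
    run hdfs dfs -mkdir -p "$dir"         # the zone must start as an empty directory
    run hdfs crypto -createZone -keyName "$key" -path "$dir"
    run hdfs crypto -listZones            # confirm that the zone exists
}

create_encryption_zone mykey /secure
```

Set DRY_RUN=0 to actually run the commands. Creating the zone is an HDFS admin operation, so on a real cluster those commands would be run as the HDFS superuser.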
Accessing data within an encryption zone
EZ Key – Encryption zone key
DEK – Data encryption key
EDEK – Encrypted DEK
When creating a new file in an encryption zone, the NameNode asks the KMS to generate a new EDEK encrypted with the encryption zone’s key. The EDEK is then stored persistently as part of the file’s metadata on the NameNode.
When reading a file within an encryption zone, the NameNode provides the client with the file’s EDEK and the encryption zone key version used to encrypt the EDEK. The client then asks the KMS to decrypt the EDEK, which involves checking that the client has permission to access the encryption zone key version. Assuming that is successful, the client uses the DEK to decrypt the file’s contents.
All of the above steps for the read and write path happen automatically through interactions between the DFSClient, the NameNode, and the KMS.
Access to encrypted file data and metadata is controlled by normal HDFS filesystem permissions. This means that if HDFS is compromised (for example, by gaining unauthorized access to an HDFS superuser account), a malicious user only gains access to ciphertext and encrypted keys. However, since access to encryption zone keys is controlled by a separate set of permissions on the KMS and key store, this does not pose a security threat.
Use Cases of TDE
Data encryption is required by a number of different government, financial, and regulatory entities. For example, the health-care industry has HIPAA regulations, the card payment industry has PCI DSS regulations, and the US government has FISMA regulations. Having transparent encryption built into HDFS makes it easier for organizations to comply with these regulations.
Encryption can also be performed at the application-level, but by integrating it into HDFS, existing applications can operate on encrypted data without changes. This integrated architecture implies stronger encrypted file semantics and better coordination with other HDFS functions.
Step 1: Configure the Key Management Service (KMS)
Extract the Key Management Server bits from the package included in Apache Hadoop