
Install and Configure Transparent Data Encryption in Hadoop HDP 2.2

Hey guys, hope you all are doing well :-) Today I’m going to explain how to install and configure Transparent Data Encryption in Hadoop – HDP 2.2.


Why do we need Transparent Encryption in HDFS?


  • Sometimes we want to encrypt only selected files/directories in HDFS to save on overhead and protect performance – this is now possible with HDFS Transparent Data Encryption (TDE).


  • HDFS TDE allows users to take advantage of HDFS native data encryption without any application code changes.


  • Once an HDFS admin sets up encryption, HDFS takes care of the actual encryption/decryption without the end-user having to manually encrypt/decrypt a file.



Building blocks of TDE (Transparent Data Encryption)


  1. Encryption Zone (EZ): An HDFS admin creates an encryption zone and links it to an empty HDFS directory and an encryption key. Any files put in the directory are automatically encrypted by HDFS.


  2. Key Management Server (KMS): The KMS is responsible for storing encryption keys. It provides a REST API and access control on the keys it stores.


  3. Key Provider API: The Key Provider API is the glue used by the HDFS NameNode and client to connect to the Key Management Server.


Accessing data within an encryption zone


EZ Key – Encryption zone key

DEK – Data encryption key

EDEK – Encrypted data encryption key


  • When creating a new file in an encryption zone, the NameNode asks the KMS to generate a new EDEK encrypted with the encryption zone’s key. The EDEK is then stored persistently as part of the file’s metadata on the NameNode.


  • When reading a file within an encryption zone, the NameNode provides the client with the file’s EDEK and the encryption zone key version used to encrypt the EDEK. The client then asks the KMS to decrypt the EDEK, which involves checking that the client has permission to access the encryption zone key version. Assuming that is successful, the client uses the DEK to decrypt the file’s contents.


  • All of the above steps for the read and write path happen automatically through interactions between the DFSClient, the NameNode, and the KMS.


  • Access to encrypted file data and metadata is controlled by normal HDFS filesystem permissions. This means that if HDFS is compromised (for example, by gaining unauthorized access to an HDFS superuser account), a malicious user only gains access to ciphertext and encrypted keys. However, since access to encryption zone keys is controlled by a separate set of permissions on the KMS and key store, this does not pose a security threat.
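The write and read paths above can be sketched as a toy simulation. Note this is purely illustrative and not the real HDFS/KMS API: XOR stands in for AES, plain dicts stand in for the KMS key store, NameNode metadata, and DataNode blocks, and all names are made up for the sketch.

```python
import os

def xor(data: bytes, key: bytes) -> bytes:
    """Toy 'cipher': XOR each byte with the (repeated) key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

kms_keys = {"key1": os.urandom(16)}   # EZ key, known only to the KMS
namenode_meta = {}                    # per-file EDEK, persisted by the NameNode
datanode_blocks = {}                  # encrypted file contents on the DataNodes

def write_file(path, ez_key_name, plaintext):
    dek = os.urandom(16)                          # KMS generates a fresh DEK...
    edek = xor(dek, kms_keys[ez_key_name])        # ...and hands back only the EDEK
    namenode_meta[path] = (ez_key_name, edek)     # NameNode stores the EDEK in metadata
    datanode_blocks[path] = xor(plaintext, dek)   # client encrypts with the DEK

def read_file(path):
    ez_key_name, edek = namenode_meta[path]       # NameNode returns the file's EDEK
    dek = xor(edek, kms_keys[ez_key_name])        # KMS decrypts EDEK -> DEK (after access check)
    return xor(datanode_blocks[path], dek)        # client decrypts the data with the DEK

write_file("/secureweblogs/web.log", "key1", b"this is web log")
print(read_file("/secureweblogs/web.log"))        # round-trips to the plaintext
```

The point of the indirection is visible here: the NameNode only ever sees the EDEK (ciphertext of the key), so a compromised NameNode yields neither the DEK nor the file contents.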



Use Cases of TDE


  • Data encryption is required by a number of different government, financial, and regulatory entities. For example, the health-care industry has HIPAA regulations, the card payment industry has PCI DSS regulations, and the US government has FISMA regulations. Having transparent encryption built into HDFS makes it easier for organizations to comply with these regulations.


  • Encryption can also be performed at the application-level, but by integrating it into HDFS, existing applications can operate on encrypted data without changes. This integrated architecture implies stronger encrypted file semantics and better coordination with other HDFS functions.


Installation steps:


Step 1: Configure the Key Management Service (KMS)


Extract the Key Management Server bits from the package included in Apache Hadoop


# mkdir -p /usr/kms-demo
# cp /usr/hdp/current/hadoop-client/mapreduce.tar.gz /usr/kms-demo/
# export KMS_ROOT=/usr/kms-demo

Here KMS_ROOT refers to the directory where mapreduce.tar.gz will be extracted (/usr/kms-demo).

# cd $KMS_ROOT
# tar -xvf mapreduce.tar.gz


Start the Key Management Server

Appendix A covers advanced configuration of the Key Management Server. The following basic scenario uses the default configurations:

# cd $KMS_ROOT/hadoop/sbin/
# ./kms.sh run

You’ll see the following console output on a successful start: 

Jan 10, 2015 11:07:33 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 1764 ms


Step 2: Configure Hadoop to use the KMS as the key provider

Hadoop configuration can be managed either through Ambari or by editing the XML configuration files directly. This post uses the Ambari approach.

Configure Hadoop to use KMS using Ambari

You can use Ambari to configure this in the HDFS configuration section.

Log in to Ambari through your web browser (default credentials admin/admin):



On the Ambari Dashboard, click HDFS service and then the “Configs” tab.




Add the following custom property so that Hadoop key management and the HDFS encryption zone feature can find the right KMS key provider:


  •       Custom core-site

Add the property "hadoop.security.key.provider.path" with the value "kms://http@localhost:16000/kms"




Make sure the host in kms://http@localhost:16000/kms matches the node where you started the KMS.
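For reference, if you edit the configuration files directly instead of using Ambari, the same property would look like this in core-site.xml (a minimal sketch; adjust the host and port to your KMS):

```xml
<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://http@localhost:16000/kms</value>
</property>
```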

Step 3: Save the configuration and restart HDFS after setting these properties.


Create Encryption Keys

Log into the Sandbox as the hdfs superuser. Run the following commands to create a key named "key1" with a length of 256 bits and show the result:

# su - hdfs

# hadoop key create key1 -size 256

# hadoop key list -metadata

As an Admin, Create an Encryption Zone in HDFS


Run the following commands to create an encryption zone under /secureweblogs with zone key named “key1” and show the results:

# hdfs dfs -mkdir /secureweblogs

# hdfs crypto -createZone -keyName key1 -path /secureweblogs

# hdfs crypto -listZones

Note: The crypto command requires HDFS superuser privileges.

Please refer to the Implementation & Testing section below to see how to test our setup.



Implementation & Testing


Below are the results of implementing and testing TDE on the Hortonworks Sandbox!

Details – Installed and configured the KMS, then created two encryption zones in HDFS:


  1. /secureweblogs with EZ key key1
  2. /Huezone with EZ key hueKey



[hdfs@sandbox conf]$ hdfs crypto -listZones

/secureweblogs key1

/Huezone       hueKey



As the ‘hive’ user, you can transparently write data to that directory.


Output :

[hive@sandbox ~]# hdfs dfs -copyFromLocal web.log /secureweblogs

[hive@sandbox ~]# hdfs dfs -ls /secureweblogs

Found 1 items

-rw-r--r--   1 hive hive       1310 2015-01-11 23:28 /secureweblogs/web.log


As the ‘hive’ user, you can transparently read data from that directory, and verify that the exact file that was loaded into HDFS is readable in its unencrypted form.



[hive@sandbox ~]# hdfs dfs -copyToLocal /secureweblogs/web.log read.log

[hive@sandbox ~]# diff web.log read.log

[hive@sandbox conf]$ hadoop fs -cat /secureweblogs/web.log

this is web log

[hive@sandbox conf]$


Other users will not be able to write data or read from the encrypted zone:


[root@sandbox conf]# hadoop fs -cat /secureweblogs/web.log

cat: Permission denied: user=root, access=EXECUTE, inode="/secureweblogs":hive:hive:drwxr-x---



The hdfs superuser has access to the raw namespace; however, it can only see the encrypted contents, not the actual data.


[hdfs@sandbox conf]$ whoami


[hdfs@sandbox conf]$ hadoop fs -cat /.reserved/raw/secureweblogs/web.log

▒5     ▒0▒7s▒9▒]i▒▒▒

[hdfs@sandbox conf]$



Pros & Cons



Pros:

  1. No need to encrypt the entire disk
  2. No application code changes required
  3. Encryption at rest

Cons:

  1. No “in-transit” encryption in HDP yet.
  2. The key management is rudimentary and not scalable for production.




