Tag : hadoop

Setup and configure Active Directory server for Kerberos

In this tutorial we will see how to set up and configure an Active Directory server for Kerberos authentication on an HDP cluster.

Environment details used to set up and configure the Active Directory server for Kerberos:

Windows Server – 2012 R2

HDP Cluster – 2.6.X

Ambari – 2.5.X

Let’s get started! When you have nothing configured, this is what Server Manager looks like:




Step 1 – Configure hostname for your AD.

Let’s begin by configuring a relevant hostname for your Active Directory server.

To change the computer name, open Server Manager –> Click on Local Server in the left pane –> Click on Computer name –> Write a Computer description (optional) –> Click on the “Change” button –> Type in your new computer name –> Save the changes and restart your computer (Yeah! It’s frustrating to restart every time you change something in Windows, but that’s how it is :( )


crazyadmins.com – Change computer name


Step 2 – Configure static IP:

Open PowerShell and run “ipconfig /all” to get the existing IP address, gateway and DNS server addresses. Note down the current configuration.


crazyadmins.com – current_ip_configs


Open Control Panel –> Network and Internet –> Network and Sharing Center –> Click on Ethernet –> Click on Properties –> Go to Internet Protocol Version 4 –> Click on Properties –> Click on “Use the following IP address” and enter the information noted in the previous step.

Please refer to the screenshot below.

crazyadmins.com – change_ip


Step 3 – Install DNS

Open Server Manager –> Click on “Add Roles and Features” –> Click Next –> Select “Role-based or feature-based installation” –> Select a server from server-pool (Your AD server’s hostname should get displayed) –> Select “DNS Server” –> Click on “Add Features” –> Click on Next button –> Click on Next button –> Click on Install –> Finally click on Close button once installation is complete.

Please refer to the series of screenshots below:














Step 4 – Configure domain controller

Please open Server Manager –> Click on Add Roles and Features under Dashboard –> Click on Next –> Select “Role-based or feature-based installation” and click on Next –> Keep the default selection and click on Next –> Tick “Active Directory Domain Services” –> Click on “Add Features” –> Click on Next –> Keep the default selection for “Select one or more features to install” –> Click on Next –> Next –> Click on Install on the “Confirm installation selections” page –> Once the installation is done, you can close the window.

Please refer to the screenshots below if required:





















Step 5 – Promote the server to a domain controller

Click on the flag icon showing a yellow warning sign at the top right –> Click on “Promote this server to a domain controller” –> In Deployment Configuration, click on “Add a new forest” –> Set the DSRM administrator password –> Click Next –> Verify the NetBIOS name and change it if needed (I did not change it in my case) –> Keep the location of the AD DS database/log files at the default value –> Review and click Next –> Make sure that all the prerequisite checks have passed –> Click on Install –> Close the window after installation.

Please refer to the screenshots below if required:





























Step 6 – Configure LDAPS for AD

The first step is to install Active Directory Certificate Services (AD CS).

Please follow the steps below to install AD CS:

Click on Server Manager –> Add roles and features –> Next –> On the “Select installation type” page, make sure to select Role-based or feature-based installation –> Next –> Select the server on the destination server page –> Select “Active Directory Certificate Services” and click on Add Features –> Next –> Next –> Next –> Please ensure that “Certification Authority” is selected on the “Select role services” page –> Next –> Install –> Close.

Please follow the steps below to configure AD CS:

Click on the Notification icon on the Server Manager Dashboard –> Click on “Configure Active Directory Certificate Services on the destination server” –> Please ensure that the default user is a member of the administrator group (Screenshot – Step 1) –> Next –> Select “Certification Authority” on the Select Role Services page (Screenshot – Step 2) –> Next –> Select “Enterprise CA” on Setup Type (Screenshot – Step 3) –> Next –> Select “Root CA” on the Specify the type of CA page (Screenshot – Step 4) –> Next –> Create a new private key (Screenshot – Step 5) –> Next –> Keep the default options for “Cryptography for CA” (Screenshot – Step 6) –> Next –> Specify the name of the CA as per your requirement (Screenshot – Step 7) –> Next –> Set the validity period (keep the default of 5 years) –> Next –> Specify the “Certificate database location” & “Certificate database log location” (Screenshot – Step 8) –> Click on Configure –> Close (Screenshot – Step 9)

Please refer to the screenshots below:

Step 1:



Step 2:



Step 3:



Step 4:



Step 5:



Step 6:



Step 7:




Step 8:




Step 9:




Step 7 – Import the AD certificate to Linux host(s)

Install the OpenLDAP client tools:

sudo yum -y install openldap-clients ca-certificates


Add the AD certificate to your Linux host(s) (writing to the anchors directory requires root, hence the sudo):

openssl s_client -connect <AD-server-FQDN>:636 <<<'' | sudo openssl x509 -out /etc/pki/ca-trust/source/anchors/ad.crt


Update the CA trust store:

sudo update-ca-trust force-enable
sudo update-ca-trust extract
sudo update-ca-trust check

Configure your AD server to be trusted:

sudo tee -a /etc/openldap/ldap.conf > /dev/null << EOF
TLS_CACERT /etc/pki/tls/cert.pem
URI ldaps://<your-ad-server-fqdn> ldap://<your-ad-server-fqdn>
BASE dc=<your-dc>,dc=<your-dc>
EOF


Test connection to AD using openssl client:

openssl s_client -connect <ad-server-fqdn>:636 </dev/null
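As a small illustration of the BASE line written to ldap.conf, here is a minimal bash sketch of how a DNS domain maps to an LDAP base DN. The function name domain_to_base_dn and the domain hadoop.example.com are hypothetical and only used for illustration; substitute your own AD domain.

```shell
# Hypothetical helper: convert a DNS domain (e.g. "example.com")
# into an LDAP base DN (e.g. "dc=example,dc=com").
domain_to_base_dn() {
  local domain="$1" dn="" part
  local IFS='.'                 # split the domain on dots
  for part in $domain; do
    dn="${dn:+$dn,}dc=$part"    # prepend a comma once dn is non-empty
  done
  printf '%s\n' "$dn"
}

# Example with an assumed domain name:
domain_to_base_dn "hadoop.example.com"   # prints dc=hadoop,dc=example,dc=com
```

The resulting DN is what you would put after BASE, and it can also be passed as the search base (the -b option) when querying AD with ldapsearch over ldaps://.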


Step 8 – Configure Kerberos with AD using Ambari

Please follow the Hortonworks documentation below to configure Kerberos with AD using Ambari.



Please comment if you have any feedback/questions/suggestions. Happy Hadooping!! :)


Hadoop v3 Offerings

This blog covers some of the new features that Hadoop v3 has to offer to existing and new Hadoop customers. It’s a nice idea to familiarize yourself with these features, so that in case you want to move to Hadoop or upgrade your cluster from an older version, you will know what you can try and experiment with on your cluster!!!

I will cover installation and upgrade to Hadoop v3 in separate blogs, as this one focuses strictly on the features of Hadoop v3.

The Overview:

So, let’s have a look at the history of Hadoop version 3, which was released at the end of last year, on 13 December 2017. What a nice Christmas surprise for the community!!! All thanks to the hard-working committers for their dedication in making this happen.

As per Apache Hadoop website, the timeline of v3 version looks like this:

And the progress chart of Hadoop v3 looks like this:

After four alpha releases and one beta release, 3.0.0 is generally available. 3.0.0 consists of 302 bug fixes, improvements, and other enhancements since 3.0.0-beta1. All together, 6242 issues were fixed as part of the 3.0.0 release series since 2.7.0.

If you are keen on details about the JIRA issues reported and addressed, you can have a look at the links below:

  1. Changes
  2. Release Notes

The salient features of Hadoop v3:

Now that we have taken a look at the history, let me jot down some of the features introduced as part of this new release:

  1. Minimum required Java version increased from Java 7 to Java 8
  2. Support for erasure coding in HDFS
  3. YARN Timeline Service v.2
  4. Shell script rewrite
  5. Shaded client jars
  6. Support for Opportunistic Containers and Distributed Scheduling
  7. MapReduce task-level native optimization
  8. Support for more than 2 NameNodes
  9. Default ports of multiple services have been changed
  10. Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors
  11. Intra-data node balancer
  12. Reworked daemon and task heap management
  13. S3Guard: Consistency and Metadata Caching for the S3A filesystem client
  14. HDFS Router-Based Federation
  15. The API-based configuration of Capacity Scheduler queue configuration
  16. YARN Resource Types

Now, I will cover details of the features that are on my favourite list, to help readers understand them technically. Note: I can’t cover in-depth details of each feature here, as that would make the blog clumsy and boring, which I don’t want at all.

  1. Hadoop Erasure Coding: Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication. Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature. To understand more about this feature you can refer to the listed link:
  2. NameNode HA with more than 2 nodes: With this feature, a customer can run more than two NameNodes, one active and the rest standby. Earlier releases supported HA with a single active and a single standby NameNode, which tolerates the failure of only one NameNode. To achieve a higher degree of fault tolerance, a customer can now configure NameNode HA with more than two NameNodes, using the Quorum Journal Manager for fencing.
  3. Changes in default ports of multiple services: With this feature, the default ports of Hadoop services such as the NameNode, Secondary NameNode, DataNode, and KMS have been moved out of the Linux ephemeral port range (32768-61000). In earlier versions, having these service ports in the ephemeral range sometimes caused conflicts with other applications and created problems at service startup.
  4. Intra-DataNode balancer: Remember the command below, used to rebalance the Hadoop cluster when we add new DataNodes or perform other admin tasks. A single DataNode manages multiple disks, and adding or replacing disks can lead to significant skew within a DataNode. This situation was not handled by the earlier HDFS balancer utility, which concerns itself with inter-, not intra-, DataNode skew. The new feature takes care of this by balancing data across the disks within a DataNode, and it is invoked through the hdfs diskbalancer CLI.

    hdfs balancer

  5. HDFS Router-Based Federation: HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. This is similar to the existing ViewFs and HDFS Federation functionality, except that the mount table is managed on the server side by the routing layer rather than on the client.
  6. YARN Timeline Service v.2: Timeline Service v.2 addresses two major challenges: improving the scalability and reliability of the Timeline Service, and enhancing usability by introducing flows and aggregation, which were lacking in the earlier version.
  7. YARN Resource Types: This feature enables user-defined countable resources, with which a Hadoop cluster admin can define resources such as GPUs, software licenses or locally attached storage, in addition to the CPU and memory that were already supported in earlier releases.
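To make the space-saving numbers from the erasure coding item concrete, here is a minimal bash sketch of the arithmetic. The function names are hypothetical, and the overhead is expressed as a percentage of the raw data size so that plain integer arithmetic is enough.

```shell
# Hypothetical helper: storage overhead, in percent of the raw data size,
# for an erasure-coding layout with the given data and parity unit counts.
ec_overhead_pct() {
  local data_units="$1" parity_units="$2"
  echo $(( (data_units + parity_units) * 100 / data_units ))
}

# Same metric for plain N-way replication.
replication_overhead_pct() {
  local replicas="$1"
  echo $(( replicas * 100 ))
}

ec_overhead_pct 10 4          # prints 140, i.e. 1.4x
replication_overhead_pct 3    # prints 300, i.e. 3x
```

With the same function, a (6,3) layout works out to 150, i.e. 1.5x, which is still half the footprint of 3x replication.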

Tools/Information used for writing this blog:

  1. Sketching
  2. Timeline Graphs
  3. Information extracted from Apache Hadoop


Please feel free to comment if you need any further help on this. Happy Hadooping!!  :)


Tune Hadoop Cluster to get Maximum Performance (Part 2)

In the previous part we saw how to tune the operating system to get maximum performance for Hadoop. In this article I will focus on how to tune the Hadoop cluster itself to get a performance boost at the Hadoop level :-)



Tune Hadoop Cluster to get Maximum Performance (Part 2) – http://crazyadmins.com


Before I actually start explaining tuning parameters let me cover some basic terms that are required to understand Hadoop level tuning.


What is YARN?

YARN – Yet Another Resource Negotiator. This is the resource management layer introduced with MapReduce version 2, with many new features such as dynamic memory assignment for mappers and reducers rather than fixed slots.


What is Container?

A container represents allocated resources such as CPU and RAM. For MapReduce on YARN, each container is a JVM process: the AppMaster, mappers and reducers all run inside containers.


Let’s get into the game now:


1. The Resource Manager (RM) is responsible for allocating resources to MapReduce jobs.

2. On a brand new Hadoop cluster (without any tuning), each NodeManager offers only 8192 MB (“yarn.nodemanager.resource.memory-mb”) of memory per node to the Resource Manager.

3. The RM can allocate up to 8192 MB (“yarn.scheduler.maximum-allocation-mb”) to a single container, such as the Application Master container.

4. The default minimum allocation is 1024 MB (“yarn.scheduler.minimum-allocation-mb”).

5. The AM can only negotiate resources from the Resource Manager in increments of (“yarn.scheduler.minimum-allocation-mb”), and a request cannot exceed (“yarn.scheduler.maximum-allocation-mb”).

6. The values of (“mapreduce.map.memory.mb”) & (“mapreduce.reduce.memory.mb”) are rounded up to a value divisible by (“yarn.scheduler.minimum-allocation-mb”).
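The rounding described in points 5 and 6 can be sketched with a few lines of bash. This is only an illustration under the assumption that requests are rounded up to the next multiple of the minimum allocation and that anything above the maximum is refused; the function name normalize_request_mb is hypothetical and the exact behaviour depends on the scheduler in use.

```shell
# Hypothetical helper: round a memory request (MB) up to the next multiple
# of the minimum allocation. Requests above the maximum are refused.
normalize_request_mb() {
  local request="$1" min_alloc="$2" max_alloc="$3"
  if [ "$request" -gt "$max_alloc" ]; then
    echo "request of ${request} MB exceeds maximum of ${max_alloc} MB" >&2
    return 1
  fi
  local rounded=$(( (request + min_alloc - 1) / min_alloc * min_alloc ))
  # A request smaller than the minimum still gets the minimum allocation.
  if [ "$rounded" -lt "$min_alloc" ]; then
    rounded="$min_alloc"
  fi
  echo "$rounded"
}

normalize_request_mb 1500 1024 8192   # prints 2048
normalize_request_mb 1024 1024 8192   # prints 1024
```

So with the default settings, asking for 1500 MB for a mapper actually costs you a 2048 MB container, which is worth remembering when you calculate how many containers fit on a node.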


What are these properties? What can we tune?



yarn.scheduler.minimum-allocation-mb

Default value is 1024 MB

Sets the minimum size of container that YARN will allow for running MapReduce jobs.




yarn.scheduler.maximum-allocation-mb

Default value is 8192 MB

Sets the largest size of container that YARN will allow for running MapReduce jobs.




yarn.nodemanager.resource.memory-mb

Default value is 8 GB

Total amount of physical memory (RAM) available for containers on a worker node.

Set this property = Total RAM – (RAM for OS + Hadoop daemons + other services)
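Here is the above formula as a minimal bash sketch. The function name and the example reservation sizes (8 GB for the OS, 4 GB for Hadoop daemons, 2 GB for other services on a 64 GB node) are assumptions picked only for illustration; use the numbers from your own nodes.

```shell
# Hypothetical helper: memory (MB) left for YARN containers on a worker node.
# Arguments: total RAM, RAM for the OS, RAM for Hadoop daemons, RAM for other services.
container_memory_mb() {
  local total="$1" os="$2" daemons="$3" other="$4"
  echo $(( total - (os + daemons + other) ))
}

# Assumed example: 64 GB node with 8 GB + 4 GB + 2 GB reserved.
container_memory_mb 65536 8192 4096 2048   # prints 51200
```

The printed value is what you would set for yarn.nodemanager.resource.memory-mb on that node.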




yarn.nodemanager.vmem-pmem-ratio

Default value is 2.1

Controls the amount of virtual memory that each container is allowed to use, relative to its physical memory.

The limit can be calculated as: containerMemoryRequest * vmem-pmem-ratio
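A quick worked example of this calculation, as a minimal bash sketch. The function name vmem_limit_mb is hypothetical; awk is used because the ratio is not an integer, and the result is truncated to whole megabytes for readability.

```shell
# Hypothetical helper: virtual memory limit (MB) for a container,
# i.e. containerMemoryRequest * vmem-pmem-ratio, truncated to whole MB.
vmem_limit_mb() {
  local request_mb="$1" ratio="$2"
  awk -v m="$request_mb" -v r="$ratio" 'BEGIN { printf "%d\n", m * r }'
}

vmem_limit_mb 1024 2.1   # prints 2150
vmem_limit_mb 2048 2.1   # prints 4300
```

So a 1 GB container with the default ratio may use a little over 2 GB of virtual memory before it is considered over the limit.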




mapreduce.map.memory.mb & mapreduce.reduce.memory.mb

These are the hard limits enforced by Hadoop on each mapper or reducer task (the maximum memory that can be assigned to a mapper’s or reducer’s container).

Default value – 1 GB




mapreduce.map.java.opts & mapreduce.reduce.java.opts

The heap size of the JVM (-Xmx) for the mapper or reducer task.

This value should always be lower than mapreduce.[map|reduce].memory.mb.

Recommended value is 80% of mapreduce.map.memory.mb / mapreduce.reduce.memory.mb
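The 80% rule is easy to script. Below is a minimal bash sketch; the function name heap_opts_for is hypothetical and the 80% factor is simply the recommendation from this article, not a value enforced by Hadoop.

```shell
# Hypothetical helper: derive a -Xmx setting that is 80% of the
# container size (MB) given as the first argument.
heap_opts_for() {
  local container_mb="$1"
  echo "-Xmx$(( container_mb * 80 / 100 ))m"
}

heap_opts_for 1024   # prints -Xmx819m
heap_opts_for 4096   # prints -Xmx3276m
```

The remaining 20% is left for non-heap JVM memory, so that the container does not cross its hard limit.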




yarn.app.mapreduce.am.resource.mb

The amount of memory for the ApplicationMaster container.




yarn.app.mapreduce.am.command-opts

The heap size (-Xmx) for the ApplicationMaster JVM.




yarn.nodemanager.resource.cpu-vcores

The number of cores that a NodeManager can allocate to containers is controlled by the yarn.nodemanager.resource.cpu-vcores property. It should be set to the total number of cores on the machine, minus one core for each daemon process running on the machine (DataNode, NodeManager, and any other long-running processes).
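The same rule as a minimal bash sketch. The function name available_vcores and the example numbers (16 cores, 2 daemons) are assumptions for illustration only.

```shell
# Hypothetical helper: cores left for containers after reserving
# one core per long-running daemon on the machine.
available_vcores() {
  local total_cores="$1" daemon_count="$2"
  local vcores=$(( total_cores - daemon_count ))
  # Never go below one vcore, otherwise no container could be scheduled.
  if [ "$vcores" -lt 1 ]; then
    vcores=1
  fi
  echo "$vcores"
}

# Assumed example: 16 cores, DataNode + NodeManager running on the node.
available_vcores 16 2   # prints 14
```

On a Linux worker node you could feed it the real core count, for example available_vcores "$(nproc)" 2.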




mapreduce.task.io.sort.mb

Default value – 100 MB

This is a very important property to tune. While a map task is in progress, it writes its output into a circular in-memory buffer. The size of this buffer is fixed and determined by the mapreduce.task.io.sort.mb property (io.sort.mb in older releases).

When this circular in-memory buffer fills up to a threshold (mapreduce.map.sort.spill.percent: 80% by default), spilling to disk starts in parallel, using a separate thread. Note that if the spilling thread is too slow and the buffer becomes 100% full, the map task cannot write any more output and has to wait.
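To get a feel for how the buffer size affects spilling, here is a rough bash sketch. It is a simplified model under the assumption that every spill writes exactly one threshold's worth of data; it ignores record metadata and combiners, and the function names are hypothetical.

```shell
# Buffer fill level (MB) at which spilling to disk starts.
spill_threshold_mb() {
  local sort_mb="$1" spill_pct="$2"
  echo $(( sort_mb * spill_pct / 100 ))
}

# Rough estimate of the number of spill files for a map output of the given size (MB).
estimated_spills() {
  local output_mb="$1" sort_mb="$2" spill_pct="$3"
  local threshold
  threshold=$(spill_threshold_mb "$sort_mb" "$spill_pct")
  echo $(( (output_mb + threshold - 1) / threshold ))
}

spill_threshold_mb 100 80     # prints 80
estimated_spills 400 100 80   # prints 5
estimated_spills 400 512 80   # prints 1
```

In this model, a map task producing 400 MB of output spills five times with the default 100 MB buffer, but only once with a 512 MB buffer, which is why this property is worth tuning.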




io.file.buffer.size

Hadoop uses a buffer size of 4 KB by default for its I/O operations. We can increase it to 128 KB to get better performance by setting io.file.buffer.size=131072 (value in bytes) in core-site.xml.




dfs.client.read.shortcircuit

Short-circuit reads – When reading a file from HDFS, the client contacts the DataNode and the data is sent to the client via a TCP connection. If the block being read is on the same node as the client, it is more efficient for the client to bypass the network and read the block data directly from the disk.

We can enable short-circuit reads by setting this property to “true”.




mapreduce.task.io.sort.factor

Default value is 10.

Now imagine a running map task: each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there could be several spill files. Before the task finishes, the spill files are merged into a single partitioned and sorted output file.

The configuration property mapreduce.task.io.sort.factor controls the maximum number of streams to merge at once.
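A simplified way to see the effect of the merge factor is to count how many merge rounds are needed for a given number of spill files. The bash sketch below assumes that each round merges up to "factor" files into one; Hadoop's real merge logic is smarter than this, and the function name merge_passes is hypothetical.

```shell
# Hypothetical helper: number of merge rounds needed to reduce the given
# number of spill files to one, merging at most "factor" files at a time.
merge_passes() {
  local files="$1" factor="$2" passes=0
  if [ "$factor" -lt 2 ]; then
    echo "merge factor must be at least 2" >&2
    return 1
  fi
  while [ "$files" -gt 1 ]; do
    files=$(( (files + factor - 1) / factor ))
    passes=$(( passes + 1 ))
  done
  echo "$passes"
}

merge_passes 40 10    # prints 2
merge_passes 40 100   # prints 1
```

Fewer rounds mean the intermediate data is read and written fewer times, at the cost of more open streams and memory during the merge.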




mapreduce.reduce.shuffle.parallelcopies

Default value is 5

The map output file sits on the local disk of the machine that ran the map task.

The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each one completes.

The reduce task has a small number of copier threads so that it can fetch map outputs in parallel.

The default is five threads, but this number can be changed by setting the mapreduce.reduce.shuffle.parallelcopies property.




I tried my best to cover as much as I can, but there are plenty of other things you can do for tuning! I hope this article was helpful to you. What I recommend is to tune the above properties based on the total available memory, the total number of cores etc., then run benchmarking tools such as TeraGen and TeraSort to measure the results, and keep tuning until you get the best out of your cluster!! :-)

