Hadoop v3 Offerings
This blog covers some of the new features that Hadoop v3 offers to existing and new Hadoop customers. It is a good idea to familiarize yourself with these features so that, if you want to move to Hadoop or upgrade your cluster from an older version, you know what you can try and experiment with on your cluster!
I will cover installation and upgrade to Hadoop v3 in separate blogs, as this one focuses strictly on the features of Hadoop v3.
So, let’s have a look at the history of Hadoop version 3, which was released at the end of last year, on 13 December 2017. What a nice Christmas surprise for the community! All thanks to the hard-working committers for their dedication to making this happen.
As per the Apache Hadoop website, the timeline of the v3 release looks like this:
And the progress chart of Hadoop v3 looks like this:
After four alpha releases and one beta release, 3.0.0 is generally available. 3.0.0 consists of 302 bug fixes, improvements, and other enhancements since 3.0.0-beta1. All together, 6242 issues were fixed as part of the 3.0.0 release series since 2.7.0.
If you are keen on more details about the JIRA issues reported and addressed, you can have a look at the link provided below:
The salient features of Hadoop v3:
As we have already taken a look at the history, let me jot down some of the features introduced as part of this new release:
- Minimum required Java version increased from Java 7 to Java 8
- Support for erasure coding in HDFS
- YARN Timeline Service v.2
- Shell script rewrite
- Shaded client jars
- Support for Opportunistic Containers and Distributed Scheduling
- MapReduce task-level native optimization
- Support for more than 2 NameNodes
- Default ports of multiple services have been changed
- Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors
- Intra-DataNode balancer
- Reworked daemon and task heap management
- S3Guard: Consistency and Metadata Caching for the S3A filesystem client
- HDFS Router-Based Federation
- API-based configuration of Capacity Scheduler queues
- YARN Resource Types
Now I will cover the details of the features on my favourite list, to help readers understand them technically. Note: I can’t cover each feature in depth here, as that would make the blog clumsy and boring, which I don’t want at all.
- Hadoop Erasure Coding: Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication. Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature. To understand more about this feature, you can refer to the listed link:
- NameNode HA with more than 2 nodes: With this feature, a customer can run more than two NameNodes, with one active and the others on standby. Earlier releases supported NameNode HA with a single active and a single standby NameNode, which tolerates the failure of only one NameNode. To achieve a higher degree of fault tolerance, a customer can now configure HA with more than two NameNodes, together with the Quorum Journal Manager for fencing.
- Changes in default ports of multiple services: With this change, the default ports of Hadoop services such as the NameNode, Secondary NameNode, DataNode, and KMS have been moved out of the Linux ephemeral port range (32768-61000). In earlier versions, having these service ports in the ephemeral port range sometimes caused conflicts with other applications and created problems at service startup.
- Intra-DataNode balancer: Remember the HDFS balancer command that we run when we add new data nodes to the cluster, or for other admin-specific tasks? It spreads data evenly across DataNodes. However, adding or replacing disks can lead to significant skew within a single DataNode. This situation was not handled by the HDFS balancer utility in earlier versions of Hadoop, which concerns itself with inter-, not intra-, DataNode skew. The new feature takes care of this and can balance data across the disks within a DataNode.
- HDFS Router-Based Federation: HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. This is similar to the existing ViewFs and HDFS Federation functionality, except that the mount table is managed on the server side by the routing layer rather than on the client.
- YARN Timeline Service v2: Timeline Service v2 addresses two major challenges: improving the scalability and reliability of the Timeline Service, and enhancing usability by introducing flows and aggregation, which were lacking in the earlier version.
- YARN resource types: This feature enables user-defined countable resources, with which a Hadoop cluster admin can define resources such as GPUs, software licenses, or locally attached storage. This is in addition to CPU and memory, which were already supported in earlier releases.
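To make the erasure coding numbers quoted in this list concrete, here is a minimal Python sketch of the storage-overhead arithmetic. The function names are my own, chosen for illustration only, and are not part of any Hadoop API. The assumption is the usual one for a Reed-Solomon (k, m) scheme: every k data blocks are stored together with m parity blocks, so the raw storage used is (k + m) / k times the logical data size.

```python
def ec_overhead(data_units: int, parity_units: int) -> float:
    """Raw-to-logical storage ratio for a Reed-Solomon (data, parity) scheme."""
    if data_units <= 0 or parity_units < 0:
        raise ValueError("data_units must be positive, parity_units non-negative")
    return (data_units + parity_units) / data_units


def replication_overhead(replication_factor: int) -> float:
    """Raw-to-logical storage ratio for plain N-way replication."""
    if replication_factor <= 0:
        raise ValueError("replication_factor must be positive")
    return float(replication_factor)


if __name__ == "__main__":
    # Compare the two schemes mentioned in the erasure coding item.
    print(f"RS(10,4) overhead:       {ec_overhead(10, 4):.1f}x")
    print(f"3x replication overhead: {replication_overhead(3):.1f}x")
```

For 100 TB of logical data, this works out to roughly 140 TB of raw storage with RS(10,4) versus 300 TB with 3x replication, which is where the space saving comes from.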
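The default port change can be illustrated with a small Python check. This is only a sketch: the ephemeral range is the one quoted in this post (the real range on a given Linux host is set in /proc/sys/net/ipv4/ip_local_port_range), and the old and new port numbers are the ones I recall from the Hadoop 3.0.0 release notes, so please verify them against the release notes for your version before relying on them.

```python
# Ephemeral port range as quoted in this post (an assumption for any given host).
EPHEMERAL_RANGE = (32768, 61000)

# A few old -> new default ports, as recalled from the Hadoop 3.0.0 release notes.
# Treat these as illustrative and check them against your own distribution.
PORT_CHANGES = {
    "NameNode HTTP UI": (50070, 9870),
    "Secondary NameNode HTTP UI": (50090, 9868),
    "DataNode HTTP UI": (50075, 9864),
    "DataNode data transfer": (50010, 9866),
}


def in_ephemeral_range(port: int, port_range=EPHEMERAL_RANGE) -> bool:
    """Return True if the port falls inside the given ephemeral range."""
    low, high = port_range
    return low <= port <= high


if __name__ == "__main__":
    for service, (old, new) in PORT_CHANGES.items():
        print(
            f"{service}: {old} (ephemeral: {in_ephemeral_range(old)}) "
            f"-> {new} (ephemeral: {in_ephemeral_range(new)})"
        )
```

A port inside the ephemeral range can be handed out by the kernel to any outgoing connection, which is why a service that wants to bind to it at startup may find it already taken.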
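To see why intra-DataNode skew matters, consider this small Python sketch. It is not the algorithm used by the HDFS disk balancer; it is a simplified illustration with hypothetical function names and made-up disk sizes, showing how a DataNode can look healthy as a whole while its individual disks are badly unbalanced.

```python
from typing import List, Tuple

Disk = Tuple[int, int]  # (used_gb, capacity_gb)


def node_utilization(disks: List[Disk]) -> float:
    """Overall used/capacity ratio of the DataNode, which is all an inter-node view sees."""
    total_used = sum(used for used, _ in disks)
    total_capacity = sum(capacity for _, capacity in disks)
    return total_used / total_capacity


def intra_node_skew(disks: List[Disk]) -> float:
    """Gap between the fullest and the emptiest disk, as a fraction of capacity."""
    ratios = [used / capacity for used, capacity in disks]
    return max(ratios) - min(ratios)


if __name__ == "__main__":
    # Hypothetical DataNode: three disks at 90% plus one freshly added empty disk.
    disks = [(3600, 4000), (3600, 4000), (3600, 4000), (0, 4000)]
    print(f"Node utilization: {node_utilization(disks):.1%}")
    print(f"Intra-node skew:  {intra_node_skew(disks):.1%}")
```

In this made-up case the node as a whole is only about two-thirds full, so a balancer that compares nodes sees nothing wrong, yet three of the four disks are close to full.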
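The idea of a server-side mount table in Router-Based Federation can be sketched as a longest-prefix lookup. The Python below is a conceptual illustration only: the mount table contents, the nameservice names (ns0, ns1, ns2) and the resolve function are all hypothetical, and the real Router implementation may work differently in its details.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mount table: mount point -> (nameservice, path inside that nameservice).
MOUNT_TABLE: Dict[str, Tuple[str, str]] = {
    "/": ("ns0", "/"),
    "/data": ("ns1", "/data"),
    "/data/logs": ("ns2", "/logs"),
}


def resolve(path: str, mount_table: Dict[str, Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Map a client path to (nameservice, remote_path) using the longest matching mount point."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    best = None
    for mount in mount_table:
        # A mount point matches if it is the root, the path itself, or a parent directory.
        matches = mount == "/" or path == mount or path.startswith(mount + "/")
        if matches and (best is None or len(mount) > len(best)):
            best = mount
    if best is None:
        return None
    nameservice, target = mount_table[best]
    remainder = path[len(best):].lstrip("/")
    remote_path = target.rstrip("/") + "/" + remainder if remainder else target
    return nameservice, remote_path


if __name__ == "__main__":
    for p in ["/data/logs/app.log", "/data/raw/part-0", "/tmp/scratch"]:
        print(p, "->", resolve(p, MOUNT_TABLE))
```

Because the table lives in the routing layer, clients keep using a single namespace path and do not need their own copy of the mappings, which is the contrast with client-side ViewFs drawn in the federation item.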
Tools/Information used for writing this blog: