Tag : arti-wadhwani

post image

Hadoop v3 Offerings

This blog will cover some new feature which Hadoop V3 has to offer for existing or new Hadoop customers. And it’s a nice idea to familiarize yourself with these features ,so that incase you want to move to Hadoop or upgrade your cluster from an older version you will be aware what you can try and experiment with your cluster!!!

I will be covering installation and upgrade to Hadoop v3 in separate blogs as this one has a strict focus area towards features of Hv3.

The Overview:

So, let’s have a look at the history of Hadoop version 3 which was released end of last year on 13-December-2017. What a nice Christmas surprise to the community!!! All thanks to the dedicated hard working committees for their dedication to making this happen.

As per Apache Hadoop website, the timeline of v3 version looks like this:

And the progress chart of Hadoop v3 looks like this:

After four alpha releases and one beta release, 3.0.0 is generally available. 3.0.0 consists of 302 bug fixes, improvements, and other enhancements since 3.0.0-beta1. All together, 6242 issues were fixed as part of the 3.0.0 release series since 2.7.0.

If you are more keen on details about the JIRA reported and addressed then you can have a look at the below-provided link:

  1. Changes
  2. Release Notes

The salient features of Hadoop v3:

As we have already taken a look at the history, let me jot down some features introduced as part of this new release :

  1. Minimum required Java version increased from Java 7 to Java 8
  2. Support for erasure coding in HDFS
  3. YARN Timeline Service v.2
  4. Shell script rewrite
  5. Shaded client jars
  6. Support for Opportunistic Containers and Distributed Scheduling
  7. MapReduce task-level native optimization
  8. Support for more than 2 NameNodes
  9. Default ports of multiple services have been changed
  10. Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors
  11. Intra-data node balancer
  12. Reworked daemon and task heap management
  13. S3Guard: Consistency and Metadata Caching for the S3A filesystem client
  14. HDFS Router-Based Federation
  15. The API-based configuration of Capacity Scheduler queue configuration
  16. YARN Resource Types

Now, I will be cover details of the features which are part of my favourite list and would help readers to understand it technically. Note: At this point, I can’t cover in-depth details of each feature as this will make blog clumsy and boring which I don’t want at all.

  1. Hadoop Erasure Coding: Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication.Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.To understand more about this feature you can refer to the listed link:
  2. Namenode HA with more than 2 nodes: In this feature, a customer can have more than two name nodes as an Active/Passive node. In the earlier release, we had HA name node which is an Active/Passive method of implementation with only one name node failure tolerance. In this new feature to achieve the higher degree of tolerance, a customer can implement HA for name node with having more than two name nodes and quorum general manager for fencing.
  3. Changes in default ports of multiple services:  With this feature Hadoop services such as NameNode, Secondary NameNode, DataNode, and KMS ports are now moved out of Linux ephemeral port range (32768-61000). In earlier version having these services ports in ephemeral port range sometimes conflicts with other application and create a problem in service startups.
  4. Intra-data node balancer: Remember the below command for balancing the Hadoop cluster when we add new data nodes to our cluster or to achieve more admin specific tasks in the cluster. However, adding or replacing disks can lead to significant skew within a DataNode. This situation was not handled by the earlier version of Hadoop HDFS balancer utility, which concerns itself with inter-, not intra-, DN skew.  In the new feature, this is been taken care and can handle inter-balancing in data nodes.

    hdfs balancer

  5. HDFS Router-Based Federation: HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. This is similar to the existing ViewFs) and HDFS Federation functionality, except the mount table, is managed on the server-side by the routing layer rather than on the client.
  6. Yarn Timeline v2 service: Timeline v2 addresses two major challenges: improving scalability and reliability of Timeline Service, and enhancing usability by introducing flows and aggregation which were lacking in the earlier version.
  7. Yarn Resources types: In this feature user defined countable resources is enabled using which a Hadoop cluster admin can define the countable resources like  GPU, S/W licenses or locally attached storage. This also includes the CPU and memory which was part of earlier releases.

Tools/Information used for writing this blog:

  1. Sketching
  2. Timeline Graphs
  3. Information extracted from Apache Hadoop


Please feel free to comment if you need any further help on this. Happy Hadooping!!  :)

facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Setup SQL Based authorization in hive

In this tutorial we will see how to setup SQL Based authorization in hive.


Step 1 – Goto ambari UI and add/modify below properties


Goto service hive → configs and change autherization to SQLStdAuth



Step 2 – In Hive-site.xml, make sure you have set below properties:


hive.server2.enable.doAs --> false
hive.users.in.admin.role --> root (comma separated list of users)


Step 3 – Make sure that you have below properties set in Hiveserver2-site.xml


hive.security.authorization.manager --> org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory
hive.security.authorization.enabled --> true
hive.security.authenticator.manager --> org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator


Step 4 – Restart hive services from ambar UI


Step 5 – Testing our setup, in this we will create one readonly user and try to drop table.


5.1 Login to beeline using root ( as we have added root in hive.users.in.admin.role )


0: jdbc:hive2://localhost:10010> !connect jdbc:hive2://localhost:10010Connecting to jdbc:hive2://localhost:10010
Enter username for jdbc:hive2://localhost:10010: root
Enter password for jdbc:hive2://localhost:10010: **** 
Connected to: Apache Hive (version Hive JDBC (version isolation: 
TRANSACTION_REPEATABLE_READ1: jdbc:hive2://localhost:10010>


5.2 Now by default there is only one role – public, you need to run below command to set your role as ADMIN.


0: jdbc:hive2://localhost:10010> SHOW CURRENT ROLES;
|  role  |
| public  |


5.3 Set role as admin for user root



1: jdbc:hive2://localhost:10010> set role ADMIN;
No rows affected (0.445 seconds)1: 
jdbc:hive2://localhost:10010> show roles;
|  role  |
| admin  |
| public  |
2 rows selected (0.165 seconds)


5.4 Create new role readonly and add readonly_user in that group


0: jdbc:hive2://slave1.hortonworks.com:10010/> create role readonly;
No rows affected (0.071 seconds)


5.5 Verify that new role has been created successfully


0: jdbc:hive2://slave1.hortonworks.com:10010/> show roles;
|  role  |
| admin  |
| public  |
| readonly  |
3 rows selected (0.051 seconds)
0: jdbc:hive2://slave1.hortonworks.com:10010/>


5.6 Add readonly_user into readonly role


5: jdbc:hive2://slave1.hortonworks.com:10010> grant role readonly to user readonly_user;
No rows affected (0.088 seconds)
5: jdbc:hive2://slave1.hortonworks.com:10010>


5.7 Grant select privileges to role readonly


5: jdbc:hive2://slave1.hortonworks.com:10010> grant select on table batting to role readonly;
No rows affected (0.405 seconds)
5: jdbc:hive2://slave1.hortonworks.com:10010>


5.8 Verify grants for role readonly


0: jdbc:hive2://slave1.hortonworks.com:10010/> show grant role readonly;
| database  |  table  | partition  | column  | principal_name  | principal_type  | privilege  | grant_option  |  grant_time  | grantor  |
| default  | batting  |  |  | readonly  | ROLE  | SELECT  | false  | 1447877696000  | root  |
1 row selected (0.06 seconds)


5.9 Now login to beeline by user readonly_user and try to drop table batting


beeline> !connect jdbc:hive2://slave1.hortonworks.com:10010/
Connecting to jdbc:hive2://slave1.hortonworks.com:10010/
Enter username for jdbc:hive2://slave1.hortonworks.com:10010/: readonly_user
Enter password for jdbc:hive2://slave1.hortonworks.com:10010/: ********
Connected to: Apache Hive (version Hive JDBC (version
0: jdbc:hive2://slave1.hortonworks.com:10010/> drop table batting;
Error: Error while compiling statement: FAILED: HiveAccessControlException Permission denied: Principal [name=readonly_user, type=USER] does not have following privileges for operation DROPTABLE [[OBJECT OWNERSHIP] on Object [type=TABLE_OR_VIEW, name=default.batting]] (state=42000,code=40000)


Note – we are getting an error here because readonly_user does not have permission to drop table batting!


5.10 Let’s try to access some rows from table batting


0: jdbc:hive2://slave1.hortonworks.com:10010/> select * from batting limit 5;
| batting.player_id  | batting.year  | batting.runs  |
| playerID  | NULL  | NULL  |
| aardsda01  | 2004  | 0  |
| aardsda01  | 2006  | 0  |
| aardsda01  | 2007  | 0  |
| aardsda01  | 2008  | 0  |
5 rows selected (0.775 seconds)
0: jdbc:hive2://slave1.hortonworks.com:10010/>


We can see that grants are working and user can see the contents but cannot delete the table.


Please post comments if you need any help! :-)


facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Developing custom maven plugin using Java5 annotations

Maven provides lots of built-in plugins for developers but at some point you may find need of custom maven plugin. Developing custom maven plugin using java5 annotations is very simple and straightforward.




You just need to follow below steps to develop custom maven plugin using Java 5 annotations:




1. Create a new project with pom packaging set to “maven-pom”


2. Add below dependencies to your plugin pom:


i.Maven-plugin-api dependency helps for developing mojos required by custom maven plugin.



ii. Since Maven 3.0 version we can use java 5 annotations to develop custom plugin.with annotations it is not necessasry that mojo super class should be in the same project if your super class also uses annotations. To use annotations in mojos add below dependency to your plugin pom file.



iii.Below dependency is used to not only read Maven project object model files, but to assemble inheritence and to retrieve remote models as required.



iv. If you want to add any test cases or any other 3rd party dependencies add them.


3.Maven plugin tools looks classes with @Mojo annotation any class annotated with @Mojo will be added to plugin configuration file.




import org.apache.maven.plugin.AbstractMojo;
import org.apache.maven.plugin.MojoExecutionException;
import org.apache.maven.plugin.MojoFailureException;
import org.apache.maven.plugins.annotations.Mojo;
public class CustomMojo extends AbstractMojo{
            public void execute() throws MojoExecutionException, MojoFailureException {
            getLog().info("Successfully created custom maven plugin");
                        * your businees logic goes here


The “name” parameter of mojo annotation is your plugins name,your plugin will be recognised with this name.You mojo class extends AbstractMojo class.AbstractMojo class implements mojo interface and set logging for your plugin.AbstractMojo sets log4j based logging getLog() method provides info,error,debug,warn levels of logging.


Mojo interface is having execute method which will contain business logic of plugin.execute method throws 2 kinds of execptions:


i. MojoFailureException : If expected probelm occurs throwing this exception causes Build Failure message to be displayed.Throwing this exception causes build failure.


ii. MojoExecutionException : If unexcepted probelm occurs throwing this exception causes Build error message to be displayed.


4.You can execute your plugin from command line by providing following command:


mvn pluginGroupId:artifactID:version:mojoName

to shorten the command to be executed for plugin add below lines to maven’s settings.xml file in pluginGroups section. This will tell maven to search repository for this groupID:

<pluginGroups><pluginGroup>plugin group id</pluginGroup></pluginGroups>

After this you can run your plugin simply by providing goal prefix and mojo name command to run plugin will be like this:

mvn goalPrefix:mojoName


5. configuring goalPrefix:

To Create goalPrefix add plugin maven-plugin-plugin to maven plugin pom.It is used to create plugin descriptor for any mojo’s found in source tree to include in jar.it can be used for generating report files for mojo’s updating plugin registry.


                                                            <goalPrefix>your goalPrefix</goalPrefix>
                                                            <parameter1>custom param1</parameter1>
                                                            <parameter2>custom param2</parameter2>


6. You can pass external parameters to your plugin from command line and also you can set default values for you parameters if they are not send from command line.



@Parameter(property = "param1", defaultValue = "abc")
            private String type;


command to run plugin by passing parameter is :

mvn goalPrefix:mojoName -Dparam1='acd';


if you set required parameter of property to false then there is no compulsion of passing parameter from command line.

As we all know that maven has default structure of scanning source files in src/main/java structure and test files in src/test folded,similarly if you want your plugin to scan files in particular folder in your project structure then you can do this by adding org.apache.maven.project.MavenProject property

to your project.


            @Parameter(defaultValue = "${project}", readonly = true, required = true)
            private MavenProject project;


you can also set default values to your parameters by setting parameter property name tag in maven-plugin-plugin plugin’s configuration section.if you don’t want to set default property then keep these custom property fields blank.these property fields are the property values you mentioned in mojo.


9. Using your plugin in main project: for using your plugin in another project add plugins dependency in build section of your project.



           <groupId>plugin's group id</groupId>
           <artifactId>>plugin's artifact id</artifactId>
           <version>versin of plugin</version>


10. To release plugin copy plugin’s jar and other dependent jars from your m2 repository and release it to client or qa.


Here we are done with developing custom maven plugin with java 5 annotations.please let me know if you have any doubts or suggestions.The sample project is present on github you can download it from below link:




facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Unable to delete STORM REST API service component after hdp upgrade

Unable to delete STORM REST API service component after hdp upgrade to hdp2.2.0.0 ? Relax! You are at right place, this guide will show you how to handle these kind of errors.




Initially I had installed hdp2.1 with ambari 1.7, then I upgraded ambari to 2.1.2 and upgraded hdp stack to as per this documentation.


As per below note mentioned in the documentation, I had to delete STORM_UI_SERVER component:


“In HDP 2.2, STORM_REST_API component was deleted because the service was moved into STORM_UI_SERVER. When upgrading from HDP 2.1 to 2.2, you must delete this component using the API as follows”


I tried deleting it using Ambari REST APIs:


First stop the component using ambari API command:

curl -u admin:admin -X PUT -H 'X-Requested-By:1' -d '{"RequestInfo":{"context":"Stop Component"},"Body":{"HostRoles":{"state":"INSTALLED"}}}' http://hdpambari.hortonworks.com:8080/api/v1/clusters/c1/hosts/hdpambari.hortonworks.com/host_components/STORM_REST_API


Then Delete using below curl call.

curl -u admin:admin -X DELETE -H 'X-Requested-By:1' http://hdpambari.hortonworks.com:8080/api/v1/clusters/hdpambari/services/STORM/components/STORM_REST_API


Please note in above commands:

admin:admin - my username and password for ambari UI
hdpambari.hortonworks.com:8080 - my hostname where ambari-server is installed and port number of ambari server
hdpambari - my cluster name


Above commands did not work because of an error given below.

"message" : "org.apache.ambari.server.controller.spi.SystemException: An internal system exception occurred: Could not delete service component from cluster. To remove service component, it must be in DISABLED/INIT/INSTALLED/INSTALL_FAILED/UNKNOWN/UNINSTALLED/INSTALLING state


The only option was to remove this component completely from ambari database and restart ambari-server/agent processes.


Here is the short summary of what commands I ran:

[root@hdpambari ~]# psql ambari ambari
Password for user ambari:    #default password is "bigdata"
psql (8.4.20)
"help" for help.
ambari=> delete from hostcomponentstate where component_name='STORM_REST_API';
ambari=> delete from hostcomponentdesiredstate where component_name='STORM_REST_API';
ambari=> delete from servicecomponentdesiredstate where component_name='STORM_REST_API';
ambari=> commit;
WARNING: there is no transaction in progress
ambari=> \q
[root@hdpambari ~]#


This resolved my issue. hope this helps!


facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Writing Custom Gradle Plugin using Java

What is Gradle ?

Gradle is build automation tool based on Apache ant and Maven. Gradle avoids traditional .xml file based configuration by introducing groovy based domain specific language. In gradle project have .gradle files instead of .pom files. Gradle was designed for multi-project builds and supports incremental build.

Gradle plugin groups together reusable pieces of build logic which can be then used across many different projects and builds. We can use any language whose compiled code gets converted to byte code for  developing custom gradle plugin. As gradle is mainly designed using groovy language its very easy to develop gradle plugin using groovy but  lets see how to develop custom gradle plugin using Java language:


Here are the steps :


1. Create new Java project using eclipse or any other IDE.


2. Create plugin folder in your project which will have source code of your custom gradle plugin.


3. create your package structure in this plugin folder.

    Eg: com.sample.gradle

This package structure will have your custom plugins source code.


4. Create Plugin class which will implement Plugin<Project> interface.Plugin is represents extension to gradle.This interface is having apply method which applies configuration to gradle Project object.

Eg :

package com.sample.gradle;
import org.gradle.api.Project;
import org.gradle.api.Plugin;
import com.sample.gradle.SamplePluginExtension;
public class SampleGradlePlugin implements Plugin<Project> {
    public void apply(Project target) {
                                                                SamplePluginExtension.class);    }


5. All the user defined values to custom plugin are provided through extension object so creaete extension class and register it with plugin as shown above to receive inputs from user.If user doesnot provided input then default values will be assumed.


6. Create extension class which is similar to java pojo class it will contain user defined properties and their getter setter methods.If user provides values for these properties during run time then these values will be accepted otherwise default values will be considered.


Eg :

public class SamplePluginExtension {
                private String sampleFilePath="/home/mahendra/abc”;
                public void setSampleFilePath(String sampleFilePath){
                public String getSampleFilePath(){
                                return sampleFilePath;


7. Create task class which will have your plugin logic this task class contains your main plugin logic.This task class extends org.gradle.api.DefaultTask class and defines method with @TaskAction annotation.This method will have actual logic.


package com.sample.gradle;
import org.gradle.api.DefaultTask;
import org.gradle.api.tasks.TaskAction;
import org.gradle.api.tasks.TaskExecutionException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class SampleTask extends DefaultTask {
                private final Logger log = LoggerFactory.getLogger(this.getClass());
                public void samplePluginTasks() throws TaskExecutionException {
                                log.info("Starting  sample task");
                                try {
                                                SamplePluginExtension extension = getProject().getExtensions()
                                                String filePath = extension.getSampleFilePat
                                                log.debug("Sample file path is: {}",filePath
                                                /* Here comes
                                                 * your main logic
                                                log.info("Successfully completed sample Task
                                }catch(Exception e){
                                                throw new TaskExecutionException(this,new Ex
eption("Exception occured while processing sampleTask",e));


For logging into your custom plugin use slf4j or any other logging framework of your choice.If you want to fail the build on exception in your task then throw TaskExecutionException which will cause BuildFailures other exceptions will not cause build failure.TaskExecutionException accepts task object and Throwable object as input.


Here DefaultTask is standard gradle task implementation class and we need to extend it while implementing custom tasks.@TaskAction annotation makes method action method and whenever task executes this method will be executed.


8. Registering plugin class : create resources folder in plugin/src/main folder.Inside resources create META-INF/gradle-plugin folder, in this gradle-plugin folder create properties file.Name of this property file is used for registering plugin in build.gradle build file.


eg: sample-plugin.properties:

In sample-plugin.properties file add following line:


value of implementation-class if path of plugin class.


9. Create settings.gradle file in your plugin folder which will have below line:

rootProject.name = 'customGradlePlugin'

value of rootProject.name is name of your root project.


10. Create build.gradle build file for you plugin in plugin folder as follows:

apply plugin: 'java'
dependencies {
    compile gradleApi()
apply plugin: 'maven-publish'
repositories {
dependencies {
    compile 'org.slf4j:slf4j-simple:1.6.1'
    testCompile 'junit:junit:4.11'
group = 'com.sample.gradle'
version = '1.0.0'
publishing {
                publications {
                                maven(MavenPublication) {
                                                groupId "$group"
                                                artifactId 'CustomGradlePlugin'
                                                version "$version"
                                                from components.java
uploadArchives {
    repositories {

as we are developing gradle plugin using java add apply plugin:’java’ line.whatever external dependencies your plugin is dependent upon add these dependencies in dependencies section.the repositories in which your plugin should look for should be mentioned in repositories section.mentioned group id,version of plugin in group and version tag.

For making plugin available to other projects gradle plugin should be published to repository or its archives should be uploaded for this purpose either use publishing or uploadArchives functionality.


for publishing plugin use following command if you are publishing plugin to local maven repository:

“gradle clean build publishToMavenLocal”


If you are uploading plugin to local maven repository then use below command:

“gradle clean build uploadToArchives”


11. For using plugin in another projects make following changes in build.gradle file of your project.

apply plugin: 'java'
buildscript {
repositories {
    dependencies {
                                classpath "com.sample.gradle:CustomGradlePlugin:1.0.0"
apply plugin: 'sample-plugin'
 task sampleTask(type: com.sample.gradle.SampleTask) {
 samplePlugin.sampleFilePath = "$sampleFilePath"


here plugin dependency must be defined in buildscript section and  to tell gradle which repositories to scan for getting plugin dependencies  add repositories section in buildscript section this repository section must come ahead of dependencies section.Afer this add apply plugin line.

Sample task provided will be used for executing our plugin logic in task value of type is path of our custom task.whatever custom arguments we want to provide to plugin that we need to define in task section .if we want to run with default parameters then comment samplePlugin.sampleFilePath line in task section.

To run plugin with custom parameter use below command:

gradle clean build sampleTask -PsampleFilePath='/home/mahendra/abc.sample'


To run plugin without custom parameter use below command:

gradle clean build sampleTask


Sample plugin project structure is as follows:




Here we are done with developing gradle custom plugin with java.


please let me know if you have any issues or suggestions.The sample project is present on github you can download it from below link:



Please do not forget to comment your feedback on this Article! :-)




facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Kafka integration with Ganglia

In this blog post I will show you kafka integration with ganglia, this is very interesting & important topic for those who want to do bench-marking, measure performance by monitoring specific Kafka metrics via ganglia.

Before going ahead let me briefly explain about what is Kafka and Ganglia.

Kafka – Kafka is open source distributed message broker project developed by Apache, Kafka provides a unified, high-throughput, low-latency platform for handling real-time data feeds.

Ganglia – Ganglia is distributed system for monitoring high performance computing systems such as grids, clusters etc.


Now lets get started, In this example we have a Hadoop cluster with 3 Kafka brokers, First we will see how to install and configure ganglia on these machines.


Step 1: Setup and Configure Ganglia gmetad and gmond


First thing is you need to install EPEL repo on all the nodes

yum install epel-release

On master node (ganglia-server) download below packages

yum install rrdtool ganglia ganglia-gmetad ganglia-gmond ganglia-web httpdphpaprapr-util

On slave nodes (ganglia-client) download below packages

yum install ganglia-gmond

On master node do the following

chown apache:apache -R /var/www/html/ganglia

Edit below config file and allow ganglia webpage from any IP

vi /etc/httpd/conf.d/ganglia.conf

It should look like below:

# Ganglia monitoring system php web frontend
Alias /ganglia /usr/share/ganglia
<Location /ganglia>
Order deny,allow
Allow from all                    #this is very important or else you won’t be able to see ganglia web UI
Allow from
Allow from ::1
# Allow from .example.com

On master node edit gmetadconfig file and it should look like below (Please change highlighted IP address to your ganglia-server private IP address)

#cat /etc/ganglia/gmetad.conf |grep -v ^# 
data_source "hadoopkafka" 
gridname "Hadoop-Kafka"
setuid_username ganglia
case_sensitive_hostnames 0

On master node edit gmond.conf, keep other parameters to default except below ones

Copy gmond.conf to all other nodes in the cluster

cluster {
name = "hadoopkafka"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
/* The host section describes attributes of the host, like the location */
host {
location = "unspecified"
/* Feel free to specify as many udp_send_channels as you like. Gmond
used to only support having a single channel */
udp_send_channel {
#bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname. Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
#mcast_join =
host =
port = 8649
#ttl = 1
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
#mcast_join =
port = 8649
#bind =
#retry_bind = true
# Size of the UDP buffer. If you are handling lots of metrics you really
# should bump it up to e.g. 10MB or even higher.
# buffer = 10485760

Start apache service on master node

service httpd start

Start gmetad service on master node

service gmetad start

Start gmond service on every node in the server

service gmond start



This is it!  Now you can see basic ganglia metrics by visiting web UI at http://IP-address-of-ganglia-server/ganglia


Step 2: Ganglia Integration with Kafka


Enable JMX Monitoring for Kafka Brokers

In order to get custom Kafka metrics we need to enable JMX monitoring for Kafka Broker Daemon.

To enable JMX Monitoring for Kafka broker, please follow below instructions:

Edit kafka-run-class.sh and modify KAFKA_JMX_OPTS variable like below (please replace red text with your Kafka Broker hostname)

KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=kafka.broker.hostname -Djava.net.preferIPv4Stack=true"

Add below line in kafka-server-start.sh (in case of Hortonworks hadoop, path is /usr/hdp/current/kafka-broker/bin/kafka-server-start.sh)

export JMX_PORT=${JMX_PORT:-9999}


That’s it! Please do the above steps on all Kafka brokers and restart the kafka brokers ( manually or via management UI whatever applicable)


Verify that JMX port has been enabled!

 You can use jconsole to do so.


Download, install and configure jmxtrans

Download jmxtrans rpm from below link and install it using rpm command



Once you have installed jmxtrans, please make sure that java &jps configured in $PATH variable


Write a JSON for fetching MBeans on each Kafka Broker.


I have written JSON for monitoring custom Kafka metrics, please download it from here.

Please note that, you need to replace “IP_address_of_kafka_broker” with your kafka broker’s IP address in downloaded JSON, same is the case for ganglia server’s IP address.


Once you are done with writing JSON, please verify the syntax using any online JSON validator( http://jsonlint.com/ ).


Start the jmxtrans using below command

cd /usr/share/jmxtrans/
sh jmxtrans.sh start $name-of-the-json-file

Verify that jmxtrans has started successfully using simple “ps” command

Repeat above procedure on all Kafka brokers


Verify custom metrics

Login to ganglia server and go to rrd directory ( by default it is /var/lib/ganglia/rrds/ ) and check if there are new rrd files for kafka metrics.


You should see output like below (output is truncated)







Go to ganglia web UI –>  select hadoopkafka from below highlighted dropdown







Select “custom.metrics” from below highlighted dropdown







That’s all! :-)


Note – You can choose which Mbeans to monitor by adding/removing them from json file.


Please share your feedback on this blog via comments! :-)

facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Minimum user id error while submitting mapreduce job

Hello everyone! Hope you are enjoying our blogs on crazyadmins.com

This is a small blog to help you all solve this minimum user id error while submitting mapreduce job in Hadoop.



Troubleshoot hadoop problems

crazyadmins.com – We know Hadoop!


Application application_XXXXXXXXX_XXXX failed 2 times due to AM Container for appattempt_ XXXXXXXXX_XXXX _XXXXXX exited with exitCode: -1000
For more detailed output, check application tracking page:http://<your RM host>:8088/proxy/application_ XXXXXXXXX_XXXX /Then, click on links to logs of each attempt.
Diagnostics: Application application_ XXXXXXXXX_XXXX initialization failed (exitCode=255) with output: Requested user hive is not whitelisted and has id 501, which is below the minimum allowed 1000
Failing this attempt. Failing the application.



This means that there is a property (for minimum allowed UID value) which has been set with value 1000. In above example, it appears that the hive user has UID as 501 which is less than 1000.


To solve this, we either need to:


  1. Update UID of user hive to a unique value greater than or equal to 1000


  1. Update the property value to 500 so that hive user UID meets the minimum value.


We will go with option 2 here i.e. Update the property value to 500 so that hive user UID meets the minimum value.

If you are using Ambari to manage your Hortonworks cluster, then:

1. Login to Ambari UI with user having privileges to edit configurations.

2. Navigate to YARN configurations.

3. Go to Advanced yarn-env.sh

4. Update “Minimum user ID for submitting job” to 500


Resubmit your job now and it should just run fine!

Similarly if you are using Cloudera Manager, find the same property in YARN configurations and update it.

If you are not using any of the Management UIs, then you can try finding this property in the conf directory where your hadoop conf files are located. Usually you will find them in /etc/hadoop/conf location.

Enjoy Hadooping!! :-)

facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Tune Hadoop Cluster to get Maximum Performance (Part 2)

In previous part we have seen that how can we tune our operating system to get maximum performance for Hadoop, in this article I will be focusing on how to tune hadoop cluster to get performance boost on hadoop level :-)



Tune Hadoop Cluster to get Maximum Performance (Part 2) – http://crazyadmins.com


Before I actually start explaining tuning parameters let me cover some basic terms that are required to understand Hadoop level tuning.


What is YARN?

YARN – Yet another resource negotiator, this is Map-reduce version 2 with many new features such as dynamic memory assignment for mappers and reducers rather than having fixed slots etc.


What is Container?

Container represents allocated Resources like CPU, RAM etc. It’s a JVM process, in YARN AppMaster, Mapper and Reducer runs inside the Container.


Let’s get into the game now:


1. Resource Manager (RM) is responsible for allocating resources to mapreduce jobs.

2. For brand new Hadoop cluster (without any tuning) resource manager will get 8192MB (“yarn.nodemanager.resource.memory-mb”) memory per node only.

3. RM can allocate up to 8192 MB (“yarn.scheduler.maximum-allocation-mb”) to the Application Master container.

4. Default minimum allocation is 1024 MB (“yarn.scheduler.minimum-allocation-mb”).

5. The AM can only negotiate resources from Resource Manager that are in increments of (“yarn.scheduler.minimum-allocation-mb”) & it cannot exceed (“yarn.scheduler.maximum-allocation-mb”).

6. Application Master Rounds off (“mapreduce.map.memory.mb“) & (“mapreduce.reduce.memory.mb“) to a value devisable by (“yarn.scheduler.minimum-allocation-mb“).


What are these properties ? What can we tune ?



Default value is 1024m

Sets the minimum size of container that YARN will allow for running mapreduce jobs.




Default value is 8192m

The largest size of container that YARN will allow us to run the Mapreduce jobs.




Default value is 8GB
Total amount of physical memory (RAM) for Containers on worker node.

Set this property= Total RAM – (RAM for OS + Hadoop Daemons + Other services)




Default value is 2.1

The amount of virtual memory that each Container is allowed

This can be calculated with: containerMemoryRequest*vmem-pmem-ratio




These are the hard limits enforced by Hadoop on each mapper or reducer task. (Maximum memory that can be assigned to mapper or reducer’s container)

Default value – 1GB




The heapsize of the jvm –Xmx for the mapper or reducer task.

This value should always be lower than mapreduce.[map|reduce].memory.mb.

Recommended value is 80% of mapreduce.map.memory.mb/ mapreduce.reduce.memory.mb




The amount of memory for ApplicationMaster




heapsize for application Master




The number of cores that a node manager can allocate to containers is controlled by the yarn.nodemanager.resource.cpu-vcores property. It should be set to the total number of cores on the machine, minus a core for each daemon process running on the machine (datanode, node manager, and any other long-running processes).




Default value – 100MB

This is very important property to tune, when map task is in progress it writes output into a circular in-memory buffer. The size of this buffer is fixed and determined by io.sort.mb property

When this circular in-memory buffer gets filled (mapreduce.map. sort.spill.percent: 80% by default), the SPILLING to disk will start (in parallel using a separate thread). Notice that if the splilling thread is too slow and the buffer is 100% full, then the map cannot be executed and thus it has to wait.




Hadoop uses buffer size of 4KB by default for its I/O operations, we can increase it to 128K in order to get good performance and this value can be increased by setting io.file.buffer.size= 131072 (value in bytes) in core-site.xml




Short-circuit reads – When reading a file from HDFS, the client contacts the datanode and the data is sent to the client via a TCP connection. If the block being read is on the same node as the client, then it is more efficient for the client to bypass the network and read the block data directly from the disk.

We can enable short-circuit reads by setting this property to “true”




Default value is 10.

Now imagine the situation where map task is running, each time the memory buffer reaches the spill threshold, a new spill file is created, after the map task has written its last output record, there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file.

The configuration property mapreduce.task.io.sort.factor controls the maximum number of streams to merge at once.




Default value is 5

The map output file is sitting on the local disk of the machine that ran the map task

The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes

The reduce task has a small number of copier threads so that it can fetch map outputs in parallel.

The default is five threads, but this number can be changed by setting the mapreduce.reduce.shuffle.parallelcopies property




I tried my best to cover as much as I can, there are plenty of things you can do for tuning! I hope this article was helpful to you. What I recommend you guys is try tuning above properties by considering total available memory capacity, total number of cores etc. and run the Teragen, Terasort etc. benchmarking tool to get the results, try tuning until you get best out of it!! :-)


facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Tune Hadoop Cluster to get Maximum Performance (Part 1)

I have been working on production Hadoop clusters for a while and have learned many performance tuning tips and tricks. In this blog I will explain how to tune Hadoop Cluster to get maximum performance. Just installing Hadoop for production clusters or to do some development POC does not give expected results, because default Hadoop configuration settings are done keeping in mind the minimum hardware configuration. Its responsibility of Hadoop Administrator to understand the hardware specs like amount of RAM, Total number of CPU Cores, Physical Cores, Virtual Cores, Understand if hyper threading is supported by Processor, NIC Cards, Number of Disks that are mounted on Datanodes etc.




For Better Understanding I have divided this blog into two main parts.

1. Tune your Hadoop Cluster to get Maximum Performance (Part 1) – In this part I will explain how to tune your operating system in order to get maximum performance for your Hadoop jobs.

2. Tune your Hadoop Cluster to get Maximum Performance (Part 2) – In this part I will explain how to modify your Hadoop configurations parameters so that it should use your hardware very efficiently.


How OS tuning will improve performance of Hadoop?

Tuning your Centos6.X/Redhat 6.X can increase performance of Hadoop by 20-25%. Yes! 20-25% :-)


Let’s get started and see what parameters we need to change on OS level.


1. Turn off the Power savings option in BIOS:

This will increase the overall system performance and Hadoop Performance. You can go to your BIOS Settings and change it to PerfOptimized from power saving mode (this option may be different for your server based on vendor). If you have remote console command line available then you can use racadm commands to check the status and update it. You need to restart the system in order to get your changes in effect.



2. Open file handles and files:

By default number of open file count is 1024 for each user and if you keep it to default then you may face java.io.FileNotFoundException: (Too many open files) and your job will get failed. In order to avoid this scenario set this number of open file limit to unlimited or some higher number like 32832.



ulimit –S 4096
ulimit –H 32832

Also, Please set the system wide file descriptors by using below command:

sysctl –w fs.file-max=6544018

As above kernel variable is temporary and we need to make it permanent by adding it to /etc/sysctl.conf. Just edit /etc/sysctl.conf and add below value at the end of it





3. FileSystem Type & Reserved Space:

In order to get maximum performance for your Hadoop job, I personally suggest by using ext4 filesystem as it has some advantage over ext3 like multi-block and delayed allocation etc. How you mount your file-system will make difference because if you mount it using default option then there will excessive writes for file or directory access times which we do not need in case of Hadoop. Mount your local disks using option noatime will surely improve your performance by disabling those excessive and unnecessary writes to disks.

Below is the sample of how it should look like:


UUID=gfd3f77-6b11-4ba0-8df9-75feb03476efs /disk1                 ext4   noatime       0 0
UUID=524cb4ff-a2a5-4e8b-bd24-0bbfd829a099 /disk2                 ext4   noatime       0 0
UUID=s04877f0-40e0-4f5e-a15b-b9e4b0d94bb6 /disk3                 ext4   noatime       0 0
UUID=3420618c-dd41-4b58-ab23-476b719aaes  /disk4                 ext4   noatime       0 0


Note – noatime option will also cover noadirtime so no need to mention that.


Many of you must be aware that after formatting your disk partition with ext4 partition, there is 5% space reserved for special operations like 100% disk full so root should be able to delete the files by using this reserved space. In case of Hadoop we don’t need to reserve that 5% space so please get it removed using tune2fs command.



tune2fs -m 0 /dev/sdXY


Note – 0 indicates that 0% space is reserved.



4. Network Parameters Tuning:


Network parameters tuning also helps to get performance boost! This is kinda risky stuff because if you are working on remote server and you did a mistake while updating Network parameters then there can be a connectivity loss and you may not be able to connect to that server again unless you correct the configuration mistake by taking IPMI/iDrac/iLo console etc. Modify the parameters only when you know what you are doing :-)

Modifying the net.core.somaxconn to 1024 from default value of 128 will help Hadoop by as this changes will have increased listening queue between the master and slave services so ultimately number of connections between master and slaves will be higher than before.


Command to modify net.core.somaxconnection:

sysctl –w net.core.somaxconn=1024

To make above change permanent, simple add below variable value at the end of /etc/sysctl.conf




MTU Settings:

Maximum transmission unit. This value indicates the size which can be sent in a packet/frame over TCP. By default MTU is set to 1500 and you can tune it have its value=9000, when value of MTU is greater than its default value then it’s called as Jumbo Frames.


Command to change value of MTU:

You need to add MTU=9000 in /etc/sysconfig/network-scripts/ifcfg-eth0 or whatever your eth device name. Restart the network service in order to have this change in effect.


Note – Before modifying this value please make sure that all the nodes in your cluster including switches are supported for jumbo frames, if not then *PLEASE DO NOT ATTEMPT THIS*



5. Transparent Huge Page Compaction:


This feature of linux is really helpful to get the better performance for application including Hadoop workloads however one of the subpart of Transparent Huge Pages called Compaction causes issues with Hadoop job(it causes high processor usage while defragmentation of the memory). When I was benchmarking our client’s cluster I have observed some fluctuations ~15% with the output and when I disabled it then that fluctuation was gone. So I recommend to disable it for Hadoop.



echo never > /sys/kernel/mm/redhat_transparent_hugepages/defrag


In order to make above change permanent, please add below script in your /etc/rc.local file.

if test -f /sys/kernel/mm/redhat_transparent_hugepage/defrag; then echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag ;fi



6. Memory Swapping:

For Hadoop Swapping reduces the job performance, you should have maximum data in-memory and tune your OS so that it will do memory swap only when there is situation like OOM (OutOfMemory). To do so we need to set vm.swappiness kernel parameter to 0



sysctl -w vm.swappiness=0


Please add below variable in /etc/sysctl.conf to make it persistent.


I hope this information will help someone who is looking for OS level tuning parameters for Hadoop. Please don’t forget to give your feedback via comments or ask questions if any.
Thank you :-) I will publish second part in the next week!



facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Migration of Ambari server in Hortonworks Hadoop 2.2

There could be various scenarios in which migration of Hortonworks Management server (Ambari server) is required. For instance, due to hardware issues with server etc.

In this article we will discuss migration of Ambari server in Hortonworks Hadoop in production cluster. Assuming that your Ambari server has been setup using default DB which is PostgresSQL.

PS: This requires Ambari server downtime. This means that the Management UI won’t be available but the services will continue running in background. Plan accordingly.




Prerequisites :


·         Please ensure that passwordless ssh is setup between new ambari server and other hosts in cluster.

·         Also, the /etc/hosts file should be updated with info of all hosts in the cluster.

·         Java must be installed.

·         HDP and ambari repo must be configured.


This is not a long exercise. So let us begin :-)


1. Stop ambari-server.


ambari-server stop



2. Now we need to backup the Ambari database for migration. In production environments, ideally backups should be scheduled daily/weekly to prevent loss in case of irrecoverable server crash.


  • make a directory to store your backup :
mkdir /tmp/ambari-db-backup
cd /tmp/ambari-db-backup
  • Backup Ambari DB:
pg_dump -U $AMBARI-SERVER-USERNAME ambari > ambari.sql Password: $AMBARI-SERVER-PASSWORD
pg_dump -U $MAPRED-USERNAME ambarirca > ambarirca.sql Password: $MAPRED-PASSWORD


Note : Defaults for $AMBARI-SERVER-USERNAME is ambari-server, $AMBARI-SERVER-PASSWORD is bigdata, $MAPRED-USERNAME is mapred and $MAPRED-PASSWORD is mapred.

These values are configured while ambari-server setup.



3. Stop ambari-agent on each host in the cluster


ambari-agent stop



4. Edit /etc/ambari-agent/conf/ambari-agent.ini and update the hostname to point to the new ambari-server host.



5. Install the ambari-server on the new host.


yum install ambari-server



6. Stop this new ambari server so that we can restore it from the backup of old.


ambari-server stop



7. service postgresql restart



8. Login to postgresql terminal :


su - postgres



9. Drop the newly created DB setup while installing ambari on new host.


drop database ambari;
drop database ambarirca;



10. Check DBs have been dropped.


/list (You have to execute this command in psql teminal)



11. Now create new DBs to restore the old backup into them.


create database ambari;
create database ambarirca;



12. Check DBs have been created.





13. Exit psql prompt



14. Copy the DB backup from old to new ambari host by using scp.



15. Restore DB on new ambari host from backup. PS : you are still logged in as postgres user for this.


psql -d ambari -f /tmp/ambari.sql
psql -d ambarirca -f /tmp/ambarirca.sql



16. Start ambari-server as root now.


ambari-server start



17. Open the Ambari Web UI

http://<new ambari host>:8080   (Open in your favorite browser)



18. Restart a service (say MapReduce) using the Management UI to verify all is working fine.



Done! We have successfully migrated Ambari server to a new host and restored it to its original state. Enjoy! :)

facebooktwittergoogle_plusredditpinterestlinkedinmailby feather