Dillon Woods
CTO @
zData, Inc.

Hadoop Kerberos Guide

July, 2015

This article includes the following sections:

  Introduction
  HDFS Security Model
  Kerberos for Hadoop
  Hadoop Kerberos Client Configuration
  Designing Kerberos Aware Applications
  Conclusion

Introduction

Data is increasingly becoming one of the most valuable assets of any enterprise, so it is no surprise that controlling access to that data is a top priority. Most data platforms have had robust authentication and authorization features for decades. PostgreSQL offers 10 different methods for user authentication while Oracle offers an even longer list of options. Even relative NoSQL newcomers like MongoDB offer all of the standard authentication mechanisms built into their products.

Given the importance of data management and the long legacy of systems that support multiple authentication standards, most people are shocked to find out that Hadoop does not come out of the box with any data security whatsoever. The most common way to close this gap is to “kerberize” Hadoop clusters to force Kerberos authentication, but that approach brings challenges of its own, especially for applications that interact with Hadoop. This article will describe the basic concepts behind securing a Hadoop cluster with Kerberos and designing applications to interact with it.

HDFS Security Model

Data in the Hadoop ecosystem is most often stored in the Hadoop Distributed File System (HDFS). HDFS closely follows the standard POSIX file system model with one very important exception: there is no formalized concept of users or groups. You can set owners or groups for files or directories, but they are simply stored as strings. From the official HDFS documentation:

There is no provision within HDFS for creating user identities, establishing groups, or processing user credentials.

In a standard Unix environment we are given an error message if we attempt to set the owner of a file to a user that doesn’t exist in the system:

$ ls -l test.txt
-rw-r--r--@ 1 dwoods  staff  0 Jul 21 15:21 test.txt
$ chown xxx test.txt
chown: xxx: illegal user name

This is not the case in Hadoop. Since the user and group attributes are simply strings, we can set them to any arbitrary value that we want:

$ hadoop fs -ls -d /user/admin/test
drwxr-xr-x   - hdfs admin          0 2015-07-21 22:07 /user/admin/test
$ hadoop fs -chown xxx /user/admin/test
$ hadoop fs -ls -d /user/admin/test
drwxr-xr-x   - xxx admin          0 2015-07-21 22:07 /user/admin/test

This allows any user to trivially bypass the HDFS security model or to change file permissions at will:

$ whoami
dwoods
$ hadoop fs -chown dwoods /user/admin/test2
chown: changing ownership of '/user/admin/test2': Permission denied. user=dwoods is not the owner of inode=test2
$ HADOOP_USER_NAME=hdfs hadoop fs -chown dwoods /user/admin/test2
$ hadoop fs -ls -d /user/admin/test2
drwxr-xr-x   - dwoods admin          0 2015-07-21 22:11 /user/admin/test2

The HDFS file system authorization model is useless without proper authentication.

Kerberos for Hadoop

The only real option for organizations that want to enforce authentication is to “Kerberize” their Hadoop clusters. At a basic level, this process involves the following steps (a brief sketch of the key commands follows the list):

  1. Set up a local Key Distribution Center (KDC) on the Hadoop cluster. Note that most organizations will already have an existing corporate KDC, but it is recommended to create a separate KDC for the Hadoop cluster.
  2. Create Service Principals for each Hadoop service such as hdfs, mapred, yarn, etc. These can be thought of as the service accounts for each Hadoop component.
  3. Create encrypted Kerberos keys, known as Keytabs, for each Service Principal. Much like SSH keys, the Kerberos Keytabs are required so the services can authenticate and communicate without a password.
  4. Distribute those keytabs to each node of the cluster. Each node of the cluster requires a Keytab for each Service Principal since Keytabs are tied to specific servers or FQDNs.
  5. Configure all services to use Kerberos authentication since cluster communication will be slightly different once Kerberos is enabled.
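
As a rough illustration of steps 2 through 4, the commands below sketch how a keytab might be created and distributed for a single service principal on a single node. The hostname node1.cloudera.mycompany.com and the destination path are placeholders; the exact principals, paths, and distribution mechanism depend on your distribution and its management tooling.

$ kadmin.local -q "addprinc -randkey hdfs/node1.cloudera.mycompany.com@CLOUDERA"
$ kadmin.local -q "xst -k hdfs.keytab hdfs/node1.cloudera.mycompany.com"
$ scp hdfs.keytab node1.cloudera.mycompany.com:/etc/hadoop/conf/hdfs.keytab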

Most common Hadoop distributions such as Cloudera and Hortonworks provide management utilities and guides to aid in the above process.

We will know the above configuration is successful when we can no longer issue Hadoop commands without seeing a Kerberos-related error message:

$ hadoop fs -ls /
15/07/22 13:22:14 WARN security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
15/07/22 13:22:14 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
ls: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: ...

Hadoop Kerberos Client Configuration

Once the Kerberos configuration is complete we can configure the system to allow a user to authenticate and submit jobs directly.

First, we need to create a Principal for the user in Kerberos. You can think of a Principal as a user account, but note that a Principal is different from the Service Principal accounts we created for Hadoop above. You can authenticate using one of the service principals directly, such as ‘hdfs’, but this is not recommended for obvious reasons. We will create this Principal in the Hadoop KDC. If you wish to create the Principal in a different KDC then you must first enable Cross-Realm Authentication between the KDCs.

$ kadmin -q "addprinc dwoods@CLOUDERA"
Authenticating as principal cloudera-scm/admin@CLOUDERA with password.
Password for cloudera-scm/admin@CLOUDERA:
WARNING: no policy specified for dwoods@CLOUDERA; defaulting to no policy
Enter password for principal "dwoods@CLOUDERA":
Re-enter password for principal "dwoods@CLOUDERA":
Principal "dwoods@CLOUDERA" created.

In the above we use the kadmin “addprinc” command to add a new principal named “dwoods”. We are first prompted for our admin password, then we specify a password twice for the new user. Note that “CLOUDERA” in this case corresponds to the Kerberos Realm this user should be created in. A Realm is an arbitrary upper-case string configured in the KDC that often corresponds to the site domain name; the Kerberos documentation covers Realms in detail.

Now that the user is created we can authenticate with the KDC using the kinit command:

$ kinit dwoods@CLOUDERA
Password for dwoods@CLOUDERA:

If the authentication is successful then we will receive a Ticket-Granting Ticket (TGT) from the KDC. This means that we have authenticated with the server, but have not yet been given permission to access any services. We can examine our Ticket Cache and verify that we received the TGT by using the klist command.

$ klist
Ticket cache: FILE:/tmp/krb5cc_501
Default principal: dwoods@CLOUDERA

Valid starting     Expires            Service principal
07/22/15 13:49:53  07/23/15 13:49:53  krbtgt/CLOUDERA@CLOUDERA
    renew until 07/29/15 13:49:53
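
The “renew until” line shows that the ticket is renewable. Before its initial lifetime expires, it can be renewed in place without re-entering a password:

$ kinit -R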

We can now perform a simple HDFS directory listing:

$ hadoop fs -ls /
Found 5 items
drwxr-xr-x   - hbase supergroup          0 2015-07-20 13:46 /hbase
drwxr-xr-x   - solr  solr                0 2014-12-18 04:33 /solr
drwxrwxrwx   - hdfs  supergroup          0 2015-07-20 13:46 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2015-02-02 17:48 /user
drwxr-xr-x   - hdfs  supergroup          0 2014-12-18 04:32 /var

It is important to note that we can no longer use the trick from above to impersonate another user or bypass HDFS permissions:

$ HADOOP_USER_NAME=hdfs hadoop fs -mkdir /user/cloudera/test
mkdir: Permission denied: user=dwoods, access=WRITE, inode="/user/cloudera":cloudera:cloudera:drwxr-xr-x

Creating the Kerberos Principal is all that is required to create a new user since HDFS doesn’t provide any facility for creating or managing users. We can immediately create a home directory for this user and start giving them permission to access files and directories.
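
For example, an administrator authenticated as an HDFS superuser could provision the new user with something like the following (the paths are illustrative):

$ hadoop fs -mkdir /user/dwoods
$ hadoop fs -chown dwoods /user/dwoods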

Note that the kinit command above will only work from a machine where Kerberos has already been configured to communicate with the appropriate KDC. This should already be the case for our Hadoop nodes, but if we wish to authenticate from an external machine, such as a laptop, then we must first update the Kerberos configuration file. This file is commonly named krb5.conf and located in the /etc directory, but its location can be overridden with the KRB5_CONFIG environment variable.

Full documentation for krb5.conf is available online, but the most basic setup should look like this:

[libdefaults]
    default_realm  = CLOUDERA

[realms]
    CLOUDERA = {
        kdc = kdc.cloudera.mycompany.com:88
    }

[domain_realm]
    kdc.cloudera.mycompany.com = CLOUDERA

The above configuration specifies how to reach the KDC for the CLOUDERA realm. Once it is in place, kinit should work as expected from any external machine, provided the machine can resolve the KDC hostname (kdc.cloudera.mycompany.com in this example). If DNS cannot resolve it, we may need to add an entry to /etc/hosts.
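
For example, on a laptop with no DNS entry for the KDC, a hypothetical hosts entry might be added like this (the IP address is a placeholder):

$ echo "192.0.2.10   kdc.cloudera.mycompany.com" | sudo tee -a /etc/hosts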

Finally, we can logout, or remove our TGT, by using the kdestroy command.

$ kdestroy
$ klist
klist: krb5_cc_get_principal: No credentials cache file found

Designing Kerberos Aware Applications

Kerberos can pose a major challenge for software companies trying to design applications or services that are capable of interacting with Hadoop. Instead of simply being another Kerberos client, these applications often act as middleware between users and the Hadoop system. There are really four options for designers of Kerberos-aware applications:

  1. Simple Hadoop Client
  2. Service Account Authorization
  3. Kerberos Impersonation
  4. Single Sign On

Each of these approaches provides an increasing level of security for the application at the expense of being more complicated to implement. These techniques are described in detail below.

Simple Hadoop Client

First, we’ll create a simple Java program that takes an HDFS path and prints the number of files in that directory. The first version of this program won’t contain any authentication; instead, it will depend on the user running it to have already obtained a TGT from the KDC. This is not a valid approach for services or long-running applications, since an administrator would have to manually authenticate with Kerberos and then continually refresh the TGT as it expires.

import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

class FileCount {
    public static void main(String[] args) throws IOException, FileNotFoundException{
        String path = "/";
        if( args.length > 0 )
            path = args[0];

        // Use the default configuration; on a Kerberized cluster the client
        // authenticates with the credentials in the user's ticket cache
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] status = fs.listStatus(new Path("hdfs://quickstart.cloudera:8020" + path));
        System.out.println("File Count: " + status.length);
    }
}
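
One way to compile the program and set up the classpath, assuming the hadoop command is on the PATH (exact classpath handling may vary between distributions):

$ javac -cp "$(hadoop classpath)" FileCount.java
$ export CLASSPATH="$(hadoop classpath):."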

We can run the program on the ‘/user’ directory and then verify it is correct by using the native Hadoop client.

$ java FileCount /user
File Count: 6
$ hadoop fs -ls /user
Found 6 items
drwxr-xr-x   - cloudera cloudera          0 2015-07-20 14:02 /user/cloudera
drwxr-xr-x   - mapred   hadoop            0 2015-02-02 17:46 /user/history
drwxrwxrwx   - hive     hive              0 2014-12-18 04:33 /user/hive
drwxrwxr-x   - hue      hue               0 2015-02-02 17:53 /user/hue
drwxrwxrwx   - oozie    oozie             0 2014-12-18 04:34 /user/oozie
drwxr-xr-x   - spark    spark             0 2014-12-18 04:34 /user/spark

If we are running this against a Kerberized Hadoop cluster then it will only work correctly if we have already obtained a TGT from the KDC using kinit. If we do not have a TGT, then an error like this one will be thrown:

$ kdestroy
$ java FileCount /user
15/07/30 09:19:54 WARN security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
...
Exception in thread "main" java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "quickstart.cloudera/127.0.0.1"; destination host is: "quickstart.cloudera":8020;
...

Service Account Authorization

The first technique for actually integrating an application with a Kerberized Hadoop cluster is the easiest to implement but also the least secure. It involves creating a single service principal for the application. All end users of the application effectively use this same service account for authentication. The obvious downside is that data-level security cannot be defined on a per-user basis at the HDFS level. The application could be designed to handle data authorization itself, but a Hadoop administrator cannot directly control, verify, or audit this. This technique is not acceptable for most enterprise-grade products or services.

To set this up we first create a Service Principal for our application:

$ kadmin -q "addprinc -randkey myapplication/myhost@CLOUDERA"
Authenticating as principal cloudera-scm/admin@CLOUDERA with password.
Password for cloudera-scm/admin@CLOUDERA:
WARNING: no policy specified for myapplication/myhost@CLOUDERA; defaulting to no policy
Principal "myapplication/myhost@CLOUDERA" created.

We created the service principal with the name myapplication/myhost@CLOUDERA which corresponds to the name of our service or application, the host it is intended to be run from, and the Realm it is to be associated with.

The -randkey option was used to specify that a random encryption key should be chosen instead of a user-generated password. This is more secure than choosing a password, since the service principal will only ever authenticate with its keytab, never interactively with a password.

To verify that the service principal was created correctly we can use the getprinc command. The output from this command is truncated below, but we can see that the principal does indeed exist in the KDC.

$ kadmin -q "getprinc myapplication/myhost"
Principal: myapplication/myhost@CLOUDERA
Expiration date: [never]
Last password change: Tue Jul 28 14:35:16 PDT 2015
...

Next we will export a keytab for this service principal. As with the keytabs created when we Kerberized the Hadoop cluster, this keytab will allow our service to authenticate with the KDC without requiring a password.

$ kadmin -q "xst -k myapplication.keytab myapplication/myhost"
Entry for principal myapplication/myhost with kvno 2, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:myapplication.keytab.
Entry for principal myapplication/myhost with kvno 2, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:myapplication.keytab.
Entry for principal myapplication/myhost with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:myapplication.keytab.
Entry for principal myapplication/myhost with kvno 2, encryption type arcfour-hmac added to keytab WRFILE:myapplication.keytab.
Entry for principal myapplication/myhost with kvno 2, encryption type des-hmac-sha1 added to keytab WRFILE:myapplication.keytab.
Entry for principal myapplication/myhost with kvno 2, encryption type des-cbc-md5 added to keytab WRFILE:myapplication.keytab.

By default the xst command will generate a key for each supported encryption type; we could limit this to a single encryption type if desired (see the sketch after the klist output below). We can verify that the keytab was created correctly by using the klist command to make sure it contains the encrypted keys for our service principal.

$ klist -ke myapplication.keytab
Keytab name: FILE:myapplication.keytab
   KVNO Principal
   2 myapplication/myhost@CLOUDERA (aes256-cts-hmac-sha1-96)
   2 myapplication/myhost@CLOUDERA (aes128-cts-hmac-sha1-96)
   2 myapplication/myhost@CLOUDERA (des3-cbc-sha1)
   2 myapplication/myhost@CLOUDERA (arcfour-hmac)
   2 myapplication/myhost@CLOUDERA (des-hmac-sha1)
   2 myapplication/myhost@CLOUDERA (des-cbc-md5)
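
As noted above, the keytab could also be restricted to a single encryption type. With an MIT KDC, one way to do this is the -e option to xst, shown here as a sketch for aes256 only:

$ kadmin -q "xst -e aes256-cts-hmac-sha1-96:normal -k myapplication.keytab myapplication/myhost"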

After it is generated we can move the keytab to the application server, but great care should be taken to make sure that it is protected on disk. It should be readable only by an application user on the server and should be protected in the same manner as an SSH key.
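
A minimal sketch of locking the file down, assuming the application runs under a hypothetical local account named appuser:

$ chown appuser:appuser myapplication.keytab
$ chmod 400 myapplication.keytab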

We can now update our FileCount program to use this keytab rather than relying on a TGT to already exist in the cache. This is done by using the UserGroupInformation class from the Hadoop Security package.

import java.io.*;
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.security.*;

class FileCount {
    public static void main(final String[] args) throws IOException, FileNotFoundException, InterruptedException{

        // Log in from the keytab and obtain a UGI for the application's service principal
        UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI("myapplication/myhost", "myapplication.keytab");
        // Run the HDFS operations as that principal
        ugi.doAs( new PrivilegedExceptionAction<Void>() {
            public Void run() throws Exception {
                String path = "/";
                if( args.length > 0 )
                    path = args[0];

                FileSystem fs = FileSystem.get(new Configuration());
                FileStatus[] status = fs.listStatus(new Path("hdfs://quickstart.cloudera:8020" + path));
                System.out.println("File Count: " + status.length);

                return null;
            }
        } );
    }
}

Above we specify the name of our service principal and the path to the keytab file we generated. As long as that keytab is valid our program will use the desired service principal for all actions, regardless of whether or not the user running the program has already authenticated and received a TGT. We can verify this works by using our kdestroy example from earlier:

$ kdestroy
kdestroy: No credentials cache found while destroying cache
$ java FileCount /user
File Count: 6

Kerberos Impersonation

We can improve on the above setup by configuring Hadoop to allow our application to impersonate individual users. This means that HDFS operations will be performed as the user logged into our application instead of a central service account. This is a much better approach because Hadoop administrators can now centrally control data authorization inside HDFS. At the same time it leaves something to be desired, since we must now rely on the application to always impersonate the correct user; otherwise, application-level security loopholes will be exposed.

The Kerberos setup is very similar to the above: a service principal must be created and a keytab must be generated and exported to the application server. In addition, we must also make some configuration changes to Hadoop itself. Secure Impersonation can be enabled by setting the following properties in core-site.xml:

<property>
    <name>hadoop.proxyuser.myapplication.groups</name>
    <value>group1,group2</value>
</property>
<property>
    <name>hadoop.proxyuser.myapplication.hosts</name>
    <value>apphost</value>
</property>

Note that myapplication in the above configuration should be replaced with the name of your service principal. “group1,group2” should be replaced with a list of groups whose members myapplication is allowed to impersonate. This could be set to a wildcard (*) to allow any user to be impersonated, but it is much better security practice to restrict it to the groups of users who have access to the application. Finally, “apphost” should be replaced with the hostname of the application server, or a comma-separated list of hostnames if applicable.

We also need to enable service-level authorization by setting the hadoop.security.authorization property to “true” in core-site.xml. This is often set to true by default in popular Hadoop distributions such as Cloudera.

<property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
</property>

Once the above Hadoop configuration changes have been made we can update our FileCount program to use secure Kerberos impersonation. The main difference from the previous version is that we now also call the createProxyUser method after the initial login. Note that in this example the user name “dwoods” is hard coded, but in a real world scenario we would get the username from the application request context.

import java.io.*;
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.security.*;

class FileCount {
    public static void main(final String[] args) throws IOException, FileNotFoundException, InterruptedException{

        // Log in as the application's service principal from the keytab
        UserGroupInformation app_ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI("myapplication/myhost", "myapplication.keytab");
        // Create a proxy UGI so that HDFS operations run as the end user ("dwoods" here)
        UserGroupInformation proxy_ugi = UserGroupInformation.createProxyUser("dwoods", app_ugi);

        proxy_ugi.doAs( new PrivilegedExceptionAction<Void>() {
            public Void run() throws Exception {
                String path = "/";
                if( args.length > 0 )
                    path = args[0];

                FileSystem fs = FileSystem.get(new Configuration());
                FileStatus[] status = fs.listStatus(new Path("hdfs://quickstart.cloudera:8020" + path));
                System.out.println("File Count: " + status.length);

                return null;
            }
        } );
    }
}

As before our sample application returns the correct file count, but this time the operation is actually performed as the “dwoods” user. In a real deployment this would be whichever user is logged into our application.

$ java FileCount /user
File Count: 6

If the Hadoop proxy configuration described above were incorrect or missing, we would see an error like this:

$ java FileCount /
15/07/30 10:54:43 WARN security.UserGroupInformation: PriviledgedActionException as:dwoods (auth:PROXY) via myapplication/myhost@CLOUDERA (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: myapplication/myhost@CLOUDERA is not allowed to impersonate dwoods

The open source projects Oozie and Hive both provide examples of the Secure Impersonation authorization technique.

Single Sign On

The limitation of the above approach is that you place your trust in the application to perform impersonation correctly. If the application is poorly designed, or if an attacker gains access to the application server, they can effectively impersonate any user for any purpose. The solution is to implement true “pass through” or Single Sign On (SSO) security. With this method the application requests an authenticated ticket from the end user and then uses that ticket to authenticate with Hadoop. The basic steps for an SSO request look like this:

  1. User authenticates with Kerberos locally on their machine
  2. User configures their browser to add the application to its list of trusted sites and to respond to authentication challenges with tickets from the local ticket cache
  3. User makes a request to the application
  4. Application requests Kerberos ticket from User
  5. Application passes through ticket to Hadoop service
  6. Hadoop service verifies ticket with KDC and responds to request

How you implement the above will depend on your web application framework, but options exist for most popular frameworks. Web servers such as Nginx and Apache also offer modules that assist with Kerberos SSO authentication using GSSAPI.
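
A quick way to see this kind of ticket pass-through in action from the command line is curl’s SPNEGO support against WebHDFS, assuming WebHDFS is enabled and the NameNode web UI is on the Hadoop 2 default port of 50070:

$ kinit dwoods@CLOUDERA
$ curl --negotiate -u : "http://quickstart.cloudera:50070/webhdfs/v1/user?op=LISTSTATUS"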

Conclusion

It is very important for every enterprise to understand the authentication and authorization techniques available for Hadoop so they can compare them with their own internal security policies. Once they have that understanding they can carefully evaluate every Hadoop application introduced into the environment to make sure it implements Kerberos in a way that complies with those established guidelines. Many application vendors will cut corners with their Kerberos implementations, but informed customers should be able to quickly identify any shortcomings.
