Security of a Hadoop cluster

Hadoop is not only a data processing platform but also a data warehouse solution.

When we talk about large amounts of data, we are talking about business value for a company – the data is often the basis of its services.

As a Hadoop cluster consists of many nodes and network devices, all of this equipment must be included in the security plan – no weak links are allowed.

To secure our assets, the following tools and approaches can be used:

  • Planning and securing the network (e.g. usage of firewalls, securing all network peripherals)
  • Access control settings and strengthened authentication (e.g. using a jumpbox server)
  • Fine-grained file permissions and user/group access settings (see the sketch after this list)
  • Thorough logging
  • Securing the environment (OS and daemons outside of Hadoop)
    • Not only software and IT-system security but also physical access control must be taken into account
  • Documented disaster recovery plans and procedures
  • Data backup policy
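
As a minimal sketch of the fine-grained permission point above: the HDFS Java API lets you set POSIX-style permissions and ownership programmatically. The path, owner and group names below are hypothetical, and the same effect is usually achieved from the shell with hadoop fs -chmod / -chown:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsAction;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class TightenHdfsPermissions {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical sensitive dataset: owner rwx, group r-x, others none (chmod 750).
            Path sensitive = new Path("/data/payroll");
            fs.setPermission(sensitive,
                    new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));

            // Give it a dedicated owner and group instead of a shared account
            // (changing ownership requires HDFS superuser privileges).
            fs.setOwner(sensitive, "etl", "finance");

            fs.close();
        }
    }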

Securing Hadoop is a challenge of its own, because Hadoop was not built with security as a primary focus.

Some of the problems:

  • No fine-grained authentication out of the box – Kerberos is the only built-in mechanism
  • Data is not encrypted at rest on HDFS
  • Hadoop moves its data across the network in plaintext, so network sniffing is possible (see the configuration sketch after this list)
  • Hadoop is never used alone (security of other components)
  • Backup & recovery is not working as for smaller datasets
  • HDFS client spoofing is possible
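
The plaintext-transfer problem above can be mitigated with Hadoop's own configuration switches. A minimal sketch, assuming a secure-mode cluster; in practice these keys belong in core-site.xml and hdfs-site.xml on every node rather than in client code:

    import org.apache.hadoop.conf.Configuration;

    public class WireEncryptionSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // "privacy" requests the SASL auth-conf quality of protection, i.e.
            // Hadoop RPC traffic is authenticated, integrity-checked and encrypted.
            conf.set("hadoop.rpc.protection", "privacy");

            // Encrypt the HDFS block data transfer protocol between clients and
            // DataNodes; without this, block contents travel in plaintext.
            conf.setBoolean("dfs.encrypt.data.transfer", true);

            System.out.println("RPC protection: " + conf.get("hadoop.rpc.protection"));
        }
    }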

Hadoop has a secure mode: http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-common/SecureMode.html

Features:

  • Kerberos-based authentication of end users and of the Hadoop daemons themselves
  • Authentication for the web consoles
  • Service-level authorization
  • Data confidentiality options: RPC encryption, encrypted block data transfer, HTTPS for the web UIs
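
A minimal sketch of how a Java client authenticates against a secure-mode cluster; the principal and keytab path are hypothetical placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLogin {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Matches hadoop.security.authentication=kerberos in core-site.xml.
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);

            // Log in from a keytab so no interactive kinit is needed
            // (principal and keytab path are made up for this example).
            UserGroupInformation.loginUserFromKeytab(
                    "etl@EXAMPLE.COM", "/etc/security/keytabs/etl.keytab");

            System.out.println("Logged in as "
                    + UserGroupInformation.getCurrentUser().getUserName());
        }
    }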

Tools available:

  • Apache Knox – The Knox Gateway provides a single access point for all REST interactions with Hadoop clusters. In this capacity it can aid in the control, integration, monitoring and automation of critical administrative and analytical needs of the enterprise (see the access sketch after this list).
  • Project Rhino – Intel's open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem (protection for private information with limited performance impact) and to contribute the code back to Apache.
  • Apache Sentry – Sentry is a highly modular system for providing fine-grained, role-based authorization to both data and metadata stored on an Apache Hadoop cluster.
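
To illustrate the Knox item above: clients talk plain HTTPS to the gateway, which authenticates them itself (typically against LDAP) and proxies WebHDFS internally. A sketch with a made-up gateway host, topology name and credentials; the gateway's TLS certificate must be trusted by the JVM for this to work:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class KnoxWebHdfsList {
        public static void main(String[] args) throws Exception {
            // Made-up gateway host and topology ("default"); Knox exposes WebHDFS
            // under /gateway/<topology>/webhdfs/v1 over HTTPS.
            URL url = new URL(
                "https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();

            // Knox authenticates the caller itself (here HTTP Basic), so the
            // client never has to speak Kerberos directly.
            String creds = Base64.getEncoder().encodeToString(
                    "guest:guest-password".getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + creds);

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON FileStatuses listing
                }
            }
        }
    }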

The developer community is actively working on these problems to provide out-of-the-box solutions, e.g.:

  • Data encryption at rest (HADOOP-9331)
  • Token-based authentication (MAPREDUCE-5025)
  • HBase per-KeyValue security (HBASE-6222)
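
For the last item, a hedged sketch of what per-cell security looks like with the API that eventually landed in HBase 0.98; it requires the AccessController coprocessor to be enabled on the cluster, and the table, column and user names below are hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.security.access.Permission;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PerCellAcl {
        public static void main(String[] args) throws Exception {
            // Hypothetical table; assumes the AccessController coprocessor is on.
            HTable table = new HTable(HBaseConfiguration.create(), "salaries");

            Put put = new Put(Bytes.toBytes("row-1"));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("42000"));

            // Attach an ACL to this single cell: only user "hr_admin" may read it.
            put.setACL("hr_admin", new Permission(Permission.Action.READ));

            table.put(put);
            table.close();
        }
    }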