Why Governance and Security Are Better Together
How do you keep track of a large number of diverse data objects (think a hundred thousand data entities) in a data lake that continues to grow every day? Now that Apache Hadoop has become a critical component of your data architecture, how do you know with confidence which piece of data came from which source and how it changed over time? Moreover, how do you use this valuable information to secure your Hadoop ecosystem? The answer lies in data classification, or metadata: information that describes your data and includes data models, schemas, and attributes such as title, author, subject, tags, date created and description.
Hortonworks created the Data Governance Initiative in 2015 to address the need for an open source governance solution that manages organizational requirements for data classification, a centralized policy engine, data lineage, security and data lifecycle management.
Apache Atlas was launched as a result of this initiative. Hortonworks and its community partners have continued to deliver on the initiative’s original vision, and this week we are announcing new features for Apache Atlas, including integration with Apache Ranger and the introduction of cross-component lineage.
Atlas – Ranger Integration
Atlas provides data governance capabilities and serves as a common metadata store that is designed to exchange metadata both within and outside the Hadoop stack. Ranger provides a centralized user interface that can be used to define, administer and manage security policies consistently across all the components of the Hadoop stack. The Atlas/Ranger integration represents a paradigm shift for both Hadoop-based data governance and security, bringing together the data classification and metadata store capabilities of Atlas with fine-grained security enforcement in Ranger.
By integrating Atlas with Ranger, enterprises can now implement dynamic classification-based security policies in addition to role-based security. Ranger’s centralized platform lets data administrators define a security policy based on Atlas metadata tags or attributes and apply that policy in real time to the entire hierarchy of assets, including databases, tables and columns, thereby preventing violations from occurring.
Through the combination of Atlas and Ranger, companies can create a flexible security profile that meets the needs of data-driven enterprises. The access policies that can be constructed and enforced with this capability include:
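To make the idea of a policy applying across a hierarchy of assets concrete, here is a minimal Python sketch. It assumes a tag set on a database or table also counts for everything beneath it. The dotted `database.table.column` naming, the `effective_tags` helper and the sample tags are all hypothetical illustrations, not the Atlas or Ranger API.

```python
def effective_tags(asset, direct_tags):
    """Collect tags set on an asset and on each of its ancestors.

    `asset` is a dotted path such as "sales.customers.ssn"
    (database.table.column). This naming scheme is an assumption
    made for the sketch, not how Atlas identifies entities.
    """
    parts = asset.split(".")
    tags = set()
    # Walk from the database down to the asset itself, merging tags.
    for depth in range(1, len(parts) + 1):
        ancestor = ".".join(parts[:depth])
        tags |= direct_tags.get(ancestor, set())
    return tags


# Hypothetical tag assignments made by a data steward.
direct_tags = {
    "sales.customers": {"PII"},            # tag on a table
    "sales.customers.ssn": {"SENSITIVE"},  # tag on a single column
}
```

In this sketch, a policy written once for the "PII" tag would cover every column of `sales.customers` without listing them individually.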
1. Classification-based access controls: A data asset such as a table or column can be marked with a metadata tag related to compliance or business taxonomy (such as “PCI”). This tag is then used to assign permissions to a user or group. This is an evolution from role-based entitlements, which require a discrete and static one-to-one mapping between a user or group and resources such as tables or files. As an example, a data steward can create a classification tag “PII” and assign it to certain Hive tables or columns. By doing this, the data steward denotes that any data stored in those tables or columns has to be treated as “PII”. The data steward can then build a security policy in Ranger for this classification that allows certain groups or users to access the associated data while denying access to others. Any access to data classified as “PII” in Atlas is then automatically governed by the Ranger policy already defined.
2. Data expiry-based access policy: For certain business use cases, data can become toxic and have an expiration date for business usage. This use case can be addressed with Atlas and Ranger. Apache Atlas can assign an expiration date to a data tag. Ranger inherits the expiration date and automatically denies access to the tagged data after that date.
3. Location-specific access policies: Similar to time-based access policies, administrators can now customize entitlements based on geography. For example, a U.S.-based user might be granted access to data while she is in a domestic office but not while in Europe. Although the same user may be trying to access the same data, the different geographical context would apply, triggering a different set of privacy rules to be evaluated.
4. Prohibition against dataset combinations: With the Atlas/Ranger integration, it is now possible to define a security policy that restricts combining two data sets. For example, suppose one column contains customers’ account numbers and another contains customer names. These columns may be in compliance individually but pose a violation if combined in a query. Administrators can now apply a metadata tag to both data sets to prevent them from being combined, helping to avoid privacy violations.
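The four policy types above can be sketched as a single access check. The following Python is a toy model under stated assumptions: `TagPolicy`, `is_allowed`, the column names, the group names and the tags "CUSTOMER_NAME" and "ACCOUNT_NUMBER" are all hypothetical, and none of this reflects Ranger's actual policy schema or evaluation engine.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical tag policy; not Ranger's real policy schema."""
    allowed_groups: FrozenSet[str]                       # classification-based
    expires_on: Optional[date] = None                    # expiry-based
    allowed_countries: Optional[FrozenSet[str]] = None   # location-specific


def is_allowed(groups, columns, country, today,
               column_tags, policies, forbidden_combinations=()):
    """Return True only if every tag touched by the query permits access."""
    tags_in_query = set()
    for column in columns:
        for tag in column_tags.get(column, ()):
            tags_in_query.add(tag)
            policy = policies.get(tag)
            if policy is None:
                continue  # in this sketch, a tag without a policy is unrestricted
            if not (set(groups) & policy.allowed_groups):
                return False  # user is in no permitted group
            if policy.expires_on is not None and today > policy.expires_on:
                return False  # tagged data has expired
            if (policy.allowed_countries is not None
                    and country not in policy.allowed_countries):
                return False  # request comes from a disallowed location
    # Deny queries that bring together tags that must not be combined.
    for combination in forbidden_combinations:
        if set(combination) <= tags_in_query:
            return False
    return True


# Hypothetical tag assignments and policies.
column_tags = {
    "customers.ssn": {"PII"},
    "customers.name": {"CUSTOMER_NAME"},
    "accounts.number": {"ACCOUNT_NUMBER"},
}
policies = {
    "PII": TagPolicy(
        allowed_groups=frozenset({"compliance"}),
        expires_on=date(2016, 12, 31),
        allowed_countries=frozenset({"US"}),
    ),
}
forbidden = [("CUSTOMER_NAME", "ACCOUNT_NUMBER")]
```

For example, `is_allowed({"compliance"}, ["customers.ssn"], "US", date(2016, 6, 1), column_tags, policies, forbidden)` passes all checks, while the same request from another country, after the expiry date, or from a user outside the `compliance` group is denied. The point of the design is that the policy refers only to tags, so newly tagged columns are covered without editing the policy.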
Cross Component Lineage
Apache Atlas now provides the ability to visualize cross-component lineage, delivering a complete view of data movement across a number of analytic engines such as Apache Storm, Kafka, Falcon and Hive.
This functionality offers important benefits to data stewards and auditors. For example, Atlas can govern data that starts as event data flowing through a Kafka bolt and Storm topology, is then analyzed as an aggregated dataset in Hive, and is finally combined with reference data imported from an RDBMS via Sqoop. Data stewards, operations teams and compliance teams can now visualize a data set’s lineage and then drill down into operational, security and provenance-related details. Because this tracking is done at the platform level, any application that uses these engines is natively tracked. This extends visibility beyond a single application view.
Cross-component lineage is a significant step toward delivering enterprise governance rigor not only in Hadoop but across the enterprise data architecture. With metadata exchange through open REST APIs, Atlas could be an integral part of providing dataset lineage across the entire enterprise.
As companies deploy multi-tenant data lakes, this holistic ability to track and visualize lineage is essential to any data governance program.
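The Kafka, Storm, Hive and Sqoop example can be pictured as a small graph in which each dataset or process points to what it was derived from. The Python below is only a sketch of that idea: the entity names, the `UPSTREAM` mapping and both helper functions are hypothetical and are not how Atlas stores or queries lineage.

```python
from collections import deque

# Hypothetical lineage edges: each key lists the entities it was derived from.
UPSTREAM = {
    "hive.enriched_report": ["hive.aggregated_events", "hive.reference_data"],
    "hive.aggregated_events": ["storm.topology"],
    "storm.topology": ["kafka.events"],
    "hive.reference_data": ["sqoop.import"],
    "sqoop.import": ["rdbms.reference"],
}


def upstream_lineage(entity, upstream=UPSTREAM):
    """Breadth-first walk returning every ancestor of `entity`, nearest first."""
    seen = {entity}
    order = []
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def original_sources(entity, upstream=UPSTREAM):
    """Ancestors that have no recorded upstream, i.e. where the data entered."""
    return [e for e in upstream_lineage(entity, upstream) if not upstream.get(e)]
```

Calling `original_sources("hive.enriched_report")` on this sample graph traces the report back through both engines to the Kafka event stream and the RDBMS reference table, which is the kind of question a lineage view answers for an auditor.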
Where to get these features?
The Atlas-Ranger integration and cross-component lineage are available as a public preview, starting today, in the form of a packaged VM image. You can download the VM from here. You can access the Tag Based Policies or Cross Component Lineage tutorials to test the new Atlas features in the VM.