Azure Databricks Best Practices for Security Teams

Written by Imran Abdul Rauf

Technical Content Writer

Over 91% of CEOs and business owners state that business transformation is one of the biggest drivers for investing in big data. However, with significant investment, potential, and usage come serious concerns about security.

This blog introduces Azure Databricks as a fast, collaborative analytics platform and walks through the Azure Databricks best practices for security.

Azure Databricks—a brief introduction

Azure Databricks is an Apache Spark-based analytics platform built on open-source technologies such as MLflow, Delta Lake, Redash, and Koalas. The platform is a first-party PaaS and part of the Microsoft Azure cloud, offering easy, one-click setup, native integrations with other Azure services, interactive workspaces, and enterprise-grade security to enable data and AI use cases for different customer segments.

The primary purpose of Azure Databricks is to facilitate user-friendly collaboration between data scientists, data engineers, data analysts, and cloud engineering experts. The following are the Azure Databricks best practices enterprises use to secure their network infrastructure.

Azure Databricks best practices for security

Unlock the true potential of your data lake

As enterprises break down data silos, they extract data from different sources and move it into a data lake, where business analysts, data scientists, and data engineers can process and query it. Although this solves the challenge of making data available, a new challenge arises: securing and isolating different classes of data from unauthorized users.

Databricks employs cloud-native security controls by unifying the data in an open format. The platform integrates with cloud identity and access management (Azure Active Directory on Azure, IAM on AWS), key management services for data encryption, network security groups for firewalling, and token services for access tokens. This gives teams more control over their trust anchors and lets them centralize access control policies in a single location and apply them to Databricks easily. Other methods, illustrated with a short access-control sketch after the list, include:

  • Isolating environments into separate cloud workspaces
  • Removing or masking PII data
  • Enforcing firm access control and encryption
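
For illustration, here is a minimal sketch of isolating a PII-bearing table behind a sanitized view with SQL GRANTs run from a Databricks notebook. The schema, table, and group names are hypothetical placeholders, and it assumes a workspace where table access control (or Unity Catalog) is enabled:

```python
# Minimal sketch, run from a Databricks notebook where `spark` is predefined.
# Schema, table, and group names below are hypothetical placeholders.

# Restrict the raw table, which may contain PII, to a trusted group only.
spark.sql("GRANT SELECT ON TABLE finance.raw_transactions TO `data-engineers`")

# Expose a sanitized view (PII columns removed) to the wider analyst group.
spark.sql("""
    CREATE OR REPLACE VIEW finance.transactions_clean AS
    SELECT transaction_id, amount, transaction_date  -- no customer PII columns
    FROM finance.raw_transactions
""")

# Views share the table namespace, so the grant uses the same TABLE form here.
spark.sql("GRANT SELECT ON TABLE finance.transactions_clean TO `data-analysts`")
```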

Bring Your Own Network

Azure Databricks is a managed service composed of two major components: the Control Plane and the Data Plane. Microsoft handles the platform architecture, the data plane VNET, and the network security group, even though the latter two are added to the customer's subscription. Although the default deployment mode suits many organizational customers, some want more control over their network configuration to adhere to governance policies and external guidelines. Common network customizations include:

  • Connecting Azure Databricks clusters to data sources deployed in on-premises data centers
  • Connecting Databricks clusters to other Azure data services via Azure service endpoints
  • Restricting outbound traffic to specific Azure data services and/or external endpoints only
  • Configuring Databricks clusters to use custom DNS

One of the Azure Databricks best practices, and a key to understanding the platform architecture, the Bring Your Own VNET feature makes the above and other activities possible. The feature enables users to deploy the data plane in their self-managed VNET, and the customer then also manages the Databricks workspace NSG. In addition, the platform applies the required inbound and outbound NSG rules through a Network Intent Policy.

These intent policies validate secure, bidirectional communication with the control plane.
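
As a concrete illustration, here is a minimal sketch of a VNET-injection deployment using the azure-mgmt-databricks Python SDK. The subscription ID, resource groups, VNET, and subnet names are hypothetical placeholders, and the VNET with its two delegated subnets and NSG is assumed to exist already:

```python
# Minimal sketch of a Bring Your Own VNET ("VNET injection") deployment.
# All IDs and names are placeholders; this is not a production template.
from azure.identity import DefaultAzureCredential
from azure.mgmt.databricks import AzureDatabricksManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
client = AzureDatabricksManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

vnet_id = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/network-rg"
    "/providers/Microsoft.Network/virtualNetworks/dbx-vnet"
)

poller = client.workspaces.begin_create_or_update(
    resource_group_name="analytics-rg",
    workspace_name="secure-dbx-workspace",
    parameters={
        "location": "eastus",
        "sku": {"name": "premium"},
        # Databricks-managed resources still live in a locked resource group.
        "managed_resource_group_id": (
            f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/dbx-managed-rg"
        ),
        "parameters": {
            # Point the data plane at the self-managed VNET and its two subnets.
            "custom_virtual_network_id": {"value": vnet_id},
            "custom_public_subnet_name": {"value": "dbx-public-subnet"},
            "custom_private_subnet_name": {"value": "dbx-private-subnet"},
        },
    },
)
workspace = poller.result()  # block until the deployment completes
print(workspace.provisioning_state)
```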

Facilitate secure cluster connectivity

As mentioned earlier, Azure Databricks is a managed application that lets users benefit from improved security capabilities. Secure Cluster Connectivity helps IT and data management teams in the following ways:

  • No public IPs: There is no risk of direct public access because the nodes in the workspace clusters have no public IP addresses.
  • No open inbound ports: All access originates from a cluster in the data plane, either outbound or internal to the cluster. Outbound access relies on connectivity to the Secure Cluster Connectivity relay in the control plane, which permits running customer workloads and carrying out all cluster administration tasks.
  • Improved reliability and scalability: The data platform gains the reliability and scalability needed to handle large and extra-large workloads because it no longer depends on launching one public IP per cluster node.

You can deploy a workspace with Secure Cluster Connectivity in either Managed VNET or VNET injection (commonly termed Bring Your Own VNET) mode, as sketched below.
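
Building on the deployment sketch above, Secure Cluster Connectivity is requested through one additional custom parameter at workspace-creation time. The field names follow the azure-mgmt-databricks WorkspaceCustomParameters model, and the values are placeholders:

```python
# Minimal sketch: the "parameters" block of the workspace payload from the
# earlier VNET-injection example, extended with Secure Cluster Connectivity.
vnet_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/"
    "network-rg/providers/Microsoft.Network/virtualNetworks/dbx-vnet"
)  # placeholder resource ID

scc_parameters = {
    "custom_virtual_network_id": {"value": vnet_id},
    "custom_public_subnet_name": {"value": "dbx-public-subnet"},
    "custom_private_subnet_name": {"value": "dbx-private-subnet"},
    # Secure Cluster Connectivity: no public IP addresses on any cluster node.
    "enable_no_public_ip": {"value": True},
}
# Pass scc_parameters as the "parameters" field of the workspace payload above.
```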

See which networks are authorized for workspace access

Data-driven businesses that use SaaS cloud tools are particularly concerned about security lapses and need to control access for their own workforce. Although authentication validates the user's identity, it doesn't establish the network location the user is connecting from. Hence, accessing a cloud service from an unknown, insecure network can put enterprise security protocols at risk. For example, suppose an administrative employee accesses an Azure Databricks workspace. If the person connects from outside an approved location, the company can block connections to the workspace even if the user has the proper credentials for the REST API or web application.

Users can configure Databricks workspaces so employees can connect only through the existing secure corporate perimeter. Employees working remotely or traveling can reach the official network through a VPN, which in turn opens access to the workspace. Data professionals use the IP access lists feature in Azure Databricks to define a set of approved IP addresses. The feature is also flexible: workspace administrators can specify IP addresses and subnets and use the REST API to update and manage the allow list, as sketched below.
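
Here is a rough sketch of managing the allow list through the Databricks REST API. The workspace URL, token, and IP ranges are hypothetical placeholders:

```python
# Minimal sketch of IP access lists via the Databricks REST API.
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder

# 1. Enable the IP access lists feature for the workspace.
requests.patch(
    f"{HOST}/api/2.0/workspace-conf",
    headers=HEADERS,
    json={"enableIpAccessLists": "true"},
).raise_for_status()

# 2. Allow connections only from the corporate perimeter and the VPN egress range.
resp = requests.post(
    f"{HOST}/api/2.0/ip-access-lists",
    headers=HEADERS,
    json={
        "label": "corporate-and-vpn",
        "list_type": "ALLOW",
        "ip_addresses": ["203.0.113.0/24", "198.51.100.17"],  # documentation ranges
    },
)
resp.raise_for_status()
print(resp.json())
```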

Verify through audit logs

As businesses continuously update their data ecosystem to make better-informed decisions, data workloads grow at a tremendous rate. Cloud data lakes become the focal point for the different functions, tools, and technologies extracting valuable insights. Unfortunately, adding more teams, data, and users means more threats to business data and security.

Related content: Data Lake vs. Data Warehouse—Which One to Choose?

Databricks helps tackle this challenge by granting users a rich set of controls and balancing broad access to data across all hierarchy levels. The platform provides visibility through:

  • Audit Logs, including Workspace Access Control and Workspace Budget Control
  • Cloud Provider Infrastructure Logs, such as Data Access Security Controls and Data Exfiltration Controls

Databricks also documents what needs to be audited; tracking workspace activity helps security and administration teams gain the insights they need into who is accessing which data and stay compliant with enterprise governance rules. A minimal sketch of querying these logs follows.
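
The sketch below inspects audit logs from a notebook, assuming Azure diagnostic settings have routed the logs to a storage container the cluster can read. The storage path is a placeholder, and the column names follow the documented Azure diagnostic-log schema, though your delivery layout may differ:

```python
# Minimal sketch: query recent workspace audit events with PySpark.
from pyspark.sql import functions as F

# Placeholder path; Azure diagnostic settings deliver audit logs as JSON files.
logs = spark.read.json(
    "abfss://insights-logs-workspace@auditstorage.dfs.core.windows.net/"
)

# Which identities performed which workspace operations in the last 7 days?
recent = logs.where(F.to_date("time") >= F.date_sub(F.current_date(), 7))
(recent
 .select("time", "identity", "operationName", "category")
 .orderBy(F.col("time").desc())
 .show(truncate=False))
```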

Cluster policies

Cluster policies restrict the ability to configure clusters through rules that limit the attributes available during cluster creation. Teams can use cluster policies to enforce specific cluster settings, including instance types, attached libraries, number of nodes, and compute cost, and to present different cluster-creation interfaces to different user levels.

A small enterprise might work with a single cluster policy for all clusters, while a large organization might need more complex policies, as sketched after the list. For instance:

  • Customer data analysts accustomed to working on large data sets and complex scenarios might be permitted clusters of up to a hundred nodes.
  • An HR team that works with smaller, simpler datasets and notebooks might get autoscaling clusters of 4 to 8 nodes.
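
These two example policies could be expressed roughly as follows through the Databricks Cluster Policies API; the workspace URL, token, and policy names are hypothetical placeholders:

```python
# Minimal sketch: create the two cluster policies described above.
import json
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder

# Analysts: large jobs allowed, but capped at 100 nodes.
analyst_policy = {
    "autoscale.max_workers": {"type": "range", "maxValue": 100},
}

# HR: small autoscaling clusters only, 4 to 8 nodes.
hr_policy = {
    "autoscale.min_workers": {"type": "fixed", "value": 4},
    "autoscale.max_workers": {"type": "range", "minValue": 4, "maxValue": 8},
}

for name, definition in [("analyst-large", analyst_policy), ("hr-small", hr_policy)]:
    resp = requests.post(
        f"{HOST}/api/2.0/policies/clusters/create",
        headers=HEADERS,
        # The API expects the policy definition as a JSON-encoded string.
        json={"name": name, "definition": json.dumps(definition)},
    )
    resp.raise_for_status()
    print(name, "->", resp.json()["policy_id"])
```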

Takeaways

The Azure Databricks best practices for security encourage users to unlock the true potential of the data lake, bring their own VNET, enable secure cluster connectivity, control which networks are authorized for workspace access, verify activity through audit logs, and use cluster policies. Moreover, users can leverage other security measures, such as enabling customer-managed keys for DBFS, simplifying data lake access, authenticating through Azure Active Directory tokens, and managing tokens.

How can Royal Cyber help?

Royal Cyber is a Databricks partner that helps businesses unlock the true potential of their data and make smart decisions across their data teams, security practices, and cloud infrastructure and analytics.
