Azure Databricks Best Practices for Security Teams

Written by Hafsa Mustafa

Technical Content Writer

June 6, 2023

Azure Databricks is a unified analytics platform that forms part of the Azure cloud. It is built on Apache Spark, Delta Lake, MLflow, Redash, and Koalas. As a first-party PaaS, it offers one-click setup, native integrations with other Azure cloud services, collaborative workspaces, and enterprise-level security to enable diverse data and AI use cases for different customer segments.

The primary purpose of Azure Databricks is to facilitate collaboration between data scientists, data engineers, data analysts, and cloud engineering experts. This blog discusses Azure Databricks as a fast, collaborative analytics platform and the best practices for strengthening its security.

Azure Databricks Security Best Practices

The following are the Azure Databricks best practices that enterprises use to secure their network infrastructure.

Put Cloud-native Security Controls to Use

Databricks relies on cloud-native controls to strengthen security while keeping data unified in an open format. The platform integrates with the cloud provider's own services: IAM and Azure Active Directory (AAD) for identity, a key management service (KMS) for data encryption, security groups for firewalls, and a security token service (STS) for access tokens.

This allows data teams to exercise more control over their trust anchors, centralize their access control policies in a single location, and extend them to Databricks easily.

Beyond these built-in measures, you can fortify your security further by:

  • Isolating environments, for example by using separate workspaces
  • Removing PII
  • Enforcing strict access control and encryption

As enterprises try to break down data silos, they extract data from different sources and move it into a data lake, where business analysts, data scientists, and data engineers can process and query it. Although this helps users get reliable data in time, it creates a new challenge: securing the various data classes and isolating them from unauthorized users. The tactics listed above can help you address these security issues.

Utilize the “Bring Your Own VNETs” Feature

One of the Azure Databricks best practices, and a key to understanding the platform architecture, is the Bring Your Own VNET feature. This feature enables users to deploy the data plane in their own self-managed VNET. With it, the customer can manage the Network Security Group (NSG) of the Databricks workspace. The platform relies on a set of inbound and outbound NSG rules enforced through a Network Intent Policy. The intent policies are responsible for ensuring secure, bidirectional communication with the management plane.

Azure Databricks is a managed service that comprises two major components: the Control Plane and the Data Plane. In the default deployment, Microsoft manages the platform architecture, the data plane VNET, and the network security group. Although the default deployment mode suits many enterprise customers, some customers want more control over their service network configuration, for example to:

  • Connect Azure Databricks clusters to data sources deployed in on-premises data centers
  • Connect Databricks clusters to other Azure data services via Azure Service Endpoints
  • Restrict outbound traffic to specific Azure data services and/or external endpoints only
  • Configure Databricks clusters to use custom DNS
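
To make the VNET option more concrete, here is a minimal sketch of the parameter block that a VNET-injected workspace deployment typically carries in an Azure Resource Manager template. The parameter names (customVirtualNetworkId, customPublicSubnetName, customPrivateSubnetName) reflect the workspace template schema as we understand it and should be checked against the current documentation. The helper function, resource ID, and subnet names are hypothetical placeholders.

```python
import json


def vnet_injection_parameters(vnet_id: str, public_subnet: str, private_subnet: str) -> dict:
    """Build the workspace 'parameters' block for VNET injection (illustrative helper)."""
    if not vnet_id.startswith("/subscriptions/"):
        raise ValueError("vnet_id should be a full Azure resource ID")
    if public_subnet == private_subnet:
        raise ValueError("the public and private subnets must be different subnets")
    return {
        "customVirtualNetworkId": {"value": vnet_id},
        "customPublicSubnetName": {"value": public_subnet},
        "customPrivateSubnetName": {"value": private_subnet},
    }


# Hypothetical resource ID and subnet names, for illustration only.
params = vnet_injection_parameters(
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-data"
    "/providers/Microsoft.Network/virtualNetworks/vnet-databricks",
    public_subnet="snet-databricks-public",
    private_subnet="snet-databricks-private",
)
print(json.dumps(params, indent=2))
```

Validating the inputs before the template is submitted catches the most common mistakes (a short VNET name instead of a resource ID, or the same subnet used twice) without a failed deployment.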

Facilitate Secure Cluster Connectivity

Secure Cluster Connectivity helps IT and data management teams in the following ways:

  • No public IPs: Cluster nodes in the workspace have no public IP addresses, so there is no risk of direct public access.
  • No open inbound ports: All access is initiated from the cluster in the data plane and is either outbound or internal to the cluster. The outbound access includes the connection to the Secure Cluster Connectivity Relay hosted in the control plane, which is used to run customer workloads and carry all cluster administration tasks.
  • Improved reliability and scalability: The data platform becomes more reliable and scalable for large and extra-large workloads because it no longer depends on launching as many public IPs as there are cluster nodes.

You can deploy a workspace with Secure Cluster Connectivity in either Managed-VNET or VNET Injection (commonly termed Bring Your Own VNET) mode.
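
As a rough illustration, a security team could review its workspace definitions to confirm that Secure Cluster Connectivity is switched on everywhere. The sketch below assumes the workspace parameters are available as plain dictionaries and that the setting is exposed as an enableNoPublicIp parameter, which is our understanding of the Azure Resource Manager schema. The workspace names and the helper function are hypothetical.

```python
def workspaces_without_scc(workspaces: dict) -> list:
    """Return the names of workspaces whose parameters do not enable 'no public IP'."""
    missing = []
    for name, params in workspaces.items():
        # Treat an absent parameter the same as an explicit False.
        flag = params.get("enableNoPublicIp", {}).get("value", False)
        if flag is not True:
            missing.append(name)
    return sorted(missing)


# Hypothetical inventory of workspace parameter blocks.
workspaces = {
    "ws-analytics": {"enableNoPublicIp": {"value": True}},
    "ws-sandbox": {"enableNoPublicIp": {"value": False}},
    "ws-legacy": {},
}

print(workspaces_without_scc(workspaces))  # prints ['ws-legacy', 'ws-sandbox']
```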

Specify Which Networks Are Authorized for Workspace Access

A data-driven business that uses SaaS cloud tools is particularly concerned about security lapses and needs to control access for its own workforce. Although authentication validates the user's identity, it does not verify the network location the user connects from. Hence, accessing a cloud service from an unknown, insecure network can put enterprise security at risk.

For example, suppose an administrative employee accesses an Azure Databricks workspace. If the person leaves the approved location and connects from an untrusted network, the company can block the connection to the workspace even if the user has the proper credentials for accessing the REST API or web application.

Administrators can configure Databricks workspaces so that employees can connect only through the existing corporate network perimeter. Employees who work remotely or travel can connect to the corporate network through a VPN, which in turn gives them access to the workspace.

Data professionals can use the IP access lists feature in Azure Databricks to define a set of approved IP addresses. The feature is flexible: workspace administrators can specify individual IP addresses and subnets, and use the REST API to update and manage the list of approved addresses.
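
The sketch below shows one way to prepare and sanity-check an allow list before submitting it. The payload fields (label, list_type, ip_addresses) follow the IP access lists REST API as we understand it, where such a body would be posted to /api/2.0/ip-access-lists. The helper functions, the label, and the address ranges (taken from the reserved documentation ranges) are assumptions for illustration. The code makes no network calls and only mimics the allow-list check locally.

```python
import ipaddress


def build_ip_access_list(label: str, entries: list, list_type: str = "ALLOW") -> dict:
    """Validate each IP or CIDR entry and return a request body for the access list."""
    for entry in entries:
        # Raises ValueError if the entry is not a valid address or subnet.
        ipaddress.ip_network(entry, strict=False)
    return {"label": label, "list_type": list_type, "ip_addresses": list(entries)}


def is_allowed(source_ip: str, payload: dict) -> bool:
    """Local approximation: is source_ip inside any approved address or subnet?"""
    address = ipaddress.ip_address(source_ip)
    return any(
        address in ipaddress.ip_network(entry, strict=False)
        for entry in payload["ip_addresses"]
    )


# Hypothetical corporate egress range and a single VPN gateway address.
payload = build_ip_access_list("corp-office-and-vpn", ["203.0.113.0/24", "198.51.100.7"])

print(is_allowed("203.0.113.42", payload))  # inside the office subnet
print(is_allowed("192.0.2.10", payload))    # outside every approved range
```

Checking a few known addresses against the list before applying it helps administrators avoid locking themselves out of the workspace.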

Verify Through Audit Logs

As businesses continuously update their data ecosystems to make better-informed decisions, their data workloads grow at a tremendous rate. Cloud data lakes become the focal point for building functions, tools, and technologies and for extracting valuable insights. Unfortunately, adding more teams, data, and users also multiplies the threats to business data and security.

Databricks helps tackle this challenge by giving administrators a range of controls that balance broad access to data with oversight at every level of the hierarchy. The platform provides visibility through:

  • Audit Logs, including Workspace Access Control and Workspace Budget Control
  • Cloud Provider Infrastructure Logs like Data Access Security Controls and Data Exfiltration Controls
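
As a small illustration of how audit logs support verification, the sketch below counts denied requests per user from a handful of log records. The record shape (serviceName, actionName, userIdentity, sourceIPAddress, response.statusCode) is a simplified assumption modelled on typical Databricks audit events. The exact schema depends on how diagnostic logs are delivered in your environment, and the sample records are invented for the example.

```python
from collections import Counter

# Invented sample records in a simplified, assumed audit-event shape.
events = [
    {"serviceName": "accounts", "actionName": "login",
     "userIdentity": {"email": "alice@example.com"},
     "sourceIPAddress": "203.0.113.42", "response": {"statusCode": 200}},
    {"serviceName": "clusters", "actionName": "create",
     "userIdentity": {"email": "mallory@example.com"},
     "sourceIPAddress": "192.0.2.10", "response": {"statusCode": 403}},
    {"serviceName": "secrets", "actionName": "getSecret",
     "userIdentity": {"email": "mallory@example.com"},
     "sourceIPAddress": "192.0.2.10", "response": {"statusCode": 403}},
    {"serviceName": "accounts", "actionName": "tokenLogin",
     "userIdentity": {"email": "bob@example.com"},
     "sourceIPAddress": "198.51.100.7", "response": {"statusCode": 401}},
]


def denied_requests(records: list) -> list:
    """Keep only events rejected as unauthenticated (401) or unauthorized (403)."""
    return [r for r in records if r["response"]["statusCode"] in (401, 403)]


def denied_by_user(records: list) -> Counter:
    """Count denied requests per user so repeated failures stand out."""
    return Counter(r["userIdentity"]["email"] for r in denied_requests(records))


for email, count in denied_by_user(events).most_common():
    print(email, count)
```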

Use Cluster Policies

Cluster policies limit how clusters can be configured by applying rules to the attributes available at cluster creation. Teams can use cluster policies to enforce specific cluster settings, such as instance types, attached libraries, number of nodes, and compute cost, and to present different cluster creation interfaces to different user levels.

A small enterprise might work with a single cluster policy for all clusters, while a large organization might need more complex policies. For instance:

  • Customer data analysts accustomed to working on large data sets and complex scenarios might be permitted to work on clusters consisting of up to a hundred nodes.
  • An HR team that works with smaller, simpler datasets and notebooks might be limited to autoscaling clusters ranging from 4 to 8 nodes.
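
The two examples above could be expressed roughly as the policy definitions below. The attribute paths and the "range" rule type follow the cluster policy definition format as we understand it. The policy contents themselves and the small violations helper are hypothetical: the helper is a local approximation written for this illustration, not how Databricks enforces policies.

```python
# Analysts: fixed-size clusters of up to 100 worker nodes.
analyst_policy = {
    "num_workers": {"type": "range", "maxValue": 100},
}

# HR: autoscaling clusters that stay between 4 and 8 worker nodes.
hr_policy = {
    "autoscale.min_workers": {"type": "range", "minValue": 4, "maxValue": 8},
    "autoscale.max_workers": {"type": "range", "minValue": 4, "maxValue": 8},
}


def get_path(spec: dict, dotted: str):
    """Look up a dotted attribute path such as 'autoscale.max_workers' in a cluster spec."""
    node = spec
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def violations(policy: dict, cluster_spec: dict) -> list:
    """List the range rules that a requested cluster spec would break (local check only)."""
    problems = []
    for attribute, rule in policy.items():
        value = get_path(cluster_spec, attribute)
        if value is None or rule["type"] != "range":
            continue  # attributes absent from the request are skipped in this sketch
        if "minValue" in rule and value < rule["minValue"]:
            problems.append(f"{attribute}={value} is below {rule['minValue']}")
        if "maxValue" in rule and value > rule["maxValue"]:
            problems.append(f"{attribute}={value} is above {rule['maxValue']}")
    return problems


print(violations(hr_policy, {"autoscale": {"min_workers": 4, "max_workers": 8}}))
print(violations(hr_policy, {"autoscale": {"min_workers": 2, "max_workers": 16}}))
print(violations(analyst_policy, {"num_workers": 120}))
```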

Takeaways

These Azure Databricks security best practices help users unlock the real potential of the data lake without compromising security: put cloud-native security controls to use, bring your own VNET, enable secure cluster connectivity, specify which networks are authorized for workspace access, verify through audit logs, and use cluster policies. Moreover, users can leverage several other security measures, such as enabling customer-managed keys for DBFS, simplifying data lake access, authenticating through Azure Active Directory tokens, and managing tokens.

When it comes to data governance, Azure Databricks offers Unity Catalog and Delta Sharing, which provide a central place for auditing data access and a secure way to share data with other organizations.

How can Royal Cyber help?

Royal Cyber is a Databricks partner that helps businesses unlock the true potential of their data and make smarter decisions for their data teams, security personnel, and cloud infrastructure. Our team enables businesses to use their data assets for analytics. If you have any questions on the subject, feel free to contact our data analytics experts.
