Unity Catalog is a unified governance solution for data and AI on the Databricks platform.
What we have done so far:
Workspace Creation (Databricks): You've established your collaborative environment within the Databricks platform. This is where you and your team can develop and run data engineering, data science, and machine learning workflows.
Notebook Usage: You've utilized Databricks notebooks as your primary interface for writing and executing code. Notebooks are interactive environments that allow you to combine code (Python, Scala, SQL, R), visualizations, and documentation.
Cluster Configuration (Spark Processing Engine): You've set up the computational resources needed to process your data using Apache Spark. This involves defining the size and configuration of your Spark clusters, which will distribute the workload across multiple nodes for parallel processing.
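As a rough illustration (not tied to any particular cluster size), a single notebook cell is enough to let Spark spread work across the workers; `spark` is the SparkSession that Databricks provides in every notebook:

```python
# Minimal sketch: Spark parallelism from a Databricks notebook.
# `spark` is the SparkSession injected into every Databricks notebook.
df = spark.range(0, 100_000_000)              # a large distributed dataset of ids
print(df.rdd.getNumPartitions())              # partitions = units of parallel work
df.selectExpr("sum(id) AS total").show()      # the aggregation runs across the cluster nodes
```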
Output Storage: You've persisted the results of your Spark processing in different forms within a storage account (likely Azure Data Lake Storage Gen2 or a similar service, given the context of Databricks on Azure):
- Managed Tables: These tables are fully managed by Databricks. The data and metadata are stored in the default managed location associated with your Databricks workspace or Unity Catalog metastore. When you drop a managed table, both the schema and the underlying data are deleted. These tables typically use the Delta Lake format by default.
- External Tables: These tables have their metadata managed by Databricks, but the underlying data resides in a location you specify in your storage account. When you drop an external table, only the metadata is removed; the data in the storage account remains untouched. External tables can be created on various file formats.
- File Format: You've also stored data directly in your storage account in a specific file format (e.g., CSV, JSON, Parquet, or Delta files not associated with an external table definition). This provides raw data storage that can be accessed by various tools and services. A sketch of all three options appears after this list.
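A rough sketch of the three options follows; the table names, container, and storage account are placeholders, and the cluster is assumed to already have access to the storage account:

```python
# Sketch only: placeholder names; `spark` is the notebook's SparkSession.
df = spark.createDataFrame(
    [(1, "widget", 9.99), (2, "gadget", 19.99)],
    ["id", "product", "price"],
)

# 1) Managed table: Databricks owns metadata AND data (Delta by default);
#    DROP TABLE removes both.
df.write.mode("overwrite").saveAsTable("sales_managed")

# 2) External table: metadata in Databricks, data at a path you control;
#    DROP TABLE removes only the metadata.
ext_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/sales_external"
df.write.mode("overwrite").option("path", ext_path).saveAsTable("sales_external")

# 3) Plain files: no table definition, just data written in a chosen format.
df.write.mode("overwrite").parquet(
    "abfss://raw@mystorageaccount.dfs.core.windows.net/sales_files"
)
```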
Now we have to establish a secure connection between Databricks and the storage account, and we also have to manage secure compute clusters and Databricks notebooks. First, we need to connect the workspace to the storage account securely. One option is the Azure Data Lake Storage Gen2 access keys, which is not recommended because it effectively exposes the entire storage account. Another is a shared access signature (SAS), which is also not recommended here, because it is mostly intended for third parties who need limited access.
The third option, Managed Identity, is the recommended approach for connecting securely. Access is granted through IAM (Access control), which is available on every resource inside Azure.
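For contrast, this is roughly what the access-key approach looks like in a notebook (the storage account name and the secret scope/key are placeholders); it works, but it effectively hands the whole account to the cluster, which is why Managed Identity is preferred:

```python
# Not recommended: access-key auth exposes the entire storage account.
# "mystorageaccount" and the secret scope/key names are placeholders.
spark.conf.set(
    "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-account-key"),
)

# Once the key is set, any container in that account is reachable from the cluster.
display(dbutils.fs.ls("abfss://raw@mystorageaccount.dfs.core.windows.net/"))
```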
Microsoft Entra ID is also available; it is created automatically along with the Azure account.
Through Entra ID we can track everything users are doing within the Azure account. You can access these logs and reports through the Microsoft Entra admin center (or the Azure portal). Navigate to Entra ID > Monitoring & health. Here you'll find options for:
- Audit logs
- Sign-in logs
- Provisioning logs
- Usage and insights
- Diagnostic settings (to configure where logs are sent for long-term storage and analysis)
By regularly reviewing and analyzing these logs, you can gain valuable insights into user behavior, identify potential security risks, troubleshoot issues, and ensure compliance within your Azure environment.
Because Databricks is not a native Azure service, the Databricks workspace and the storage account were previously connected using a service principal, which is available inside Microsoft Entra ID. Later, Unity Catalog comes into the picture: it not only connects the storage and Databricks securely, but also manages all the resources inside Databricks securely. A service principal can authenticate in one of two ways:
- Certificates: You can upload your own X.509 certificate or have Azure generate one. Certificate-based authentication is often considered more secure for long-lived applications.
- Client Secrets: These are password-like strings that Azure generates. They have an expiration date and need to be managed carefully.
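As an illustration, connecting with a service principal and a client secret typically looks like the following notebook cell (the storage account, secret scope, application ID, and tenant ID are all placeholders):

```python
# Service-principal (OAuth) access to ADLS Gen2; every name and ID below is a placeholder.
storage_account = "mystorageaccount"
client_secret = dbutils.secrets.get(scope="my-scope", key="sp-client-secret")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               "<application-client-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
```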
Once the storage account is connected with Databricks, Unity Catalog by default creates one storage account for monitoring purposes.
Databricks Workspace Storage:
- When you create an Azure Databricks workspace, Azure automatically provisions a storage account that is used internally by Databricks. This storage account is where the Databricks File System (DBFS) root is located.
- DBFS is a distributed file system mounted into your Databricks workspace and available on Databricks clusters. It's used for various purposes, including storing notebooks, libraries, experiment results, and data.
- This workspace storage account is typically separate from any storage accounts you might connect to explicitly for your data using Unity Catalog or access connectors.
- The naming convention for this storage account often includes the workspace name and some unique identifiers.
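You can browse this internal file system from any notebook, for example:

```python
# List the DBFS root, which lives in the workspace-managed storage account.
display(dbutils.fs.ls("dbfs:/"))
```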
The one that is already there is the default catalog; to create a new one, click the Create button.
Enter the resource group and the project name in the Name field under Instance details, then select Review + create.
Go to Catalog in the workspace and open the External Data tab.
Give the credential name and the access connector ID, then click Create.
Give the external location name and the URL (copy and paste the URL into Notepad and replace the container and storage account names), select the required storage credential, then click Create.
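If you prefer to script this step instead of clicking through the UI, the same external location can also be created from a notebook (the location, credential, container, and storage account names below are placeholders, and the storage credential is assumed to be registered already):

```python
# SQL equivalent of the "External Data" UI step; all names are placeholders.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS pricing_analysis_loc
    URL 'abfss://pricing-analysis@mystorageaccount.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL my_access_connector_cred)
""")

# Confirm what the workspace can now reach.
display(spark.sql("SHOW EXTERNAL LOCATIONS"))
```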
Step 6:
Go to Catalog in the workspace, click +, give the catalog the project name (pricing_analysis), and select the storage location that was created (pricing_analysis), which is dedicated to this new catalog.
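The same step can also be scripted from a notebook; the managed-location URL below is a placeholder that should match the container you dedicated to this catalog:

```python
# Create the catalog and point its managed (default) storage at the dedicated location.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS pricing_analysis
    MANAGED LOCATION 'abfss://pricing-analysis@mystorageaccount.dfs.core.windows.net/'
""")

# Make it the default catalog for the rest of the notebook.
spark.sql("USE CATALOG pricing_analysis")
```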
What are the advantages of configuring Unity Catalog?
Centralized Data Governance
Provides Fine-Grained Access Control on Data and Data Assets (see the example after this list)
Can be used across multiple Databricks workspaces in the same region
Data Lineage and Metadata captured at various levels
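As a quick sketch of what fine-grained access control looks like in practice (the schema, table, and group names are placeholders):

```python
# Grant a group just enough access to query one table; all names are placeholders.
spark.sql("GRANT USE CATALOG ON CATALOG pricing_analysis TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA pricing_analysis.silver TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE pricing_analysis.silver.products TO `data_analysts`")

# Review the permissions that now exist on the table.
display(spark.sql("SHOW GRANTS ON TABLE pricing_analysis.silver.products"))
```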
What is a metastore in Unity Catalog?
The metastore is the underlying storage layer for all of the objects and data assets used by Unity Catalog.
Explain how to configure Unity Catalog?
Just explain all of the steps we have done in this module:
Create a "Databricks Access Connector" resource
Give "Storage Blob Data Contributor" access to the Databricks Access Connector on the Azure Data Lake Storage account
Register the "Databricks Access Connector" properties inside the workspace Catalog
Link storage account containers inside the workspace Catalog
Create and link a storage account container to store the underlying data for all data objects created in Unity Catalog
Create a new catalog and link its own storage location
To use them in code, always follow three-level names: catalog.schema.table_name.
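For example (the schema and table names are placeholders):

```python
# Always qualify objects with catalog.schema.table when Unity Catalog is in use.
spark.sql("CREATE SCHEMA IF NOT EXISTS pricing_analysis.bronze")

df = spark.createDataFrame([(1, 9.99)], ["product_id", "price"])
df.write.mode("overwrite").saveAsTable("pricing_analysis.bronze.prices")

display(spark.table("pricing_analysis.bronze.prices"))
```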