
Unity Catalog in Databricks

Unity Catalog is a unified governance solution for data and AI on the Databricks platform. 

What we have done so far:

  • Workspace Creation (Databricks): You've established your collaborative environment within the Databricks platform. This is where you and your team can develop and run data engineering, data science, and machine learning workflows.

  • Notebook Usage: You've utilized Databricks notebooks as your primary interface for writing and executing code. Notebooks are interactive environments that allow you to combine code (Python, Scala, SQL, R), visualizations, and documentation.

  • Cluster Configuration (Spark Processing Engine): You've set up the computational resources needed to process your data using Apache Spark. This involves defining the size and configuration of your Spark clusters, which will distribute the workload across multiple nodes for parallel processing.

  • Output Storage: You've persisted the results of your Spark processing in different forms within a storage account (likely Azure Data Lake Storage Gen2 or a similar service, given the context of Databricks on Azure):

    • Managed Tables: These tables are fully managed by Databricks. The data and metadata are stored in the default managed location associated with your Databricks workspace or Unity Catalog metastore. When you drop a managed table, both the schema and the underlying data are deleted. These tables typically use the Delta Lake format by default.

    • External Tables: These tables have their metadata managed by Databricks, but the underlying data resides in a location you specify in your storage account. When you drop an external table, only the metadata is removed; the data in the storage account remains untouched. External tables can be created on various file formats.

    • File Format: You've also stored data directly in your storage account in a specific file format (e.g., CSV, JSON, Parquet, Delta files not associated with an external table definition). This provides raw data storage that can be accessed by various tools and services.
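
    As a minimal sketch of the difference between these table types (the table names, columns, and abfss:// path below are placeholders, not from the project):

      # Managed table: Databricks owns both the metadata and the data files.
      spark.sql("""
          CREATE TABLE IF NOT EXISTS sales_managed (
              id INT,
              amount DOUBLE
          )
      """)

      # External table: metadata is registered in Databricks, but the data stays
      # at the storage path we specify.
      spark.sql("""
          CREATE TABLE IF NOT EXISTS sales_external (
              id INT,
              amount DOUBLE
          )
          LOCATION 'abfss://bronze@<storage-account>.dfs.core.windows.net/sales/'
      """)

      # DROP TABLE sales_managed deletes both schema and data;
      # DROP TABLE sales_external removes only the metadata, the files remain.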


    • Now we have to establish a secure connection between Databricks and the storage account, and we also have to secure the compute clusters and the Databricks notebooks.

    • First we need to connect the workspace to the storage account securely. One option is Azure Storage (ADLS Gen2) access keys, which is not recommended because an access key effectively exposes the complete storage account. Another option is a Shared Access Signature (SAS), which is also not recommended here because SAS tokens are mainly intended for third parties that need limited, scoped access.



    • The third option, Managed Identity, is the recommended approach to connect securely. Access is granted through IAM (Access control), which is available on every resource inside Azure.

    • Microsoft Entra ID is also involved; the tenant is created automatically when the Azure account is created.


    • Through this Entra ID we can track everything users are doing within the Azure account.

    • You can access these logs and reports through the Microsoft Entra admin center (or the Azure portal). Navigate to Entra ID > Monitoring & health. Here you'll find options for:

      • Audit logs
      • Sign-in logs
      • Provisioning logs
      • Usage and insights
      • Diagnostic settings (to configure where logs are sent for long-term storage and analysis)

      By regularly reviewing and analyzing these logs, you can gain valuable insights into user behavior, identify potential security risks, troubleshoot issues, and ensure compliance within your Azure environment.
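
      If the diagnostic settings route these logs to a Log Analytics workspace, they can also be queried programmatically. A minimal sketch using the azure-monitor-query Python package (the workspace ID is a placeholder, and this assumes the AuditLogs table is being exported to that workspace):

        # Requires: pip install azure-identity azure-monitor-query
        from datetime import timedelta
        from azure.identity import DefaultAzureCredential
        from azure.monitor.query import LogsQueryClient

        client = LogsQueryClient(DefaultAzureCredential())

        # Placeholder workspace ID; AuditLogs exists only if diagnostic settings
        # export Entra ID audit logs to this Log Analytics workspace.
        response = client.query_workspace(
            workspace_id="<log-analytics-workspace-id>",
            query="AuditLogs | where TimeGenerated > ago(1d) | project TimeGenerated, OperationName | take 20",
            timespan=timedelta(days=1),
        )

        for table in response.tables:
            for row in table.rows:
                print(row)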




    • Databricks is not a native Azure service. Before Unity Catalog, Databricks and the storage account were connected using a Service Principal, which is available inside Microsoft Entra ID. Later, Unity Catalog came into the picture; it not only connects the storage account and Databricks securely but also manages all the resources inside Databricks securely.


  • Databricks access connector is used to connect Databricks to storage via managed identity, and Unity Catalog enablement and the creation of its default catalog are tied to the workspace.
  • Steps to create a Service Principal (some projects still use the service principal approach):
  • Microsoft Entra ID: You start by accessing the central identity management service for Microsoft Azure.
  • Default Directory: If you have multiple Azure Active Directory tenants, ensure you are in the correct one. "Default directory" usually refers to the primary tenant associated with your Azure subscription.
  • App registrations: Historically, Service Principals were managed under "App registrations". While the navigation might slightly differ in the latest Azure portal interface, the concept remains the same. You're looking for the registration of the application or service that the Service Principal represents.
  • Certificates & secrets: Once you've selected the specific App registration (which has an associated Service Principal), you navigate to the section where you can manage its credentials. This is typically labeled "Certificates & secrets" (or similar).
  • Creating Credentials: Here, you have two main options for authentication:
    • Certificates: You can upload your own X.509 certificate or have Azure generate one. Certificate-based authentication is often considered more secure for long-lived applications.
    • Client Secrets: These are password-like strings that Azure generates. They have an expiration date and need to be managed carefully.
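
    A minimal sketch of how a notebook can use such a client ID and client secret to read ADLS Gen2 over OAuth (the storage account, tenant ID, secret scope, and key names below are placeholders; in practice the secret should come from a secret scope, never be hard-coded):

      # Placeholders: replace <storage-account>, <tenant-id>, and the secret scope/key names.
      storage_account = "<storage-account>"
      tenant_id = "<tenant-id>"
      client_id = dbutils.secrets.get(scope="project-scope", key="sp-client-id")
      client_secret = dbutils.secrets.get(scope="project-scope", key="sp-client-secret")

      spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
      spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
                     "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
      spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
      spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
      spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
                     f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

      # After this, abfss:// paths on that account are readable from the notebook.
      df = spark.read.format("csv").option("header", "true").load(
          f"abfss://bronze@{storage_account}.dfs.core.windows.net/sample/")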

  • Below is the Databricks Access Connector.



  • Once the storage account is connected with Databricks, note that one storage account is also created by default for monitoring and internal purposes:
  • Databricks Workspace Storage:

    • When you create an Azure Databricks workspace, Azure automatically provisions a storage account that is used internally by Databricks. This storage account is where the Databricks File System (DBFS) root is located.
    • DBFS is a distributed file system mounted into your Databricks workspace and available on Databricks clusters. It's used for various purposes, including storing notebooks, libraries, experiment results, and data.  
    • This workspace storage account is typically separate from any storage accounts you might connect to explicitly for your data using Unity Catalog or access connectors.
    • The naming convention for this storage account often includes the workspace name and some unique identifiers.



  • Now we know how Databricks is securely connected with the storage account. In Databricks, we now have to manually register the storage information under Catalog in the workspace: go to Catalog --> External Data --> Credentials tab, where we need to provide the access connector ID (found under its Properties).


  • In order to access the container, we need to move to the next tab (External Locations), under which we need to register the container that is present in the storage account.

  • Creating the Unity Catalog
  • Step 1:

  • Search for "Access Connector for Azure Databricks" in the Azure portal:

  • One access connector is already there for the default catalog; in order to create a new one, click the Create button.

  • Enter the resource group as well as the project name in the Name field under Instance details, then Review + create.








  • Step 2: Go to the project-related data lake storage account, then click Add role assignment and choose the Storage Blob Data Contributor role,



  • and add the created Databricks Access Connector under Select members,

  • then Review + assign.

  • Step 3: Register the Databricks Access Connector in the workspace Catalog.

  • Go to the created Databricks Access Connector; under Properties, we can see the ID.

  • Go to Catalog in the workspace and open the External Data tab.


  • Give the credential a name and the access connector ID, and then click Create.
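
    The same registration can also be scripted instead of clicking through the UI. A hedged sketch from a notebook (the credential name and the access connector resource ID are placeholders; depending on the workspace setup this may need to run on a Unity Catalog-enabled cluster or SQL warehouse):

      # Placeholder credential name and access connector resource ID.
      spark.sql("""
          CREATE STORAGE CREDENTIAL IF NOT EXISTS pricing_analysis_credential
          WITH (
              AZURE_MANAGED_IDENTITY (
                  ACCESS_CONNECTOR_ID = '/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/accessConnectors/<connector-name>'
              )
          )
      """)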

  • If you want to create a new connection for external storage, go to Catalog and then the External Locations tab.



  • Give the external location name as well as the URL (copy and paste the URL into Notepad and replace the container and storage account names), select the required storage credential, then click Create.

  • Then the created working-lab container link appears.
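
    Equivalently, the external location can be created with SQL from a notebook (the location, container, storage account, and credential names are placeholders):

      spark.sql("""
          CREATE EXTERNAL LOCATION IF NOT EXISTS working_labs
          URL 'abfss://workinglabs@<storage-account>.dfs.core.windows.net/'
          WITH (STORAGE CREDENTIAL pricing_analysis_credential)
      """)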



  • Then navigate to a Databricks notebook, enter dbutils.fs.ls("<file location of the working-labs container>"), attach the notebook to the cluster, and run it.
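
    A minimal check from the notebook (the container and storage account names are placeholders for your own):

      # If the external location and credential are set up correctly, this lists the files.
      files = dbutils.fs.ls("abfss://workinglabs@<storage-account>.dfs.core.windows.net/")
      display(files)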


  • Likewise, repeat this for the bronze, silver, and gold containers.
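
    A sketch of doing the same for the remaining containers in one loop (the container names follow the post; the storage account and credential names are placeholders):

      # Registers one external location per medallion container.
      for container in ["bronze", "silver", "gold"]:
          spark.sql(f"""
              CREATE EXTERNAL LOCATION IF NOT EXISTS {container}_location
              URL 'abfss://{container}@<storage-account>.dfs.core.windows.net/'
              WITH (STORAGE CREDENTIAL pricing_analysis_credential)
          """)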



  • Also create one new container to store all the underlying data required for the Unity Catalog that is going to be created.



  • Create the external location for the newly created container.


  • Step 6:

  • Create a project-specific Unity Catalog.


  • Go to Catalog in the workspace, click +, give the project name (pricing_analysis) as the catalog name, and select the storage location that was created (pricing_analysis) dedicated to this new catalog.
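
    The UI step above corresponds roughly to this SQL (the catalog name matches the post; the managed location URL is a placeholder):

      # Creates the project catalog with its own dedicated managed storage location.
      spark.sql("""
          CREATE CATALOG IF NOT EXISTS pricing_analysis
          MANAGED LOCATION 'abfss://pricinganalysis@<storage-account>.dfs.core.windows.net/'
      """)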
  •  What are the advantages of configuring Unity Catalogue?

    • Centralized Data Governance

    • Provides Fine Grained Access Control on Data and Data Assets

    • Can be used across multiple Databricks workspaces in the same region

    • Data Lineage and Metadata captured at various levels

  •   What is a metastore in Unity Catalogue?

    • The metastore is the underlying storage layer for all of the objects and data assets that use Unity Catalogue

  • Explain how to configure Unity Catalogue?

    • Just explain all of the steps we have done in this module

    • Create "Databricks Access Connector" Resource

    • Give "Storage Blob Data Contributor" access for Databricks Access Connector in Azure Data Lake Storage Account

    • Register "Databricks Access Connector" properties inside Workspace Catalogue

    • Link Storage account containers inside Workspace Catalogue

    • Create and Link Storage account container to store underlying data for all data objects created in unity catalogue

    • Create new catalogue and link its own storage location

    • To use them in code, always follow the three-level naming convention: catalogue.schema.table_name
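
    For example, a quick sketch of the three-level naming from a notebook (the schema and table names are hypothetical):

      spark.sql("CREATE SCHEMA IF NOT EXISTS pricing_analysis.bronze")
      spark.sql("""
          CREATE TABLE IF NOT EXISTS pricing_analysis.bronze.daily_prices (
              trade_date DATE,
              symbol STRING,
              price DOUBLE
          )
      """)
      # Always reference objects as catalogue.schema.table_name.
      df = spark.table("pricing_analysis.bronze.daily_prices")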
