Unity Catalog is a unified governance solution for data and AI on the Databricks platform.
What we have done so far:
Workspace Creation (Databricks): You've established your collaborative environment within the Databricks platform. This is where you and your team can develop and run data engineering, data science, and machine learning workflows.
Notebook Usage: You've utilized Databricks notebooks as your primary interface for writing and executing code. Notebooks are interactive environments that allow you to combine code (Python, Scala, SQL, R), visualizations, and documentation.
Cluster Configuration (Spark Processing Engine): You've set up the computational resources needed to process your data using Apache Spark. This involves defining the size and configuration of your Spark clusters, which will distribute the workload across multiple nodes for parallel processing.
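As a rough illustration (not tied to any particular cluster size), a single notebook cell is enough to let Spark spread work across the workers; `spark` is the SparkSession that Databricks provides in every notebook:

```python
# Minimal sketch: Spark parallelism from a Databricks notebook.
# `spark` is the SparkSession injected into every Databricks notebook.
df = spark.range(0, 100_000_000)              # a large distributed dataset of ids
print(df.rdd.getNumPartitions())              # partitions = units of parallel work
df.selectExpr("sum(id) AS total").show()      # the aggregation runs across the cluster nodes
```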
Output Storage: You've persisted the results of your Spark processing in different forms within a storage account (likely Azure Data Lake Storage Gen2 or a similar service, given the context of Databricks on Azure):
- Managed Tables: These tables are fully managed by Databricks. The data and metadata are stored in the default managed location associated with your Databricks workspace or Unity Catalog metastore. When you drop a managed table, both the schema and the underlying data are deleted. These tables typically use the Delta Lake format by default.
- External Tables: These tables have their metadata managed by Databricks, but the underlying data resides in a location you specify in your storage account. When you drop an external table, only the metadata is removed; the data in the storage account remains untouched. External tables can be created on various file formats.
- File Format: You've also stored data directly in your storage account in a specific file format (e.g., CSV, JSON, Parquet, or Delta files not associated with an external table definition). This provides raw data storage that can be accessed by various tools and services. A sketch of all three options appears after this list.
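A rough sketch of the three options follows; the table names, container, and storage account are placeholders, and the cluster is assumed to already have access to the storage account:

```python
# Sketch only: placeholder names; `spark` is the notebook's SparkSession.
df = spark.createDataFrame(
    [(1, "widget", 9.99), (2, "gadget", 19.99)],
    ["id", "product", "price"],
)

# 1) Managed table: Databricks owns metadata AND data (Delta by default);
#    DROP TABLE removes both.
df.write.mode("overwrite").saveAsTable("sales_managed")

# 2) External table: metadata in Databricks, data at a path you control;
#    DROP TABLE removes only the metadata.
ext_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/sales_external"
df.write.mode("overwrite").option("path", ext_path).saveAsTable("sales_external")

# 3) Plain files: no table definition, just data written in a chosen format.
df.write.mode("overwrite").parquet(
    "abfss://raw@mystorageaccount.dfs.core.windows.net/sales_files"
)
```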
Now we have to establish a secure connection between Databricks and the storage account, and we also have to manage secure compute clusters and Databricks notebooks. First, we need to connect the workspace to the storage account securely. One option is the Azure Data Lake Storage Gen2 access keys, which is not recommended because it effectively exposes the entire storage account. Another is a shared access signature (SAS), which is also not recommended here, because it is mostly intended for third parties who need limited access.
The third option, Managed Identity, is the recommended approach for connecting securely. Access is granted through IAM (Access control), which is available on every resource inside Azure.
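For contrast, this is roughly what the access-key approach looks like in a notebook (the storage account name and the secret scope/key are placeholders); it works, but it effectively hands the whole account to the cluster, which is why Managed Identity is preferred:

```python
# Not recommended: access-key auth exposes the entire storage account.
# "mystorageaccount" and the secret scope/key names are placeholders.
spark.conf.set(
    "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-account-key"),
)

# Once the key is set, any container in that account is reachable from the cluster.
display(dbutils.fs.ls("abfss://raw@mystorageaccount.dfs.core.windows.net/"))
```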
Microsoft Entra ID is also available; it is created automatically along with the Azure account.
Through Entra ID we can track everything users are doing within the Azure account. You can access these logs and reports through the Microsoft Entra admin center (or the Azure portal). Navigate to Entra ID > Monitoring & health. Here you'll find options for:
- Audit logs
- Sign-in logs
- Provisioning logs
- Usage and insights
- Diagnostic settings (to configure where logs are sent for long-term storage and analysis)
By regularly reviewing and analyzing these logs, you can gain valuable insights into user behavior, identify potential security risks, troubleshoot issues, and ensure compliance within your Azure environment.
Because Databricks is not a native Azure service, the Databricks workspace and the storage account were previously connected using a service principal, which is available inside Microsoft Entra ID. Later, Unity Catalog comes into the picture: it not only connects the storage and Databricks securely, but also manages all the resources inside Databricks securely. A service principal can authenticate in one of two ways:
- Certificates: You can upload your own X.509 certificate or have Azure generate one. Certificate-based authentication is often considered more secure for long-lived applications.
- Client Secrets: These are password-like strings that Azure generates. They have an expiration date and need to be managed carefully.
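As an illustration, connecting with a service principal and a client secret typically looks like the following notebook cell (the storage account, secret scope, application ID, and tenant ID are all placeholders):

```python
# Service-principal (OAuth) access to ADLS Gen2; every name and ID below is a placeholder.
storage_account = "mystorageaccount"
client_secret = dbutils.secrets.get(scope="my-scope", key="sp-client-secret")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               "<application-client-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
```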
Once the storage account is connected with Databricks, Unity Catalog by default creates one storage account for monitoring purposes.
Databricks Workspace Storage:
- When you create an Azure Databricks workspace, Azure automatically provisions a storage account that is used internally by Databricks. This storage account is where the Databricks File System (DBFS) root is located.
- DBFS is a distributed file system mounted into your Databricks workspace and available on Databricks clusters. It's used for various purposes, including storing notebooks, libraries, experiment results, and data.
- This workspace storage account is typically separate from any storage accounts you might connect to explicitly for your data using Unity Catalog or access connectors.
- The naming convention for this storage account often includes the workspace name and some unique identifiers.
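You can browse this internal file system from any notebook, for example:

```python
# List the DBFS root, which lives in the workspace-managed storage account.
display(dbutils.fs.ls("dbfs:/"))
```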
The one that is already there is the default catalog; to create a new one, click the Create button.
Enter the resource group and the project name in the Name field under Instance details, then select Review + create.
Go to Catalog in the workspace and open the External Data tab.
Give the credential name and the access connector ID, then click Create.
Give the external location name and the URL (copy and paste the URL into Notepad and replace the container and storage account names), select the required storage credential, then click Create.
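If you prefer to script this step instead of clicking through the UI, the same external location can also be created from a notebook (the location, credential, container, and storage account names below are placeholders, and the storage credential is assumed to be registered already):

```python
# SQL equivalent of the "External Data" UI step; all names are placeholders.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS pricing_analysis_loc
    URL 'abfss://pricing-analysis@mystorageaccount.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL my_access_connector_cred)
""")

# Confirm what the workspace can now reach.
display(spark.sql("SHOW EXTERNAL LOCATIONS"))
```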
Step 6:
Go to Catalog in the workspace, click +, give the catalog the project name (pricing_analysis), and select the storage location that was created (pricing_analysis), which is dedicated to this new catalog.
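The same step can also be scripted from a notebook; the managed-location URL below is a placeholder that should match the container you dedicated to this catalog:

```python
# Create the catalog and point its managed (default) storage at the dedicated location.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS pricing_analysis
    MANAGED LOCATION 'abfss://pricing-analysis@mystorageaccount.dfs.core.windows.net/'
""")

# Make it the default catalog for the rest of the notebook.
spark.sql("USE CATALOG pricing_analysis")
```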
What are the advantages of configuring Unity Catalog?
Centralized Data Governance
Provides Fine-Grained Access Control on Data and Data Assets (see the example after this list)
Can be used across multiple Databricks workspaces in the same region
Data Lineage and Metadata captured at various levels
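As a quick sketch of what fine-grained access control looks like in practice (the schema, table, and group names are placeholders):

```python
# Grant a group just enough access to query one table; all names are placeholders.
spark.sql("GRANT USE CATALOG ON CATALOG pricing_analysis TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA pricing_analysis.silver TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE pricing_analysis.silver.products TO `data_analysts`")

# Review the permissions that now exist on the table.
display(spark.sql("SHOW GRANTS ON TABLE pricing_analysis.silver.products"))
```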
What is a metastore in Unity Catalog?
The metastore is the underlying storage layer for all of the objects and data assets used by Unity Catalog.
Explain how to configure Unity Catalog?
Just explain all of the steps we have done in this module:
Create a "Databricks Access Connector" resource
Give "Storage Blob Data Contributor" access to the Databricks Access Connector on the Azure Data Lake Storage account
Register the "Databricks Access Connector" properties inside the workspace Catalog
Link storage account containers inside the workspace Catalog
Create and link a storage account container to store the underlying data for all data objects created in Unity Catalog
Create a new catalog and link its own storage location
To use them in code, always follow three-level names: catalog.schema.table_name.
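For example (the schema and table names are placeholders):

```python
# Always qualify objects with catalog.schema.table when Unity Catalog is in use.
spark.sql("CREATE SCHEMA IF NOT EXISTS pricing_analysis.bronze")

df = spark.createDataFrame([(1, 9.99)], ["product_id", "price"])
df.write.mode("overwrite").saveAsTable("pricing_analysis.bronze.prices")

display(spark.table("pricing_analysis.bronze.prices"))
```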