

 Introduction 


Structured data:

It has a uniform data format throughout the file or database.
e.g. Excel, CSV, Parquet, database tables

Semi-structured data:
Some records have the same fields while others do not.
e.g. if we collect demographic details for multiple countries in one JSON file, the UK records will have a postcode field and the India records will not.

Unstructured data:
We cannot predict what will be in the file. e.g. HTML files and any media files

To process unstructured data, we first need to convert it into semi-structured data.
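As a rough illustration of that conversion, here is a minimal Python sketch (standard library only) that turns an unstructured HTML page into a semi-structured JSON record. The file names page.html and page.json and the chosen fields are placeholders for illustration.

```python
# Minimal sketch: turn an unstructured HTML page into semi-structured JSON.
# File names (page.html, page.json) and fields are illustrative placeholders.
import json
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the page title and all hyperlinks while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = data.strip()

with open("page.html", encoding="utf-8") as f:
    parser = LinkExtractor()
    parser.feed(f.read())

# The extracted fields now have a predictable (semi-structured) shape.
record = {"title": parser.title, "links": parser.links}
with open("page.json", "w", encoding="utf-8") as f:
    json.dump(record, f, indent=2)
```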



That green box above is called the analysis phase. We need to know the source systems and the type of data in terms of the three Vs: volume, variety, and velocity (speed).

What does a data engineer do:









On-premises (on-prem):


Cloud:

A cloud provider has many data centres, each data centre has many servers, and a server is just like a separate machine. Users can access these servers through the front end (website) provided by the cloud provider.


Data Analytics Terminologies

Let's break down some of these areas and connect them to your initial goal of building the Medallion Architecture on Azure:

Connecting the Dots to Your Data Lakehouse:

  • Source Systems & Data Formats: You'll encounter all these source systems and data formats feeding into your bronze layer in ADLS Gen2. The raw, unprocessed data will land here, regardless of its structure or origin.
  • ETL (Extract, Transform, Load): This is the core process that moves and refines data through your Medallion layers.
    • Bronze to Silver: ETL processes will extract data from the bronze layer, transform it (cleaning, conforming, integrating), and load it into the silver layer.
    • Silver to Gold: Further ETL processes will refine the silver layer data, transform it into business-ready models (potentially using dimensional modeling concepts from your OLAP section), and load it into the gold layer.
  • OLAP (Online Analytical Processing) Systems, Data Warehouse, Data Mart, Dimensional Modelling: While the Medallion Architecture on a data lakehouse is a modern approach, the principles of OLAP and dimensional modelling often influence how you structure your gold layer for efficient analysis. You might create denormalized tables (similar to star schemas) in the gold layer optimized for specific reporting or analytical use cases.
  • Incremental Loading & Change Data Capture (CDC): These are crucial techniques for efficiently updating your data lakehouse layers, especially as your source systems generate new or changed data. CDC helps identify and extract only the changes, while incremental loading applies these changes to your target layers (a minimal sketch follows this list).
  • Scheduling: Orchestration and scheduling tools (like Azure Data Factory or within Databricks) will be essential to automate your data ingestion and transformation pipelines, ensuring data flows consistently through the bronze, silver, and gold layers.
  • Code Deployment (CI/CD Pipeline): Implementing a CI/CD pipeline is vital for managing and deploying your data engineering code (e.g., Spark notebooks, data factory pipelines) in a reliable and automated manner.
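To make the bronze-to-silver ETL step and incremental loading concrete, here is a minimal PySpark sketch. The paths, the order_id/order_date/amount columns, and the last_modified watermark are illustrative assumptions, not a definitive implementation.

```python
# Minimal PySpark sketch of an incremental bronze -> silver load.
# Paths, column names, and the watermark value are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

bronze_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/sales/"
silver_path = "abfss://silver@<storage-account>.dfs.core.windows.net/sales/"

# Only pick up records newer than the last successful load (incremental loading).
last_load_ts = "2024-01-01T00:00:00"  # in practice, read this from a control table

bronze_df = (
    spark.read.format("parquet")
    .load(bronze_path)
    .filter(F.col("last_modified") > F.lit(last_load_ts))
)

# Basic cleansing and conforming on the way to silver.
silver_df = (
    bronze_df.dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
    .filter(F.col("amount").isNotNull())
)

# Delta format assumes a Databricks / Delta Lake runtime.
silver_df.write.format("delta").mode("append").save(silver_path)
```

In practice the watermark would be read from and written back to a control table or checkpoint rather than hard-coded.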

Data Engineer Job Roles in this Context:

As a Data Engineer, you'll be heavily involved in:

  • Ingest: Building and maintaining pipelines to ingest data from various source systems into the bronze layer of your ADLS Gen2. This includes understanding the feasibility of connecting to different sources and handling various data formats.
  • Transform: Developing the data transformation logic (using tools like Databricks or Azure Data Factory) to cleanse, integrate, and shape the data as it moves from the bronze to the silver and finally to the gold layer. This requires understanding data transformation requirements and formalizing them into code.
  • Load: Ensuring the transformed data is efficiently loaded into the appropriate layers of your data lakehouse.
  • Collaboration: Working closely with Business Analysts to understand reporting needs, Data Architects to ensure the solution aligns with the overall architecture, Testers to ensure data quality, Data Analysts to understand their analytical requirements, and potentially AI/ML Engineers who might consume data from the gold layer.
  • Day-to-day Tasks: Your list accurately reflects the daily activities, including understanding requirements, asking clarifying questions, coding, unit testing, and developing CI/CD pipelines.

Data Analytics Architecture and Data Flow:

Your described data flow (Source -> Ingest -> Transform -> Publish -> Present) perfectly aligns with the Medallion Architecture:

  • Source: Your various OLTP systems and other data sources.
  • Ingest: Landing the raw data into the bronze layer.
  • Transform: Processing and refining data through the silver and gold layers.
  • Publish: Making the curated data in the gold layer available for consumption.
  • Present (Visualisation layer): Tools like Power BI or Tableau would connect to the gold layer for reporting and analysis.

Data Storage and Processing Service Selection:

  • File Based (Azure Data Lake Storage Gen2): Your choice for the data lake, ideal for storing the large volumes of data in all three Medallion layers.
  • Database (SQL Server Database): Might be a source system or could potentially be used for specific analytical workloads or a reporting database alongside your data lakehouse.
  • Data Processing Service Selection (Databricks, Azure Data Factory): Both are excellent choices for building your ETL/ELT pipelines. Databricks excels in its Spark-based processing capabilities, while Azure Data Factory offers a more visual, orchestration-focused approach. Often, they are used together.
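As an example of how Databricks and ADLS Gen2 meet, the sketch below shows a common pattern for letting a Databricks notebook read the lake directly over abfss:// with a service principal. The storage account, secret scope, application ID, and tenant ID are placeholders, and spark and dbutils are the objects Databricks provides inside a notebook.

```python
# Sketch: configuring a Databricks notebook to read ADLS Gen2 via abfss://.
# spark and dbutils are pre-defined in Databricks notebooks; all names are placeholders.
storage_account = "<storage-account>"
service_credential = dbutils.secrets.get(scope="<secret-scope>", key="<sp-secret-key>")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               service_credential)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Once configured, the Medallion layers are addressable as ordinary paths.
df = spark.read.parquet(f"abfss://bronze@{storage_account}.dfs.core.windows.net/sales/")
```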

Data Security Framework and Operational Support Plan: These are critical for any data platform and need careful consideration for your data lakehouse.

Cloud Fundamentals (IaaS, PaaS, SaaS): Understanding these models helps in choosing and managing the Azure services you'll be using. ADLS Gen2 and Azure Databricks are primarily PaaS offerings, where Microsoft manages the underlying infrastructure.

Log in to the Azure portal: go to portal.azure.com.

Create the resource for Databricks.





We can also create any resource within the resource group. Here we are creating a storage account.





Use the standard naming convention while creating resources, e.g. RSG - any project name - env.
Enable the hierarchical namespace.
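If you would rather script the storage account instead of clicking through the portal, a sketch using the azure-mgmt-storage Python SDK might look like the following. The subscription ID, resource group, account name, and region are placeholders, and the azure-identity and azure-mgmt-storage packages must be installed.

```python
# Sketch: create a StorageV2 account with the hierarchical namespace enabled
# (the setting that makes Blob Storage behave as ADLS Gen2).
# Subscription, resource group, account name, and region are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    resource_group_name="rsg-myproject-dev",   # follows the RSG - project - env convention
    account_name="stmyprojectdev",             # must be globally unique, lowercase
    parameters=StorageAccountCreateParameters(
        location="uksouth",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,                   # hierarchical namespace = ADLS Gen2
    ),
)
account = poller.result()
print(account.name, account.provisioning_state)
```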



Now we have created the Databricks resource as well as the Azure Data Lake Storage resource.


We can see all the past activities under the notification icon 

We need to create containers to organize the data in the data lakehouse storage. We will establish three distinct, top-level containers to implement the Medallion Architecture within our data lakehouse storage. These containers will correspond to the different layers of the architecture:

  • A bronze layer container will house the raw, unprocessed data ingested from our various sources.
  • A silver layer container will store cleaned, transformed, and integrated data, representing a refined version of the bronze layer data.
  • A gold layer container will contain highly curated and business-ready data, optimized for analysis, reporting, and consumption by end-users or applications.
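A minimal sketch of creating these three containers with the azure-storage-file-datalake Python SDK, assuming the storage account name is a placeholder and your identity has data-plane permissions on the account:

```python
# Sketch: create the bronze/silver/gold containers (file systems) in ADLS Gen2.
# The account URL is a placeholder; DefaultAzureCredential picks up your Azure login.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

for layer in ("bronze", "silver", "gold"):
    service.create_file_system(file_system=layer)  # a container is a "file system" in ADLS Gen2
```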

Let's walk through the Azure resource hierarchy and provisioning for your data lakehouse.

Azure Resources Hierarchy

You've correctly outlined the Azure resource hierarchy:

  • Tenant: Represents your organization in Azure. It's the top-level container and is associated with your Azure Active Directory.
  • Subscription: A logical container for your Azure resources. It's linked to a billing account and provides a boundary for resource management and cost control. You can have multiple subscriptions within a tenant.
  • Resource Group: A logical grouping of Azure resources that are related to a specific solution. Resource groups help you organize, manage, and apply policies to your resources collectively.
  • Resources: The individual Azure services you provision, such as Azure Data Lake Storage Gen2 accounts, Azure Databricks workspaces, databases, virtual machines, etc.

Azure Data Lake Storage Gen2 Provisioning

Azure Data Lake Storage Gen2 (ADLS Gen2) is built on Azure Blob Storage and provides a hierarchical namespace along with the scalability and cost-effectiveness of Blob Storage.

Blob vs. ADLS Gen2

  • Azure Blob Storage: Designed for storing massive amounts of unstructured data (blobs). It has a flat namespace, meaning objects are organized within containers but without a folder-like hierarchy at the storage level.
  • Azure Data Lake Storage Gen2: Offers all the capabilities of Blob Storage plus a hierarchical namespace. This allows you to organize objects in a logical, directory-like structure, which significantly improves data organization, management, and query performance, especially for big data analytics. ADLS Gen2 also offers enhanced security features and lower transaction costs for analytics workloads.

Hierarchical Namespace

The hierarchical namespace in ADLS Gen2 is a key differentiator. It enables:

  • Directory and File Structure: You can organize your data into folders and subfolders, making it easier to navigate and manage.
  • Atomic Operations: Operations like renaming or deleting a directory are atomic, which is crucial for data consistency in analytical pipelines (see the sketch after this list).
  • Improved Performance: Certain big data processing frameworks can leverage the hierarchical structure for optimized data access.
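To illustrate the atomic rename mentioned above, here is a small sketch using the same Python SDK; the account, container, and directory names are placeholders.

```python
# Sketch: an atomic directory rename, which the hierarchical namespace makes possible.
# Account, container, and directory names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

fs = service.get_file_system_client("bronze")
directory = fs.get_directory_client("sales_staging")

# The rename is a single metadata operation, not a copy-and-delete of every blob.
directory.rename_directory(new_name="bronze/sales")
```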

Creating Containers

In the context of ADLS Gen2, "containers" are the top-level organizational units within your storage account, similar to the root of a file system. You can either create separate bronze, silver, and gold containers (as above), or create a single container and, within it, virtual directories (which appear as folders) named bronze, silver, and gold to represent your Medallion Architecture layers.

You can create containers and directories using various methods:

  • Azure Portal: A web-based interface for managing Azure resources.
  • Azure CLI: A command-line interface for interacting with Azure services.
  • PowerShell: A scripting language that can be used to manage Azure resources.
  • Azure Storage Explorer: A free, standalone application for working with Azure Storage data.
  • SDKs (e.g., Python, .NET, Java): Programmatic access to Azure Storage services.
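For example, with the Python SDK (azure-storage-file-datalake) the single-container layout could be sketched like this; the container name datalake and the account URL are placeholders.

```python
# Sketch of the single-container layout: one container ("datalake") with
# bronze/silver/gold directories inside it. Names are illustrative placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

fs = service.create_file_system(file_system="datalake")  # the top-level container
for layer in ("bronze", "silver", "gold"):
    fs.create_directory(layer)                           # virtual directories for each layer
```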

Azure Databricks Provisioning

Azure Databricks is an Apache Spark-based analytics service that simplifies big data processing and machine learning.

Pricing Tiers

Azure Databricks offers several pricing tiers to suit different needs and budgets:

  • Standard: Provides a collaborative workspace with basic security and analytics capabilities. Suitable for smaller teams and less demanding workloads.
  • Premium: Offers advanced security features (like Azure Active Directory passthrough and customer-managed keys), enterprise-grade SLAs, and capabilities like Delta Lake and MLflow. Recommended for production environments and larger organizations.
  • Premium Trial: A time-limited trial of the Premium tier, allowing you to explore its advanced features.

Resource Provisioning Mandatory Inputs:

When provisioning an Azure Databricks workspace (and generally most Azure resources), you'll typically need to provide the following mandatory inputs:

  • Subscription Name: The Azure subscription where you want to create the resource.
  • Resource Group Name: The name of the resource group where the Databricks workspace will reside. You can either use an existing resource group or create a new one.
  • Region (Data Centre) Name: The geographical location (e.g., UK South, West Europe, East US) where your Databricks workspace and associated resources will be deployed. Choosing a region close to your data and users can improve performance and reduce latency.
  • Resource Name: A unique name for your Azure Databricks workspace within the chosen resource group.
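As a sketch of how those four inputs map to code, the azure-mgmt-databricks management SDK can provision the workspace. Every identifier below is a placeholder, and the exact model and parameter names should be checked against the current SDK documentation.

```python
# Sketch: provisioning a Databricks workspace with the four mandatory inputs
# (subscription, resource group, region, resource name). All values are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.databricks import AzureDatabricksManagementClient
from azure.mgmt.databricks.models import Sku, Workspace

subscription_id = "<subscription-id>"
client = AzureDatabricksManagementClient(DefaultAzureCredential(), subscription_id)

workspace = client.workspaces.begin_create_or_update(
    resource_group_name="rsg-myproject-dev",      # resource group name
    workspace_name="dbw-myproject-dev",           # resource name
    parameters=Workspace(
        location="uksouth",                       # region (data centre)
        sku=Sku(name="premium"),                  # pricing tier
        managed_resource_group_id=(
            f"/subscriptions/{subscription_id}/resourceGroups/rsg-myproject-dev-managed"
        ),
    ),
).result()
print(workspace.name)
```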
