

 Introduction 


Structured data:

It has a uniform data format throughout the file or database.
e.g. Excel, CSV, Parquet, database tables

Semi-structured data:
Some records have the same fields while others do not.
e.g. if we collect demographic details for multiple countries in one JSON file, the UK records will have a postcode field and the India records will not.

Unstructured data:
We cannot predict what will be in the file. e.g. HTML files and any media files

To process unstructured data, we first need to convert it into semi-structured data.
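As a rough illustration of that conversion, here is a minimal Python sketch (standard library only) that turns an unstructured HTML page into a semi-structured JSON record. The file names page.html and page.json and the chosen fields are placeholders for illustration.

```python
# Minimal sketch: turn an unstructured HTML page into semi-structured JSON.
# File names (page.html, page.json) and fields are illustrative placeholders.
import json
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the page title and all hyperlinks while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = data.strip()

with open("page.html", encoding="utf-8") as f:
    parser = LinkExtractor()
    parser.feed(f.read())

# The extracted fields now have a predictable (semi-structured) shape.
record = {"title": parser.title, "links": parser.links}
with open("page.json", "w", encoding="utf-8") as f:
    json.dump(record, f, indent=2)
```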



That green box above is called the analysis phase. We need to know the source systems and the type of data in terms of the three Vs: volume, variety, and velocity (speed).

What does a data engineer do:









On-premises (on-prem):


Cloud:

A cloud provider has many data centres, each data centre has many servers, and a server is just like a separate machine. Users can access these servers through the front end (website) provided by the cloud provider.


Data Analytics Terminologies

Let's break down some of these areas and connect them to your initial goal of building the Medallion Architecture on Azure:

Connecting the Dots to Your Data Lakehouse:

  • Source Systems & Data Formats: You'll encounter all these source systems and data formats feeding into your bronze layer in ADLS Gen2. The raw, unprocessed data will land here, regardless of its structure or origin.
  • ETL (Extract, Transform, Load): This is the core process that moves and refines data through your Medallion layers.
    • Bronze to Silver: ETL processes will extract data from the bronze layer, transform it (cleaning, conforming, integrating), and load it into the silver layer.
    • Silver to Gold: Further ETL processes will refine the silver layer data, transform it into business-ready models (potentially using dimensional modeling concepts from your OLAP section), and load it into the gold layer.
  • OLAP (Online Analytical Processing) Systems, Data Warehouse, Data Mart, Dimensional Modelling: While the Medallion Architecture on a data lakehouse is a modern approach, the principles of OLAP and dimensional modelling often influence how you structure your gold layer for efficient analysis. You might create denormalized tables (similar to star schemas) in the gold layer optimized for specific reporting or analytical use cases.
  • Incremental Loading & Change Data Capture (CDC): These are crucial techniques for efficiently updating your data lakehouse layers, especially as your source systems generate new or changed data. CDC helps identify and extract only the changes, while incremental loading applies these changes to your target layers (a minimal sketch follows this list).
  • Scheduling: Orchestration and scheduling tools (like Azure Data Factory or within Databricks) will be essential to automate your data ingestion and transformation pipelines, ensuring data flows consistently through the bronze, silver, and gold layers.
  • Code Deployment (CI/CD Pipeline): Implementing a CI/CD pipeline is vital for managing and deploying your data engineering code (e.g., Spark notebooks, data factory pipelines) in a reliable and automated manner.
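To make the bronze-to-silver ETL step and incremental loading concrete, here is a minimal PySpark sketch. The paths, the order_id/order_date/amount columns, and the last_modified watermark are illustrative assumptions, not a definitive implementation.

```python
# Minimal PySpark sketch of an incremental bronze -> silver load.
# Paths, column names, and the watermark value are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

bronze_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/sales/"
silver_path = "abfss://silver@<storage-account>.dfs.core.windows.net/sales/"

# Only pick up records newer than the last successful load (incremental loading).
last_load_ts = "2024-01-01T00:00:00"  # in practice, read this from a control table

bronze_df = (
    spark.read.format("parquet")
    .load(bronze_path)
    .filter(F.col("last_modified") > F.lit(last_load_ts))
)

# Basic cleansing and conforming on the way to silver.
silver_df = (
    bronze_df.dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
    .filter(F.col("amount").isNotNull())
)

# Delta format assumes a Databricks / Delta Lake runtime.
silver_df.write.format("delta").mode("append").save(silver_path)
```

In practice the watermark would be read from and written back to a control table or checkpoint rather than hard-coded.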

Data Engineer Job Roles in this Context:

As a Data Engineer, you'll be heavily involved in:

  • Ingest: Building and maintaining pipelines to ingest data from various source systems into the bronze layer of your ADLS Gen2. This includes understanding the feasibility of connecting to different sources and handling various data formats.
  • Transform: Developing the data transformation logic (using tools like Databricks or Azure Data Factory) to cleanse, integrate, and shape the data as it moves from the bronze to the silver and finally to the gold layer. This requires understanding data transformation requirements and formalizing them into code.
  • Load: Ensuring the transformed data is efficiently loaded into the appropriate layers of your data lakehouse.
  • Collaboration: Working closely with Business Analysts to understand reporting needs, Data Architects to ensure the solution aligns with the overall architecture, Testers to ensure data quality, Data Analysts to understand their analytical requirements, and potentially AI/ML Engineers who might consume data from the gold layer.
  • Day-to-day Tasks: Your list accurately reflects the daily activities, including understanding requirements, asking clarifying questions, coding, unit testing, and developing CI/CD pipelines.

Data Analytics Architecture and Data Flow:

Your described data flow (Source -> Ingest -> Transform -> Publish -> Present) perfectly aligns with the Medallion Architecture:

  • Source: Your various OLTP systems and other data sources.
  • Ingest: Landing the raw data into the bronze layer.
  • Transform: Processing and refining data through the silver and gold layers.
  • Publish: Making the curated data in the gold layer available for consumption.
  • Present (Visualisation layer): Tools like Power BI or Tableau would connect to the gold layer for reporting and analysis.

Data Storage and Processing Service Selection:

  • File Based (Azure Data Lake Storage Gen2): Your choice for the data lake, ideal for storing the large volumes of data in all three Medallion layers.
  • Database (SQL Server Database): Might be a source system or could potentially be used for specific analytical workloads or a reporting database alongside your data lakehouse.
  • Data Processing Service Selection (Databricks, Azure Data Factory): Both are excellent choices for building your ETL/ELT pipelines. Databricks excels in its Spark-based processing capabilities, while Azure Data Factory offers a more visual, orchestration-focused approach. Often, they are used together.
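As an example of how Databricks and ADLS Gen2 meet, the sketch below shows a common pattern for letting a Databricks notebook read the lake directly over abfss:// with a service principal. The storage account, secret scope, application ID, and tenant ID are placeholders, and spark and dbutils are the objects Databricks provides inside a notebook.

```python
# Sketch: configuring a Databricks notebook to read ADLS Gen2 via abfss://.
# spark and dbutils are pre-defined in Databricks notebooks; all names are placeholders.
storage_account = "<storage-account>"
service_credential = dbutils.secrets.get(scope="<secret-scope>", key="<sp-secret-key>")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               service_credential)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Once configured, the Medallion layers are addressable as ordinary paths.
df = spark.read.parquet(f"abfss://bronze@{storage_account}.dfs.core.windows.net/sales/")
```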

Data Security Framework and Operational Support Plan: These are critical for any data platform and need careful consideration for your data lakehouse.

Cloud Fundamentals (IaaS, PaaS, SaaS): Understanding these models helps in choosing and managing the Azure services you'll be using. ADLS Gen2 and Azure Databricks are primarily PaaS offerings, where Microsoft manages the underlying infrastructure.

Log in to the Azure portal: go to portal.azure.com.

Create the resource for Databricks.





We can also create any resource within the resource group. Here we are creating a storage account.





Use the standard naming convention while creating resources, e.g. RSG - any project name - env.
Enable the hierarchical namespace.
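If you would rather script the storage account instead of clicking through the portal, a sketch using the azure-mgmt-storage Python SDK might look like the following. The subscription ID, resource group, account name, and region are placeholders, and the azure-identity and azure-mgmt-storage packages must be installed.

```python
# Sketch: create a StorageV2 account with the hierarchical namespace enabled
# (the setting that makes Blob Storage behave as ADLS Gen2).
# Subscription, resource group, account name, and region are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    resource_group_name="rsg-myproject-dev",   # follows the RSG - project - env convention
    account_name="stmyprojectdev",             # must be globally unique, lowercase
    parameters=StorageAccountCreateParameters(
        location="uksouth",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,                   # hierarchical namespace = ADLS Gen2
    ),
)
account = poller.result()
print(account.name, account.provisioning_state)
```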



Now we have created the Databricks resource as well as the Azure Data Lake Storage resource.


We can see all the past activities under the notification icon 

We need to create containers to organize the data in the data lakehouse storage. We will establish three distinct, top-level containers to implement the Medallion Architecture within our data lakehouse storage. These containers will correspond to the different layers of the architecture:

  • A bronze layer container will house the raw, unprocessed data ingested from our various sources.
  • A silver layer container will store cleaned, transformed, and integrated data, representing a refined version of the bronze layer data.
  • A gold layer container will contain highly curated and business-ready data, optimized for analysis, reporting, and consumption by end-users or applications.
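A minimal sketch of creating these three containers with the azure-storage-file-datalake Python SDK, assuming the storage account name is a placeholder and your identity has data-plane permissions on the account:

```python
# Sketch: create the bronze/silver/gold containers (file systems) in ADLS Gen2.
# The account URL is a placeholder; DefaultAzureCredential picks up your Azure login.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

for layer in ("bronze", "silver", "gold"):
    service.create_file_system(file_system=layer)  # a container is a "file system" in ADLS Gen2
```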

Let's walk through the Azure resource hierarchy and provisioning for your data lakehouse.

Azure Resources Hierarchy

You've correctly outlined the Azure resource hierarchy:

  • Tenant: Represents your organization in Azure. It's the top-level container and is associated with your Azure Active Directory.
  • Subscription: A logical container for your Azure resources. It's linked to a billing account and provides a boundary for resource management and cost control. You can have multiple subscriptions within a tenant.
  • Resource Group: A logical grouping of Azure resources that are related to a specific solution. Resource groups help you organize, manage, and apply policies to your resources collectively.
  • Resources: The individual Azure services you provision, such as Azure Data Lake Storage Gen2 accounts, Azure Databricks workspaces, databases, virtual machines, etc.

Azure Data Lake Storage Gen2 Provisioning

Azure Data Lake Storage Gen2 (ADLS Gen2) is built on Azure Blob Storage and provides a hierarchical namespace along with the scalability and cost-effectiveness of Blob Storage.

Blob vs. ADLS Gen2

  • Azure Blob Storage: Designed for storing massive amounts of unstructured data (blobs). It has a flat namespace, meaning objects are organized within containers but without a folder-like hierarchy at the storage level.
  • Azure Data Lake Storage Gen2: Offers all the capabilities of Blob Storage plus a hierarchical namespace. This allows you to organize objects in a logical, directory-like structure, which significantly improves data organization, management, and query performance, especially for big data analytics. ADLS Gen2 also offers enhanced security features and lower transaction costs for analytics workloads.

Hierarchical Namespace

The hierarchical namespace in ADLS Gen2 is a key differentiator. It enables:

  • Directory and File Structure: You can organize your data into folders and subfolders, making it easier to navigate and manage.
  • Atomic Operations: Operations like renaming or deleting a directory are atomic, which is crucial for data consistency in analytical pipelines (see the sketch after this list).
  • Improved Performance: Certain big data processing frameworks can leverage the hierarchical structure for optimized data access.
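To illustrate the atomic rename mentioned above, here is a small sketch using the same Python SDK; the account, container, and directory names are placeholders.

```python
# Sketch: an atomic directory rename, which the hierarchical namespace makes possible.
# Account, container, and directory names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

fs = service.get_file_system_client("bronze")
directory = fs.get_directory_client("sales_staging")

# The rename is a single metadata operation, not a copy-and-delete of every blob.
directory.rename_directory(new_name="bronze/sales")
```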

Creating Containers

In the context of ADLS Gen2, "containers" are the top-level organizational units within your storage account, similar to the root of a file system. You can either create separate bronze, silver, and gold containers (as above), or create a single container and, within it, virtual directories (which appear as folders) named bronze, silver, and gold to represent your Medallion Architecture layers.

You can create containers and directories using various methods:

  • Azure Portal: A web-based interface for managing Azure resources.
  • Azure CLI: A command-line interface for interacting with Azure services.
  • PowerShell: A scripting language that can be used to manage Azure resources.
  • Azure Storage Explorer: A free, standalone application for working with Azure Storage data.
  • SDKs (e.g., Python, .NET, Java): Programmatic access to Azure Storage services.
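For example, with the Python SDK (azure-storage-file-datalake) the single-container layout could be sketched like this; the container name datalake and the account URL are placeholders.

```python
# Sketch of the single-container layout: one container ("datalake") with
# bronze/silver/gold directories inside it. Names are illustrative placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

fs = service.create_file_system(file_system="datalake")  # the top-level container
for layer in ("bronze", "silver", "gold"):
    fs.create_directory(layer)                           # virtual directories for each layer
```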

Azure Databricks Provisioning

Azure Databricks is an Apache Spark-based analytics service that simplifies big data processing and machine learning.

Pricing Tiers

Azure Databricks offers several pricing tiers to suit different needs and budgets:

  • Standard: Provides a collaborative workspace with basic security and analytics capabilities. Suitable for smaller teams and less demanding workloads.
  • Premium: Offers advanced security features (like Azure Active Directory passthrough and customer-managed keys), enterprise-grade SLAs, and capabilities like Delta Lake and MLflow. Recommended for production environments and larger organizations.
  • Premium Trial: A time-limited trial of the Premium tier, allowing you to explore its advanced features.

Resource Provisioning Mandatory Inputs:

When provisioning an Azure Databricks workspace (and generally most Azure resources), you'll typically need to provide the following mandatory inputs:

  • Subscription Name: The Azure subscription where you want to create the resource.
  • Resource Group Name: The name of the resource group where the Databricks workspace will reside. You can either use an existing resource group or create a new one.
  • Region (Data Centre) Name: The geographical location (e.g., UK South, West Europe, East US) where your Databricks workspace and associated resources will be deployed. Choosing a region close to your data and users can improve performance and reduce latency.
  • Resource Name: A unique name for your Azure Databricks workspace within the chosen resource group.
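As a sketch of how those four inputs map to code, the azure-mgmt-databricks management SDK can provision the workspace. Every identifier below is a placeholder, and the exact model and parameter names should be checked against the current SDK documentation.

```python
# Sketch: provisioning a Databricks workspace with the four mandatory inputs
# (subscription, resource group, region, resource name). All values are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.databricks import AzureDatabricksManagementClient
from azure.mgmt.databricks.models import Sku, Workspace

subscription_id = "<subscription-id>"
client = AzureDatabricksManagementClient(DefaultAzureCredential(), subscription_id)

workspace = client.workspaces.begin_create_or_update(
    resource_group_name="rsg-myproject-dev",      # resource group name
    workspace_name="dbw-myproject-dev",           # resource name
    parameters=Workspace(
        location="uksouth",                       # region (data centre)
        sku=Sku(name="premium"),                  # pricing tier
        managed_resource_group_id=(
            f"/subscriptions/{subscription_id}/resourceGroups/rsg-myproject-dev-managed"
        ),
    ),
).result()
print(workspace.name)
```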
