Skip to main content

Hands on project for ADF

Azure Pipeline Creation and Configuration Steps

To log in to the Azure portal, a Microsoft account is required. After creating the account, sign in to the Azure portal and proceed with the following steps to build the data pipeline.


1. Create Resource Group and Storage Account

  • Create a Resource Group in Azure.
  • Under the resource group, create the required resources for the pipeline.

Azure Data Lake Storage (ADLS Gen2)

  • Create a Storage Account.
  • Enable Hierarchical Namespace to convert it into Data Lake Storage Gen2.
  • Inside the storage account:
    • Create a container (e.g., blog-container).
    • Organize data using folders/subfolders (can be created dynamically or manually using directory structure).

Storage Structure Example

Create a storage account named sdmm:

  • sdmm
    • gold
      • processing
      • sales
    • silver
      • sd
      • mm
    • bronze
      • sd
      • mm


2. Azure Data Factory (ADF) Setup

  • Create an Azure Data Factory instance.
  • Go to Managed Identities and enable System Assigned Managed Identity, then save.

Access Configuration

  • Navigate to the Storage Account → IAM (Access Control).
  • Assign roles:
    • Storage Blob Data Contributor
    • Assign this role to the Data Factory managed identity


3. Linked Service Configuration

  • Open ADF Studio.
  • Go to Manage → Linked Services.
  • Create a linked service connecting:
    • Azure Data Lake Storage Gen2
    • Azure Data Factory
  • Test the connection and ensure it succeeds.
  • Click Create to save the linked service.




THEN CLICK TEST CONNECTION IF THE CONNECTION IS SUCCESSFUL CLICK CREATE 
CREATE SUCCESFULLY .

4. Dataset Creation

  • Create a file in storage:
    • Example path: sap/mm/vendors.csv

Source Dataset

  • In ADF Studio: Author → Datasets → New
  • Choose:
    • Azure → Azure Data Lake Storage Gen2
  • Select the linked service created earlier.
  • Browse to the file path.
  • Set Import Schema = From Connection/Source (to retrieve metadata automatically).
  • Create dataset.

Sink Dataset

  • Create another dataset:
    • Type: Delimited Text
    • Name: delimitedtext_sink
.


in there for the linked service choose the linked service linked with gen 2 storage and then below file path field will get open then choose respective file name 
then choose for import schema : from connection or sorce so its takes the metadata from the  file location.

click ok thus the dat set will be created .
like create the sink dataset named delinitedtext_sink



5. Pipeline Creation

  • Create a new pipeline in ADF.
  • Add a Copy Activity (used for moving data between source and destination).

Configure:

  • Source dataset → ADLS Gen2 source file
  • Sink dataset → destination dataset

Then:

  • Click Validate All
  • Click Publish All

Run the pipeline:

  • Use Trigger → Trigger Now
  • If no errors occur, the pipeline executes successfully.

click the copy activity unde rthat tab give source and sink dataset .
then give validate all to validate everything and then publish all .

to sucessfully trigger give trigger --> trigger now option.

if there is no errors then the pipeline runs sucessfully .



6. Monitoring

  • Go to the Monitor tab in ADF.
  • Check pipeline execution status.
  • Verify that the file is successfully copied into the bronze layer folder.
PARAMETERIZED PIPELINE

its not the good to single pipeline for one time use . we have to parameterized the pipeline so that we can able to run as many pipeline constructed under single pipeline.
for that inside the pipeline we can find the parameter tab under that give the value .

    

7. Parameterized Pipeline (Best Practice)

Instead of creating separate pipelines for each file, use parameters to make the pipeline reusable.

Pipeline Parameters

  • p_fileName = vendors.csv
  • p_folderPath = bronze/mm/vendors

Dataset Parameters Mapping

In the dataset, define parameters and map them as:

  • fileName@pipeline().parameters.p_fileName
  • folderPath@pipeline().parameters.p_folderPath

Key Concept

Pipeline Parameter → passes values
Dataset Parameter → receives values




Best Practices (Industry Standard)

  • Pipeline controls the workflow logic
  • Dataset controls the data structure/location
  • Parameters connect pipeline and datasets dynamically
  • Always prefer parameterized pipelines for reusability and scalability


Always use storage browser for eaier access for inside the storage accounts.


Introduce REAL WORLD STRUCTURE (SAP MM + SD)

We now design proper enterprise folders:

    
├── sap/
│ ├── mm/
│ │ ├── vendors/
│ │ ├── materials/
│ │ ├── purchase_orders/
│ │
│ ├── sd/
│ ├── customers/
│ ├── sales_orders/
│ ├── deliveries/

Create the sub folder by clicking add dictionary under sd



in mm :



to create the meta table. we need sql database to store the table .
to creating the sql databse .





Created SQL database successfully.

Go to resourece group to check the databse have created sucessfully.
We can see two entities for database first one is logical database which is for host and connection and
other on eis for our use.

To create the table give to database under resource group and click query editor.
enter the credencials or microsoft entra auth click connect .



create the table .

CREATE TABLE dbo.ingestion_config
(
    config_id INT IDENTITY(1,1),
    source_system VARCHAR(50),
    entity_name VARCHAR(100),
    file_name VARCHAR(100),
    folder_path VARCHAR(200),
    load_type VARCHAR(20),
    is_active CHAR(1),
    created_date DATETIME DEFAULT GETDATE()
);

After creating Insert Your First Record (Vendors)
INSERT INTO dbo.ingestion_config
(
    source_system,
    entity_name,
    file_name,
    folder_path,
    load_type,
    is_active
)
VALUES
(
    'SAP_MM',
    'vendors',
    'vendors.csv',
    'sap/mm/vendors',
    'FULL',
    'Y'
);



one record was been created as the meta data table .


create the linked service for SQL database we have created earlier.

create the dataset for metatable and linked with the database linked service which is already created.




DS_METADATA_CONFIG dataset is created .

create the pipeline and create the lookup activity .
under that activity choose source dataset as dataset created linkd with sql database.




in lookup cong:
The Lookup activity is used to read data from a source and return it as output inside the pipeline.

Source tab:

  • Dataset: DS_METADATA_CONFIG

Query:

SELECT * 
FROM dbo.ingestion_config
WHERE is_active = 'Y';

Set:

First row only = FALSE

in for loop :

in setting at items :
@activity('Lookup1').output.value

inside the for loop add copy activity.

copy activity:

source dataset : database linked dataset.
sink dataset : create the dataset with adls to land.


🧠 SIMPLE FLOW (REMEMBER THIS)

SQL Metadata Table
        ↓
Lookup (reads all rows)
        ↓
ForEach (loops each row)
        ↓
Copy Activity
        ↓
ADLS dynamic path









the output saved inside bronze/mm/bronze/mm/vendors because i gave the folder name twice as
sink dataset as


and in folder name in the pipeline .

🧠 GOLDEN RULE (NEVER FORGET)

LayerWhat goes here
DatasetONLY container (bronze)
Copy Activityfolder + file logic

🚨 WHY YOU GOT DUPLICATION

ADF is doing:

Dataset path + Copy Activity path

So:

bronze/mm + bronze/mm/vendors_runtime
= bronze/mm/bronze/mm/vendors_runtime

I implemented a metadata-driven ingestion framework in Azure Data Factory
using Azure SQL Database as the control layer. A metadata table was created
in Azure SQL containing source system details, file names,
folder paths, and load configurations for SAP MM entities like vendors.
In ADF, a linked service and dataset were created to connect to the SQL metadata table.
A Lookup activity is used to read active records from the metadata table using a SQL query.
The output of the Lookup is passed to a ForEach activity, which iterates through each metadata record.
Inside the ForEach loop, a Copy Data activity is used to dynamically ingest data from the source to Azure Data Lake Storage Gen2.
The folder path and file name are parameterized using expressions like @item().folder_path and @item().file_name, enabling dynamic
and reusable data ingestion without hardcoding values.
This approach allows scalable onboarding of new data sources by simply updating metadata, without modifying the pipeline logic.

Fix + Scale Your Pipeline (Multi-Entity Design)

Right now you only ingest:

  • vendors.csv

Next, we make it generic for SAP MM + SAP SD

add three more rows in database for the materials , sales , customers




add the parameter in the dataset and customize in the path in the file path in the connection tab in dataset .

and add the @item().filepath and @item().folderpath.






ExpressionMeaning
@activity('Lookup1').output.valueAll rows returned by Lookup
@item()Current row inside ForEach
@item().folder_pathfolder_path column from current row
@item().file_namefile_name column from current row
@dataset().p_folderPathDataset parameter value passed from Copy Activity
@dataset().p_fileNameDataset parameter value passed from Copy Activity.


Next Hands-on Task: Add Databricks Notebook Activity to your ADF pipeline

You’re moving from data movement → data processing.


Extend your current pipeline:

Lookup → ForEach → Copy Activity (Bronze load)

to:

Lookup → ForEach → Copy Activity → Databricks Notebook Activity

creating the databricks workspace .

choose the workspace as hydrid instead of serverless . we need to practice it like enterprice .

or else inside the workspace we have only option to create sql warehopuse.

after changing and creating the databricks name sdmm-databricks1.

after creating we can see two things outside of my resource group.

NetworkWatcherRG


Azure Databricks Resource Group & Cluster Architecture Notes

🔷 Overview

In Databricks on Azure, resources are not fully contained inside a single Resource Group. Instead, Azure Databricks uses a split architecture with both user-managed and system-managed components.


🔷 Resource Group Types

1. User-Defined Resource Group (Primary RG)

This is the resource group created by the user when provisioning Databricks workspace.

Contains:

  • Databricks Workspace
  • Workspace configuration
  • Linked services (some cases)

Example:

  • databricks-rg-de-learning

2. Databricks-Managed Resource Group (Hidden/System RG)

When clusters are created, Azure automatically creates additional infrastructure in a managed resource group.

Naming pattern:

  • databricks-rg-<workspace-name>-<random-string>

Contains:

  • Cluster VMs (Driver & Worker nodes)
  • Temporary compute infrastructure
  • Disks / managed identities (in some cases)

Key Point:
👉 These resources are auto-created and auto-managed by Databricks


3. Azure System Resource Group

Example:

  • NetworkWatcherRG

Contains:

  • Network Watcher
  • NSG flow logs
  • Network diagnostics tools

Key Point:
👉 This is created automatically by Azure subscription, not Databricks


🔷 Databricks Architecture (Important for Interviews)

Databricks follows a Control Plane + Data Plane architecture

1. Control Plane (Managed by Databricks)

  • Workspace UI
  • Job orchestration
  • Notebook management
  • Cluster control APIs

2. Data Plane (Runs in Azure Subscription)

  • Spark clusters (VMs)
  • Storage access (ADLS / Blob)
  • Data processing workloads

🔷 Why Resources Appear Outside Your Resource Group

Reason:

Azure Databricks dynamically creates infrastructure in different resource groups for:

  • Cluster scalability
  • Isolation of compute resources
  • Security separation
  • Auto-scaling and lifecycle management

🔷 Key Interview Points

  • Databricks uses separate control plane and data plane architecture
  • Cluster compute is deployed in Databricks-managed resource groups
  • User resource group does NOT contain all runtime infrastructure
  • NetworkWatcherRG is Azure system-managed, not Databricks-related
  • Compute resources are ephemeral and auto-provisioned

🔷 One-Liner Summary (Very Important)

In Azure Databricks, the workspace resides in the user resource group,

 

but compute clusters are created in Databricks-managed resource groups within the Azure data plane,

 

while Azure system services like Network Watcher exist in separate subscription-level resource groups


Step 1 : creating the compute.

✅ FINAL RECOMMENDED CONFIG (BEST FOR YOU)

Use this:

  • Compute name: learning-cluster
  • Policy: Unrestricted
  • Runtime: 17.3 LTS (OK) or 15.x LTS (safer)
  • Photon: ON
  • Node type: Standard_DC4as_v5 (OK)
  • Mode: Single Node
  • Auto terminate: 120 mins
  • Access mode: Single User

🧠 Interview insight (important)

You can say:

“A single-node cluster in Databricks is used for development and learning,

where driver and worker run on the same VM. In production,

multi-node clusters are used for distributed processing with separate driver and worker

nodes.” 




🔷 Key Difference (Interview Table)

FeatureDEV (Serverless)PROD (Dedicated Compute)
PurposeExperimentationProduction pipelines
SetupNoneConfig required
ScalingAutomaticControlled
Cost modelPay per useOptimized for long runs
Performance tuningLimitedFull control
ReliabilityMediumHigh (SLA-driven)


Connecting ADLS with databricks.

Here’s a clean interview-ready summary you can copy and paste:


✅ Azure Databricks → ADLS Gen2 (Unity Catalog) Connection Summary

To connect Azure Databricks to ADLS Gen2 using Unity Catalog, we use Azure Managed Identity with Access Connector, which removes the need for storing secrets or keys.

🔷 Step 1: Azure Setup

  • Create an Azure Access Connector for Azure Databricks

  • Ensure System-assigned Managed Identity = ON

  • In the ADLS Gen2 Storage Account (IAM), assign:

    • Storage Blob Data Contributor role

    • To the Access Connector’s managed identity





🔷 Step 2: Create Storage Credential (Databricks)

In Databricks Catalog Explorer:

  • Create a Storage Credential

  • Type: Azure Managed Identity

  • Provide Access Connector Resource ID from Azure:

/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/accessConnectors/<connector-name>

🔷 Step 3: Create External Location

Map ADLS path in Unity Catalog:

CREATE EXTERNAL LOCATION bronze_loc
URL 'abfss://bronze@<storage-account>.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL adls_credential);



show storage credentials; CREATE EXTERNAL LOCATION bronze_loc URL 'abfss://bronze@sdmm.dfs.core.windows.net/' WITH (STORAGE CREDENTIAL adls_credential;

🔷 Step 4: Grant Permissions

Grant access at Unity Catalog level:

GRANT READ FILES, WRITE FILES
ON EXTERNAL LOCATION bronze_loc
TO `data_engineers`;

(Only owner or metastore admin can execute this.)


🔷 Key URL Format (ADLS Gen2)

abfss://<container>@<storage-account>.dfs.core.windows.net/

Example:

abfss://bronze@datalakeprod01.dfs.core.windows.net/

🔷 ADF Role (Orchestration)

  • ADF is used for pipeline orchestration

  • It triggers Databricks notebooks and moves data

  • It does NOT manage storage connectivity directly


🔥 Key Point (Interview Statement)

Azure Databricks connects to ADLS Gen2 using Unity Catalog with Azure Managed Identity via an Access Connector.

 We create a storage credential using the connector resource ID, define an external location pointing to the ADLS container,

 and manage access using Unity Catalog permissions.

This ensures secure, keyless, and governed data access.










Comments

Popular posts from this blog

Entity Relationship (ER) Diagram Model with DBMS Example

Reference :   Entity Relationship (ER) Diagram Model with DBMS Example What is ER Diagram? ER Diagram  stands for Entity Relationship Diagram, also known as ERD is a diagram that displays the relationship of entity sets stored in a database. In other words, ER diagrams help to explain the logical structure of databases. ER diagrams are created based on three basic concepts: entities, attributes and relationships. ER Diagrams contain different symbols that use rectangles to represent entities, ovals to define attributes and diamond shapes to represent relationships. At first look, an ER diagram looks very similar to the flowchart. However, ER Diagram includes many specialized symbols, and its meanings make this model unique. The purpose of ER Diagram is to represent the entity framework infrastructure. Entity Relationship Diagram Example Table of Content: What is ER Diagram? What is ER Model? History of ER models Why use ER Diagrams? Facts about ER Diagram Model ER Diagram...

SQL Joins and advanced joins and Subqueries

  Refernce :  Expert Guide to Advanced SQL Joins: What You Need to Know It's helpful to visualize how these different SQL joins work. Here's a breakdown in a table-like format, along with explanations: SQL Join Types Overview Join Type Description Key Characteristics Use Cases INNER JOIN Returns rows where there is a match in both tables. - Shows only matching records. - Excludes unmatched rows from both tables. - Retrieving related data that exists in both tables. - Finding records with corresponding entries in another table. LEFT OUTER JOIN (LEFT JOIN) Returns all rows from the left table, and the matched rows from the right table. - Includes all records from the left table. - Fills in NULL values for columns from the right table where there's no match. - Retrieving all records from one table and their related data from another, even if some records don't have matches. - Finding records in one table that don't have corresponding entries in another. RIGHT OUTER JO...

GIT BASH

  Bash Shell: Git Bash uses the Bash (Bourne Again SHell) command-line interpreter. This means you can use many of the same commands you'd find in a Linux or macOS terminal. Git Integration: Git Bash is tightly integrated with Git, making it easy to execute Git commands Essential Commands: Navigation: pwd : Prints the current working directory. ls : Lists files and directories in the current directory. cd <directory> : Changes the current directory. cd .. : Moves to the parent directory. File Management: mkdir <directory> : Creates a new directory. touch <file> : Creates a new file. rm <file> : Removes a file. rmdir <directory> : Removes an empty directory. Git Commands: git init : Initializes a new Git repository. git clone <repository URL> : Clones an existing Git repository. git status : Displays the status of your working directory. git add <file> : Adds a file to the staging area. git commit -m "commit message" : Commits chan...