Hands on project for ADF

Azure Pipeline Creation and Configuration Steps

To log in to the Azure portal, a Microsoft account is required. After creating the account, sign in to the Azure portal and proceed with the following steps to build the data pipeline.

1. Create Resource Group and Storage Account

Create a Resource Group in Azure.
Under the resource group, create the required resources for the pipeline.

Azure Data Lake Storage (ADLS Gen2)

Create a Storage Account.
Enable Hierarchical Namespace to convert it into Data Lake Storage Gen2.
Inside the storage account:
- Create a container (e.g., blog-container).
- Organize data using folders/subfolders (can be created dynamically or manually using directory structure).

Storage Structure Example

Create a storage account named sdmm:

sdmm
- gold
  - processing
  - sales
- silver
  - sd
  - mm
- bronze
  - sd
  - mm

2. Azure Data Factory (ADF) Setup

Create an Azure Data Factory instance.
Go to Managed Identities and enable System Assigned Managed Identity, then save.

Access Configuration

Navigate to the Storage Account → IAM (Access Control).
Assign roles:
- Storage Blob Data Contributor
- Assign this role to the Data Factory managed identity

3. Linked Service Configuration

Open ADF Studio.
Go to Manage → Linked Services.
Create a linked service connecting:
- Azure Data Lake Storage Gen2
- Azure Data Factory
Test the connection and ensure it succeeds.
Click Create to save the linked service.

THEN CLICK TEST CONNECTION IF THE CONNECTION IS SUCCESSFUL CLICK CREATE

CREATE SUCCESFULLY .

4. Dataset Creation

Create a file in storage:
- Example path: sap/mm/vendors.csv

Source Dataset

In ADF Studio: Author → Datasets → New
Choose:
- Azure → Azure Data Lake Storage Gen2
Select the linked service created earlier.
Browse to the file path.
Set Import Schema = From Connection/Source (to retrieve metadata automatically).
Create dataset.

Sink Dataset

Create another dataset:
- Type: Delimited Text
- Name: delimitedtext_sink

in there for the linked service choose the linked service linked with gen 2 storage and then below file path field will get open then choose respective file name

then choose for import schema : from connection or sorce so its takes the metadata from the file location.

click ok thus the dat set will be created .

like create the sink dataset named delinitedtext_sink

5. Pipeline Creation

Create a new pipeline in ADF.
Add a Copy Activity (used for moving data between source and destination).

Configure:

Source dataset → ADLS Gen2 source file
Sink dataset → destination dataset

Then:

Click Validate All
Click Publish All

Run the pipeline:

Use Trigger → Trigger Now
If no errors occur, the pipeline executes successfully.

click the copy activity unde rthat tab give source and sink dataset .

then give validate all to validate everything and then publish all .

to sucessfully trigger give trigger --> trigger now option.

if there is no errors then the pipeline runs sucessfully .

6. Monitoring

Go to the Monitor tab in ADF.
Check pipeline execution status.
Verify that the file is successfully copied into the bronze layer folder.

PARAMETERIZED PIPELINE

its not the good to single pipeline for one time use . we have to parameterized the pipeline so that we can able to run as many pipeline constructed under single pipeline.

for that inside the pipeline we can find the parameter tab under that give the value .

7. Parameterized Pipeline (Best Practice)

Instead of creating separate pipelines for each file, use parameters to make the pipeline reusable.

Pipeline Parameters

p_fileName = vendors.csv
p_folderPath = bronze/mm/vendors

Dataset Parameters Mapping

In the dataset, define parameters and map them as:

fileName → @pipeline().parameters.p_fileName
folderPath → @pipeline().parameters.p_folderPath

Key Concept

Pipeline Parameter → passes values
Dataset Parameter → receives values

Best Practices (Industry Standard)

Pipeline controls the workflow logic
Dataset controls the data structure/location
Parameters connect pipeline and datasets dynamically
Always prefer parameterized pipelines for reusability and scalability

Always use storage browser for eaier access for inside the storage accounts.

Introduce REAL WORLD STRUCTURE (SAP MM + SD)

We now design proper enterprise folders:


    
 ├── sap/
 │    ├── mm/
 │    │     ├── vendors/
 │    │     ├── materials/
 │    │     ├── purchase_orders/
 │    │
 │    ├── sd/
 │          ├── customers/
 │          ├── sales_orders/
 │          ├── deliveries/

Create the sub folder by clicking add dictionary under  sd 



in mm :



to create the meta table. we need sql database to store the table .
to creating the sql databse .





Created SQL database successfully.

Go to resourece group to check the databse have created sucessfully.
We can see two entities for database first one is logical database which is for host and connection and 
other on eis for our use.

To create the table give to database under resource group and click query editor.
enter the credencials or microsoft entra auth click connect .



create the table .

CREATE TABLE dbo.ingestion_config
(
    config_id INT IDENTITY(1,1),
    source_system VARCHAR(50),
    entity_name VARCHAR(100),
    file_name VARCHAR(100),
    folder_path VARCHAR(200),
    load_type VARCHAR(20),
    is_active CHAR(1),
    created_date DATETIME DEFAULT GETDATE()
);

After creating  Insert Your First Record (Vendors)
INSERT INTO dbo.ingestion_config
(
    source_system,
    entity_name,
    file_name,
    folder_path,
    load_type,
    is_active
)
VALUES
(
    'SAP_MM',
    'vendors',
    'vendors.csv',
    'sap/mm/vendors',
    'FULL',
    'Y'
);



one record was been created as the meta data table .


create the linked service for SQL database we have created earlier.

create the dataset for metatable and linked with the database linked service which is already created.




DS_METADATA_CONFIG dataset is created .

create the pipeline and create the lookup activity .
under that activity choose source dataset as dataset created linkd with sql database.




in lookup cong:
The Lookup activity is used to read data from a source and return it as output inside the pipeline.
Source tab:
Dataset: DS_METADATA_CONFIG
Query:
SELECT * 
FROM dbo.ingestion_config
WHERE is_active = 'Y';
Set:
First row only = FALSE

in for loop :

in setting at items :@activity('Lookup1').output.value

inside the for loop add copy activity.

copy activity:

source dataset : database linked dataset.
sink dataset : create the dataset with adls to land.


🧠 SIMPLE FLOW (REMEMBER THIS)

SQL Metadata Table
        ↓
Lookup (reads all rows)
        ↓
ForEach (loops each row)
        ↓
Copy Activity
        ↓
ADLS dynamic path









the output saved inside bronze/mm/bronze/mm/vendors because i gave the folder name twice as 
sink dataset as 


and in folder name in the pipeline .

🧠 GOLDEN RULE (NEVER FORGET)
Layer What goes here
Dataset ONLY container (bronze)
Copy Activity folder + file logic
🚨 WHY YOU GOT DUPLICATION
ADF is doing:
Dataset path + Copy Activity path
So:
bronze/mm + bronze/mm/vendors_runtime
= bronze/mm/bronze/mm/vendors_runtime

I implemented a metadata-driven ingestion framework in Azure Data Factory 
using Azure SQL Database as the control layer. A metadata table was created 
in Azure SQL containing source system details, file names, 
folder paths, and load configurations for SAP MM entities like vendors. 
In ADF, a linked service and dataset were created to connect to the SQL metadata table.
 A Lookup activity is used to read active records from the metadata table using a SQL query. 
The output of the Lookup is passed to a ForEach activity, which iterates through each metadata record.
 Inside the ForEach loop, a Copy Data activity is used to dynamically ingest data from the source to Azure Data Lake Storage Gen2. 
The folder path and file name are parameterized using expressions like @item().folder_path and @item().file_name, enabling dynamic
 and reusable data ingestion without hardcoding values. 
This approach allows scalable onboarding of new data sources by simply updating metadata, without modifying the pipeline logic.

Fix + Scale Your Pipeline (Multi-Entity Design)

Right now you only ingest:
vendors.csv
Next, we make it generic for SAP MM + SAP SD
add three more rows in database for the materials , sales , customers



add the parameter in the dataset and customize in the path in the file path in the connection tab in dataset .
and add the @item().filepath and @item().folderpath.





Expression Meaning
@activity('Lookup1').output.value All rows returned by Lookup
@item() Current row inside ForEach
@item().folder_path folder_path column from current row
@item().file_name file_name column from current row
@dataset().p_folderPath Dataset parameter value passed from Copy Activity
@dataset().p_fileName Dataset parameter value passed from Copy Activity.


Next Hands-on Task: Add Databricks Notebook Activity to your ADF pipeline

You’re moving from data movement → data processing.

Extend your current pipeline:

Lookup → ForEach → Copy Activity (Bronze load)

to:
Lookup → ForEach → Copy Activity → Databricks Notebook Activity

creating the databricks workspace .

choose the workspace as hydrid instead of serverless . we need to practice it like enterprice .

or else inside the workspace we have only option to create sql warehopuse.

after changing and creating the databricks name sdmm-databricks1.

after creating we can see two things outside of my resource group.
databricks-rg-de-learning-cluster1-l3opw2gvejfis 


NetworkWatcherRG

Azure Databricks Resource Group & Cluster Architecture Notes

🔷 Overview

In Databricks on Azure, resources are not fully contained inside a single Resource Group. Instead, Azure Databricks uses a split architecture with both user-managed and system-managed components.


🔷 Resource Group Types

1. User-Defined Resource Group (Primary RG)

This is the resource group created by the user when provisioning Databricks workspace.

Contains:



Databricks Workspace


Workspace configuration


Linked services (some cases)


Example:



databricks-rg-de-learning



2. Databricks-Managed Resource Group (Hidden/System RG)

When clusters are created, Azure automatically creates additional infrastructure in a managed resource group.

Naming pattern:



databricks-rg-<workspace-name>-<random-string>


Contains:



Cluster VMs (Driver & Worker nodes)


Temporary compute infrastructure


Disks / managed identities (in some cases)


Key Point:

👉 These resources are auto-created and auto-managed by Databricks


3. Azure System Resource Group

Example:



NetworkWatcherRG


Contains:



Network Watcher


NSG flow logs


Network diagnostics tools


Key Point:

👉 This is created automatically by Azure subscription, not Databricks


🔷 Databricks Architecture (Important for Interviews)

Databricks follows a Control Plane + Data Plane architecture

1. Control Plane (Managed by Databricks)



Workspace UI


Job orchestration


Notebook management


Cluster control APIs



2. Data Plane (Runs in Azure Subscription)



Spark clusters (VMs)


Storage access (ADLS / Blob)


Data processing workloads



🔷 Why Resources Appear Outside Your Resource Group

Reason:

Azure Databricks dynamically creates infrastructure in different resource groups for:



Cluster scalability


Isolation of compute resources


Security separation


Auto-scaling and lifecycle management



🔷 Key Interview Points



Databricks uses separate control plane and data plane architecture


Cluster compute is deployed in Databricks-managed resource groups


User resource group does NOT contain all runtime infrastructure


NetworkWatcherRG is Azure system-managed, not Databricks-related


Compute resources are ephemeral and auto-provisioned



🔷 One-Liner Summary (Very Important)
In Azure Databricks, the workspace resides in the user resource group, 
 
but compute clusters are created in Databricks-managed resource groups within the Azure data plane,
 
while Azure system services like Network Watcher exist in separate subscription-level resource groups

Step 1 : creating the compute.
✅ FINAL RECOMMENDED CONFIG (BEST FOR YOU)

Use this:

Compute name: learning-cluster


Policy: Unrestricted


Runtime: 17.3 LTS (OK) or 15.x LTS (safer)


Photon: ON


Node type: Standard_DC4as_v5 (OK)


Mode: Single Node


Auto terminate: 120 mins


Access mode: Single User



🧠 Interview insight (important)

You can say:


“A single-node cluster in Databricks is used for development and learning, 
where driver and worker run on the same VM. In production, 
multi-node clusters are used for distributed processing with separate driver and worker 
nodes.” 



🔷 Key Difference (Interview Table)

Feature DEV (Serverless) PROD (Dedicated Compute)
Purpose Experimentation Production pipelines
Setup None Config required
Scaling Automatic Controlled
Cost model Pay per use Optimized for long runs
Performance tuning Limited Full control
Reliability Medium High (SLA-driven)


Connecting ADLS with databricks.

Here’s a clean interview-ready summary you can copy and paste:
✅ Azure Databricks → ADLS Gen2 (Unity Catalog) Connection Summary
To connect Azure Databricks to ADLS Gen2 using Unity Catalog, we use Azure Managed Identity with Access Connector, which removes the need for storing secrets or keys.
🔷 Step 1: Azure Setup
Create an Azure Access Connector for Azure Databricks
Ensure System-assigned Managed Identity = ON
In the ADLS Gen2 Storage Account (IAM), assign:
Storage Blob Data Contributor role
To the Access Connector’s managed identity


🔷 Step 2: Create Storage Credential (Databricks)
In Databricks Catalog Explorer:
Create a Storage Credential
Type: Azure Managed Identity
Provide Access Connector Resource ID from Azure:
/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/accessConnectors/<connector-name>
🔷 Step 3: Create External Location
Map ADLS path in Unity Catalog:
CREATE EXTERNAL LOCATION bronze_loc
URL 'abfss://bronze@<storage-account>.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL adls_credential);




show storage credentials;

CREATE EXTERNAL LOCATION bronze_loc
URL 'abfss://bronze@sdmm.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL adls_credential;



🔷 Step 4: Grant Permissions
Grant access at Unity Catalog level:
GRANT READ FILES, WRITE FILES
ON EXTERNAL LOCATION bronze_loc
TO `data_engineers`;
(Only owner or metastore admin can execute this.)
🔷 Key URL Format (ADLS Gen2)
abfss://<container>@<storage-account>.dfs.core.windows.net/
Example:
abfss://bronze@datalakeprod01.dfs.core.windows.net/
🔷 ADF Role (Orchestration)
ADF is used for pipeline orchestration
It triggers Databricks notebooks and moves data
It does NOT manage storage connectivity directly
🔥 Key Point (Interview Statement)
Azure Databricks connects to ADLS Gen2 using Unity Catalog with Azure Managed Identity via an Access Connector.
 We create a storage credential using the connector resource ID, define an external location pointing to the ADLS container, 
 and manage access using Unity Catalog permissions. 
This ensures secure, keyless, and governed data access.

Layer	What goes here
Dataset	ONLY container (bronze)
Copy Activity	folder + file logic

Expression	Meaning
`@activity('Lookup1').output.value`	All rows returned by Lookup
`@item()`	Current row inside ForEach
`@item().folder_path`	folder_path column from current row
`@item().file_name`	file_name column from current row
`@dataset().p_folderPath`	Dataset parameter value passed from Copy Activity
`@dataset().p_fileName`	Dataset parameter value passed from Copy Activity.

Feature	DEV (Serverless)	PROD (Dedicated Compute)
Purpose	Experimentation	Production pipelines
Setup	None	Config required
Scaling	Automatic	Controlled
Cost model	Pay per use	Optimized for long runs
Performance tuning	Limited	Full control
Reliability	Medium	High (SLA-driven)

Keerthana Blogs

Hands on project for ADF

Azure Pipeline Creation and Configuration Steps

1. Create Resource Group and Storage Account

Azure Data Lake Storage (ADLS Gen2)

Storage Structure Example

2. Azure Data Factory (ADF) Setup

Access Configuration

3. Linked Service Configuration

4. Dataset Creation

Source Dataset

Sink Dataset

5. Pipeline Creation

6. Monitoring

7. Parameterized Pipeline (Best Practice)

Pipeline Parameters

Dataset Parameters Mapping

Key Concept

Best Practices (Industry Standard)

Introduce REAL WORLD STRUCTURE (SAP MM + SD)

Source tab:

Query:

Set:

🧠 SIMPLE FLOW (REMEMBER THIS)

🧠 GOLDEN RULE (NEVER FORGET)

🚨 WHY YOU GOT DUPLICATION

Fix + Scale Your Pipeline (Multi-Entity Design)

Next Hands-on Task: Add Databricks Notebook Activity to your ADF pipeline

NetworkWatcherRG

Azure Databricks Resource Group & Cluster Architecture Notes

🔷 Overview

🔷 Resource Group Types

1. User-Defined Resource Group (Primary RG)

2. Databricks-Managed Resource Group (Hidden/System RG)

3. Azure System Resource Group

🔷 Databricks Architecture (Important for Interviews)

1. Control Plane (Managed by Databricks)

2. Data Plane (Runs in Azure Subscription)

🔷 Why Resources Appear Outside Your Resource Group

Reason:

🔷 Key Interview Points

🔷 One-Liner Summary (Very Important)

✅ FINAL RECOMMENDED CONFIG (BEST FOR YOU)

🧠 Interview insight (important)

🔷 Key Difference (Interview Table)

✅ Azure Databricks → ADLS Gen2 (Unity Catalog) Connection Summary

🔷 Step 1: Azure Setup

🔷 Step 2: Create Storage Credential (Databricks)

/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/accessConnectors/<connector-name>

🔷 Step 3: Create External Location

🔷 Step 4: Grant Permissions

🔷 Key URL Format (ADLS Gen2)

🔷 ADF Role (Orchestration)

🔥 Key Point (Interview Statement)

Azure Databricks connects to ADLS Gen2 using Unity Catalog with Azure Managed Identity via an Access Connector.

Comments

Post a Comment

Popular posts from this blog

Entity Relationship (ER) Diagram Model with DBMS Example

SQL Joins and advanced joins and Subqueries

GIT BASH