Azure Pipeline Creation and Configuration Steps
To log in to the Azure portal, a Microsoft account is required. After creating the account, sign in to the Azure portal and proceed with the following steps to build the data pipeline.
1. Create Resource Group and Storage Account
- Create a Resource Group in Azure.
- Under the resource group, create the required resources for the pipeline.
Azure Data Lake Storage (ADLS Gen2)
- Create a Storage Account.
- Enable Hierarchical Namespace to convert it into Data Lake Storage Gen2.
-
Inside the storage account:
-
Create a container (e.g.,
blog-container). - Organize data using folders/subfolders (can be created dynamically or manually using directory structure).
-
Create a container (e.g.,
Storage Structure Example
Create a storage account named sdmm:
-
sdmm-
gold-
processing -
sales
-
-
silver-
sd -
mm
-
-
bronze-
sd -
mm
-
-
2. Azure Data Factory (ADF) Setup
- Create an Azure Data Factory instance.
- Go to Managed Identities and enable System Assigned Managed Identity, then save.
Access Configuration
- Navigate to the Storage Account → IAM (Access Control).
-
Assign roles:
- Storage Blob Data Contributor
- Assign this role to the Data Factory managed identity
3. Linked Service Configuration
- Open ADF Studio.
- Go to Manage → Linked Services.
-
Create a linked service connecting:
- Azure Data Lake Storage Gen2
- Azure Data Factory
- Test the connection and ensure it succeeds.
- Click Create to save the linked service.
4. Dataset Creation
-
Create a file in storage:
-
Example path:
sap/mm/vendors.csv
-
Example path:
Source Dataset
- In ADF Studio: Author → Datasets → New
-
Choose:
- Azure → Azure Data Lake Storage Gen2
- Select the linked service created earlier.
- Browse to the file path.
- Set Import Schema = From Connection/Source (to retrieve metadata automatically).
- Create dataset.
Sink Dataset
-
Create another dataset:
- Type: Delimited Text
-
Name:
delimitedtext_sink
5. Pipeline Creation
- Create a new pipeline in ADF.
- Add a Copy Activity (used for moving data between source and destination).
Configure:
- Source dataset → ADLS Gen2 source file
- Sink dataset → destination dataset
Then:
- Click Validate All
- Click Publish All
Run the pipeline:
- Use Trigger → Trigger Now
- If no errors occur, the pipeline executes successfully.
6. Monitoring
- Go to the Monitor tab in ADF.
- Check pipeline execution status.
- Verify that the file is successfully copied into the bronze layer folder.
its not the good to single pipeline for one time use . we have to parameterized the pipeline so that we can able to run as many pipeline constructed under single pipeline.
7. Parameterized Pipeline (Best Practice)
Instead of creating separate pipelines for each file, use parameters to make the pipeline reusable.
Pipeline Parameters
-
p_fileName = vendors.csv -
p_folderPath = bronze/mm/vendors
Dataset Parameters Mapping
In the dataset, define parameters and map them as:
-
fileName→@pipeline().parameters.p_fileName -
folderPath→@pipeline().parameters.p_folderPath
Best Practices (Industry Standard)
- Pipeline controls the workflow logic
- Dataset controls the data structure/location
- Parameters connect pipeline and datasets dynamically
- Always prefer parameterized pipelines for reusability and scalability
Always use storage browser for eaier access for inside the storage accounts.
Introduce REAL WORLD STRUCTURE (SAP MM + SD)
We now design proper enterprise folders:
├── sap/│ ├── mm/│ │ ├── vendors/│ │ ├── materials/│ │ ├── purchase_orders/│ ││ ├── sd/│ ├── customers/│ ├── sales_orders/│ ├── deliveries/Create the sub folder by clicking add dictionary under sdto create the meta table. we need sql database to store the table .to creating the sql databse .Created SQL database successfully.Go to resourece group to check the databse have created sucessfully.We can see two entities for database first one is logical database which is for host and connection andother on eis for our use.To create the table give to database under resource group and click query editor.enter the credencials or microsoft entra auth click connect .create the table .CREATE TABLE dbo.ingestion_config(config_id INT IDENTITY(1,1),source_system VARCHAR(50),entity_name VARCHAR(100),file_name VARCHAR(100),folder_path VARCHAR(200),load_type VARCHAR(20),is_active CHAR(1),created_date DATETIME DEFAULT GETDATE());After creating Insert Your First Record (Vendors)INSERT INTO dbo.ingestion_config(source_system,entity_name,file_name,folder_path,load_type,is_active)VALUES('SAP_MM','vendors','vendors.csv','sap/mm/vendors','FULL','Y');create the dataset for metatable and linked with the database linked service which is already created.DS_METADATA_CONFIG dataset is created .create the pipeline and create the lookup activity .under that activity choose source dataset as dataset created linkd with sql database.in lookup cong:The Lookup activity is used to read data from a source and return it as output inside the pipeline.Source tab:
- Dataset:
DS_METADATA_CONFIGQuery:
SELECT *
FROM dbo.ingestion_config
WHERE is_active = 'Y';Set:
First row only = FALSE
in for loop :
in setting at items :@activity('Lookup1').output.value
inside the for loop add copy activity.
copy activity:
source dataset : database linked dataset.
sink dataset : create the dataset with adls to land.🧠SIMPLE FLOW (REMEMBER THIS)
the output saved inside bronze/mm/bronze/mm/vendors because i gave the folder name twice assink dataset as🧠GOLDEN RULE (NEVER FORGET)
Layer What goes here Dataset ONLY container (bronze) Copy Activity folder + file logic 🚨 WHY YOU GOT DUPLICATION
ADF is doing:
Dataset path + Copy Activity pathSo:
I implemented a metadata-driven ingestion framework in Azure Data Factorybronze/mm + bronze/mm/vendors_runtime
= bronze/mm/bronze/mm/vendors_runtimeusing Azure SQL Database as the control layer. A metadata table was createdin Azure SQL containing source system details, file names,folder paths, and load configurations for SAP MM entities like vendors.In ADF, a linked service and dataset were created to connect to the SQL metadata table.A Lookup activity is used to read active records from the metadata table using a SQL query.The output of the Lookup is passed to a ForEach activity, which iterates through each metadata record.Inside the ForEach loop, a Copy Data activity is used to dynamically ingest data from the source to Azure Data Lake Storage Gen2.The folder path and file name are parameterized using expressions like @item().folder_path and @item().file_name, enabling dynamicand reusable data ingestion without hardcoding values.This approach allows scalable onboarding of new data sources by simply updating metadata, without modifying the pipeline logic.Fix + Scale Your Pipeline (Multi-Entity Design)
Right now you only ingest:
- vendors.csv
Next, we make it generic for SAP MM + SAP SD
add three more rows in database for the materials , sales , customers
add the parameter in the dataset and customize in the path in the file path in the connection tab in dataset .
and add the @item().filepath and @item().folderpath.
Expression Meaning @activity('Lookup1').output.valueAll rows returned by Lookup @item()Current row inside ForEach @item().folder_pathfolder_path column from current row @item().file_namefile_name column from current row @dataset().p_folderPathDataset parameter value passed from Copy Activity @dataset().p_fileNameDataset parameter value passed from Copy Activity. Next Hands-on Task: Add Databricks Notebook Activity to your ADF pipeline
You’re moving from data movement → data processing.
Extend your current pipeline:
Lookup → ForEach → Copy Activity (Bronze load)to:
Lookup → ForEach → Copy Activity → Databricks Notebook Activity
creating the databricks workspace .
choose the workspace as hydrid instead of serverless . we need to practice it like enterprice .
or else inside the workspace we have only option to create sql warehopuse.
after changing and creating the databricks name sdmm-databricks1.
after creating we can see two things outside of my resource group.
Azure Databricks Resource Group & Cluster Architecture Notes
🔷 Overview
In Databricks on Azure, resources are not fully contained inside a single Resource Group. Instead, Azure Databricks uses a split architecture with both user-managed and system-managed components.
🔷 Resource Group Types
1. User-Defined Resource Group (Primary RG)
This is the resource group created by the user when provisioning Databricks workspace.
Contains:
- Databricks Workspace
- Workspace configuration
- Linked services (some cases)
Example:
databricks-rg-de-learning
2. Databricks-Managed Resource Group (Hidden/System RG)
When clusters are created, Azure automatically creates additional infrastructure in a managed resource group.
Naming pattern:
databricks-rg-<workspace-name>-<random-string>Contains:
- Cluster VMs (Driver & Worker nodes)
- Temporary compute infrastructure
- Disks / managed identities (in some cases)
Key Point:
👉 These resources are auto-created and auto-managed by Databricks
3. Azure System Resource Group
Example:
NetworkWatcherRGContains:
- Network Watcher
- NSG flow logs
- Network diagnostics tools
Key Point:
👉 This is created automatically by Azure subscription, not Databricks
🔷 Databricks Architecture (Important for Interviews)
Databricks follows a Control Plane + Data Plane architecture
1. Control Plane (Managed by Databricks)
- Workspace UI
- Job orchestration
- Notebook management
- Cluster control APIs
2. Data Plane (Runs in Azure Subscription)
- Spark clusters (VMs)
- Storage access (ADLS / Blob)
- Data processing workloads
🔷 Why Resources Appear Outside Your Resource Group
Reason:
Azure Databricks dynamically creates infrastructure in different resource groups for:
- Cluster scalability
- Isolation of compute resources
- Security separation
- Auto-scaling and lifecycle management
🔷 Key Interview Points
- Databricks uses separate control plane and data plane architecture
- Cluster compute is deployed in Databricks-managed resource groups
- User resource group does NOT contain all runtime infrastructure
NetworkWatcherRGis Azure system-managed, not Databricks-related- Compute resources are ephemeral and auto-provisioned
🔷 One-Liner Summary (Very Important)
In Azure Databricks, the workspace resides in the user resource group,
but compute clusters are created in Databricks-managed resource groups within the Azure data plane,
while Azure system services like Network Watcher exist in separate subscription-level resource groups
Step 1 : creating the compute.
✅ FINAL RECOMMENDED CONFIG (BEST FOR YOU)
Use this:
- Compute name:
learning-cluster- Policy: Unrestricted
- Runtime: 17.3 LTS (OK) or 15.x LTS (safer)
- Photon: ON
- Node type: Standard_DC4as_v5 (OK)
- Mode: Single Node
- Auto terminate: 120 mins
- Access mode: Single User
🧠Interview insight (important)
You can say:
“A single-node cluster in Databricks is used for development and learning,
where driver and worker run on the same VM. In production,
multi-node clusters are used for distributed processing with separate driver and worker
nodes.”
🔷 Key Difference (Interview Table)
Feature DEV (Serverless) PROD (Dedicated Compute) Purpose Experimentation Production pipelines Setup None Config required Scaling Automatic Controlled Cost model Pay per use Optimized for long runs Performance tuning Limited Full control Reliability Medium High (SLA-driven) Connecting ADLS with databricks.Here’s a clean interview-ready summary you can copy and paste:
✅ Azure Databricks → ADLS Gen2 (Unity Catalog) Connection Summary
To connect Azure Databricks to ADLS Gen2 using Unity Catalog, we use Azure Managed Identity with Access Connector, which removes the need for storing secrets or keys.
🔷 Step 1: Azure Setup
Create an Azure Access Connector for Azure Databricks
Ensure System-assigned Managed Identity = ON
In the ADLS Gen2 Storage Account (IAM), assign:
Storage Blob Data Contributor role
To the Access Connector’s managed identity
🔷 Step 2: Create Storage Credential (Databricks)
In Databricks Catalog Explorer:
Create a Storage Credential
Type: Azure Managed Identity
Provide Access Connector Resource ID from Azure:
/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/accessConnectors/<connector-name>
🔷 Step 3: Create External Location
Map ADLS path in Unity Catalog:
CREATE EXTERNAL LOCATION bronze_loc URL 'abfss://bronze@<storage-account>.dfs.core.windows.net/' WITH (STORAGE CREDENTIAL adls_credential);
show storage credentials; CREATE EXTERNAL LOCATION bronze_loc URL 'abfss://bronze@sdmm.dfs.core.windows.net/' WITH (STORAGE CREDENTIAL adls_credential;
🔷 Step 4: Grant Permissions
Grant access at Unity Catalog level:
GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION bronze_loc TO `data_engineers`;(Only owner or metastore admin can execute this.)
🔷 Key URL Format (ADLS Gen2)
abfss://<container>@<storage-account>.dfs.core.windows.net/Example:
abfss://bronze@datalakeprod01.dfs.core.windows.net/🔷 ADF Role (Orchestration)
ADF is used for pipeline orchestration
It triggers Databricks notebooks and moves data
It does NOT manage storage connectivity directly
🔥 Key Point (Interview Statement)
Azure Databricks connects to ADLS Gen2 using Unity Catalog with Azure Managed Identity via an Access Connector.
We create a storage credential using the connector resource ID, define an external location pointing to the ADLS container,
and manage access using Unity Catalog permissions.
This ensures secure, keyless, and governed data access.
Comments
Post a Comment