
ADF - ingestion from website and from the storage blob

 


On the ECDC website the reports are published weekly rather than daily, so for easier access the author provided the CSV files in a GitHub repository.





Always review the data before ingestion to understand its structure and contents.






Test the connection and click Create.


Now the linked service is created.

Next, create a new dataset.



  • Name: ds_cases_deaths_raw_csv_http - This is the unique identifier for this dataset within ADF. The naming convention suggests it's for raw CSV data related to cases and deaths, retrieved via HTTP.

  • Linked service: ls_http_opendata_ecdc_europa_eu - This is the connection information that tells ADF where to find the data. In this case, it's an HTTP linked service, pointing to a public data source from ECDC (European Centre for Disease Prevention and Control). The pencil icon next to it indicates it can be edited.

  • Relative URL: covid19/nationalcasedeath/csv/covid19/raw/main/ecdc_data/cases_deaths.csv - This is the specific path to the CSV file relative to the base URL defined in the linked service. It points to a file named cases_deaths.csv within a nested folder structure related to COVID-19 data. This suggests that the data is publicly available COVID-19 statistics.

  • First row as header: This checkbox is selected (indicated by the checkmark). This means that ADF will treat the first row of the cases_deaths.csv file as column headers, not as data. This is crucial for correct schema inference and data mapping.

  • Import schema:

    • From connection/store: This option is selected. ADF will try to infer the schema (column names and data types) directly from the CSV file based on the first row (headers) and data samples. This is common for initial setup.

    • From sample file: (Not selected) - You could provide a separate sample file to infer the schema from.

    • None: (Not selected) - You would manually define the schema.

In essence, this dataset defines how to access a specific CSV file containing COVID-19 cases and deaths data, located on an HTTP server managed by ECDC, and how to interpret its structure (first row as headers, schema inferred).
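For reference, here is a minimal sketch of what the underlying JSON definition of such a dataset might look like. The property names follow ADF's DelimitedText dataset schema, and the relative URL is copied from the configuration above, so adjust both to your own setup:

json

{
    "name": "ds_cases_deaths_raw_csv_http",
    "properties": {
        "linkedServiceName": {
            "referenceName": "ls_http_opendata_ecdc_europa_eu",
            "type": "LinkedServiceReference"
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "HttpServerLocation",
                "relativeUrl": "covid19/nationalcasedeath/csv/covid19/raw/main/ecdc_data/cases_deaths.csv"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}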

Now create another dataset for the sink.





Then click OK; the dataset is created.



Create a Copy activity that copies the data from the HTTP source dataset to the sink dataset.
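A minimal sketch of the Copy activity definition, assuming a hypothetical sink dataset named ds_cases_deaths_raw_csv_dl that points at the storage account:

json

{
    "name": "Copy cases and deaths data",
    "type": "Copy",
    "inputs": [
        { "referenceName": "ds_cases_deaths_raw_csv_http", "type": "DatasetReference" }
    ],
    "outputs": [
        { "referenceName": "ds_cases_deaths_raw_csv_dl", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": {
            "type": "DelimitedTextSource",
            "storeSettings": { "type": "HttpReadSettings", "requestMethod": "GET" }
        },
        "sink": {
            "type": "DelimitedTextSink",
            "storeSettings": { "type": "AzureBlobFSWriteSettings" }
        }
    }
}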









🔹 Parameters vs Variables in ADF

Feature     | Parameters                          | Variables
Scope       | Pipeline level                      | Pipeline level
Mutability  | Immutable (can't change after set)  | Mutable (can change during pipeline)
Use Cases   | Input values for pipeline           | Temporary storage during pipeline

🔹 Using Parameters

1. Define Parameters

Go to the pipeline → Parameters tab → Click + New to add a parameter.

plaintext

Name: myParam
Type: String
Default value: (optional)
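In the pipeline's JSON this shows up as a parameters block, roughly as in the sketch below (the pipeline name and default value are just examples):

json

{
    "name": "pl_ingest_cases_deaths",
    "properties": {
        "parameters": {
            "myParam": {
                "type": "String",
                "defaultValue": "sample-value"
            }
        },
        "activities": []
    }
}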

2. Pass Parameters to Activities

For example, in a Copy Data activity:

  • Go to the Source or Sink.

  • In the dynamic content box (click the "Add dynamic content" link), you can use:


@pipeline().parameters.myParam

You can also pass parameters to datasets or linked services.
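For example, a dataset can declare its own parameter and use it in the relative URL via a dataset expression. A hedged sketch, with illustrative dataset and parameter names:

json

{
    "name": "ds_raw_csv_http",
    "properties": {
        "parameters": {
            "relativeURL": { "type": "string" }
        },
        "linkedServiceName": {
            "referenceName": "ls_http_opendata_ecdc_europa_eu",
            "type": "LinkedServiceReference"
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "HttpServerLocation",
                "relativeUrl": {
                    "value": "@dataset().relativeURL",
                    "type": "Expression"
                }
            },
            "firstRowAsHeader": true
        }
    }
}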

3. Pass Parameters When Triggering a Pipeline

When running a pipeline manually or via another pipeline (using Execute Pipeline), you can provide parameter values.
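For instance, an Execute Pipeline activity can pass concrete values for the called pipeline's parameters (the pipeline name and value here are illustrative):

json

{
    "name": "Run ingest pipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "pl_ingest_cases_deaths",
            "type": "PipelineReference"
        },
        "parameters": {
            "myParam": "cases_deaths.csv"
        },
        "waitOnCompletion": true
    }
}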


🔹 Using Variables

1. Define Variables

In the pipeline → Variables tab → Click + New.

plaintext

Name: myVar
Type: String / Boolean / Array

2. Set Variable (Set Variable Activity)

  • Drag Set Variable activity into your pipeline.

  • In the settings:

    • Variable name: myVar

    • Value: Use dynamic content, for example:


'Hello World'
@concat('Folder/', pipeline().parameters.fileName)

3. Modify Variable (Append Variable Activity)

Used only for Array variables to add elements during the pipeline.
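A minimal sketch of an Append Variable activity, assuming the pipeline already has an Array variable named processedFiles:

json

{
    "name": "Append processed file",
    "type": "AppendVariable",
    "typeProperties": {
        "variableName": "processedFiles",
        "value": {
            "value": "@pipeline().parameters.fileName",
            "type": "Expression"
        }
    }
}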


🔹 Using Variables and Parameters in Activities

Here are a few examples of dynamic usage:

📌 In Copy Activity Source/Sink path:

adf

@concat('input/', pipeline().parameters.fileName)

📌 In If Condition Activity:

adf

@equals(variables('myVar'), 'expectedValue')

📌 In Stored Procedure Activity:

Pass parameter to SP:

adf

@variables('sqlParam')

🔹 Common Use Case

Suppose you want to pass a file name as a parameter and use it dynamically in a copy activity:

1. Define a parameter: fileName

2. In source dataset parameter:

adf

@pipeline().parameters.fileName

3. Set a variable based on this:

adf

@concat('processed/', pipeline().parameters.fileName)

✅ Summary

  • Parameters: Set once, used for configuration and input.

  • Variables: Used to store intermediate values during pipeline execution.

  • You can use expressions like @concat, @pipeline(), @variables(), etc., in most ADF activity fields via Dynamic Content.


Lookup activity and ForEach loop activity

[Lookup File List]
        ↓
   [ForEach File]
        ↓
[Copy / Databricks Activity]

For example, if we have 4 files to ingest, we can use a Lookup activity to read the full list of files in one go and then iterate over it with a ForEach loop activity.

Here we pass the output of the Lookup activity as the input (the Items setting) of the ForEach activity.
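A minimal sketch, assuming the Lookup activity is named Lookup File List and returns multiple rows (First row only unchecked), so its rows are exposed under output.value; the inner Copy / Set Variable activities would go inside the activities array:

json

{
    "name": "ForEach File",
    "type": "ForEach",
    "dependsOn": [
        { "activity": "Lookup File List", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('Lookup File List').output.value",
            "type": "Expression"
        },
        "activities": []
    }
}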





To set variables inside the ForEach loop, we add a Set Variable activity within it.


In that activity, set the variable to the source URL taken from the current item of the Lookup output.
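A hedged sketch, assuming the Lookup file exposes a column named sourceURL and the pipeline has a String variable of the same name:

json

{
    "name": "Set source URL",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "sourceURL",
        "value": {
            "value": "@item().sourceURL",
            "type": "Expression"
        }
    }
}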



Now we are parameterizing the hardcoded URL in the linked service.
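A minimal sketch of a parameterized HTTP linked service, where the base URL comes from a linked-service parameter instead of being hardcoded (the names here are illustrative):

json

{
    "name": "ls_http_generic",
    "properties": {
        "type": "HttpServer",
        "parameters": {
            "sourceBaseURL": { "type": "String" }
        },
        "typeProperties": {
            "url": "@{linkedService().sourceBaseURL}",
            "authenticationType": "Anonymous"
        }
    }
}

The datasets built on this linked service can then supply the base URL (and relative URL) per source, which is what makes the metadata-driven approach below possible.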


Then, using the Copy activity, we ingest the data.

Metadata-driven Architecture 

There are 4 source URLs that we need to ingest with one pipeline and one trigger.


The Copy activity sits inside the ForEach activity, and a trigger is attached to that pipeline.
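A minimal sketch of a weekly schedule trigger attached to the pipeline (trigger name, pipeline name, and start time are placeholders):

json

{
    "name": "tr_ingest_ecdc_weekly",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Week",
                "interval": 1,
                "startTime": "2021-01-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "pl_ingest_ecdc_data",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}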


