
Transformation - Section 6 - Data Flow

 



| Feature from Slide | Explanation |
|---|---|
| Code-free data transformations | Data Flows in ADF allow you to build transformations using a drag-and-drop visual interface, with no need to write Spark or SQL code. |
| Executed on Data Factory-managed Databricks Spark clusters | Internally, ADF uses Azure Integration Runtimes backed by Apache Spark clusters managed by ADF, not by Databricks itself. While similar in concept, this is not the same as your own Databricks workspace. |
| Benefits from ADF scheduling and monitoring | Data Flows are fully integrated into ADF pipelines, so you get all the orchestration, parameterization, logging, and alerting features of ADF natively. |

⚠️ Important Clarification

Although it says "executed on Data Factory managed Databricks Spark clusters," this does not mean you're using your own Azure Databricks workspace. Rather:

  • ADF Data Flows run on ADF-managed Spark clusters.

  • Azure Databricks notebooks (which you trigger via an "Execute Notebook" activity) run on your Databricks workspace.

So, these are two different engines:

  • Mapping Data Flows → ADF-managed Spark

  • Databricks Notebook Activity → Azure Databricks (your own cluster)


✅ When to Use Data Flows vs. Databricks

| Use Case | Use Data Flows | Use Databricks |
|---|---|---|
| No-code / low-code transformation | ✅ Yes | ❌ Less suitable |
| Complex ML / custom PySpark | ❌ Limited | ✅ Ideal |
| Fully integrated into ADF | ✅ Native | ✅ Via notebook activity |
| Need reusable notebooks / jobs | ❌ No notebooks | ✅ Yes |


🔷 Types of Data Flows in ADF

| Type | Description | Purpose |
|---|---|---|
| Data Flow | Code-free data transformation at scale | Ideal for building ETL/ELT pipelines visually. Runs on ADF-managed Spark clusters. Supports complex joins, aggregations, lookups, pivots, derived columns, etc. |
| Wrangling Data Flow (Preview) | Code-free data preparation at scale, designed for data exploration and profiling | Uses the Power Query engine (same as Excel/Power BI). Good for data cleaning, shaping, and quick transformations in a familiar UI. Limited for complex logic. Still in preview, with fewer features and limited scalability. |

🧩 When to Use Which?

| Scenario | Use Data Flow | Use Wrangling Data Flow |
|---|---|---|
| Large-scale transformations | ✅ Yes | ❌ No |
| Data exploration & profiling | ❌ Not ideal | ✅ Yes |
| Complex ETL logic (joins, filters, aggregations) | ✅ Yes | ❌ No |
| Familiar with Power BI/Excel style | ❌ Not needed | ✅ Yes |
| Production-ready pipelines | ✅ Yes | ❌ No (preview only) |

🛠️ Backend Execution Engine

  • Data Flows use ADF’s Spark-based runtime under the hood.

  • Wrangling Data Flows use the Power Query engine, and the Power Query logic is converted into Spark only at execution time.


๐Ÿ“ Summary

| Feature | Data Flow | Wrangling Data Flow |
|---|---|---|
| Interface | Visual, scalable builder | Power Query (Excel/Power BI style) |
| Status | Generally Available | Preview (not for prod) |
| Ideal For | ETL at scale | Lightweight wrangling |
| Backend | Spark (ADF-managed) | Power Query → Spark |

| Phase | Mode | Purpose | Cost | When to Use |
|---|---|---|---|---|
| Development | Data Flow Debug | Test & preview transformations | Debug cluster cost | While building or troubleshooting |
| Production | Pipeline Execution | Execute full pipeline logic | Based on activity runtime | In deployed, automated pipelines |

Each data flow needs a source and a sink transformation, but we can add as many sources as we like.

Create the source transformation first.

After that, click Test connection and confirm that the connection is successful.


1. Source: CasesAndDeathsSource

  • Dataset used: ds_raw_cases_and_deaths

  • This step loads raw COVID-19 data.

🔹 Data Example (before filtering):

| country | continent | date       | cases | deaths | population |
|---------|-----------|------------|-------|--------|------------|
| Italy   | Europe    | 2020-02-01 | 2     | 0      | 60M        |
| Brazil  | Americas  | 2020-02-01 | 3     | 1      | 212M       |
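
Under the hood, the source stage is essentially a Spark read. A minimal PySpark sketch of the equivalent step, assuming the raw data is a CSV file in the data lake (the path and schema here are illustrative; in ADF they come from the ds_raw_cases_and_deaths dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cases_and_deaths_flow").getOrCreate()

# Hypothetical path; in ADF this location is defined by the ds_raw_cases_and_deaths dataset.
raw_df = spark.read.csv(
    "abfss://raw@mydatalake.dfs.core.windows.net/ecdc/cases_deaths.csv",
    header=True,
    inferSchema=True,
)
raw_df.show(5)
```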

2. FilterEuropeOnly

  • Applies a filter to keep only European countries.

  • Likely filter expression (ADF Data Flow expression language, not SQL):

    continent == 'Europe' && isNotNull(country_code)

🔹 Purpose: Remove non-European countries, ensuring you're working with relevant regional data.

🔹 Data After Filtering:

| country | continent | date       | cases | deaths | country_code |
|---------|-----------|------------|-------|--------|--------------|
| Italy   | Europe    | 2020-02-01 | 2     | 0      | IT           |
| Spain   | Europe    | 2020-02-01 | 5     | 1      | ES           |
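
The same filter expressed in PySpark, continuing the sketch above (column names are assumed from the examples):

```python
from pyspark.sql import functions as F

# Keep only European rows that also carry a country code.
europe_df = raw_df.filter(
    (F.col("continent") == "Europe") & F.col("country_code").isNotNull()
)
```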

3. SelectOnlyRequiredFields

  • Renames and reduces the number of fields to only essential ones.

🔹 Input fields might include: continent, country, country_code, population, cases, deaths, source, date.

🔹 After transformation, fields might become:

| country | country_code | population | cases_count | deaths_count | reported_date |
|---------|--------------|------------|-------------|--------------|---------------|
| Italy   | IT           | 60M        | 2           | 0            | 2020-02-01    |

✅ This step standardizes column names to prepare for downstream logic like pivoting and lookups.
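
A rough PySpark equivalent of this select-and-rename step, continuing the sketch (the exact column list is assumed from the example above):

```python
# Project and rename to the standardized column names.
selected_df = europe_df.select(
    F.col("country"),
    F.col("country_code"),
    F.col("population"),
    F.col("cases").alias("cases_count"),
    F.col("deaths").alias("deaths_count"),
    F.col("date").alias("reported_date"),
)
```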


4. PivotCounts

  • Performs a pivot transformation on case and death counts.

🔹 For example, converts multiple rows like:

| metric | value |
|--------|-------|
| cases  | 2     |
| deaths | 0     |

Into:

| cases_count | deaths_count |
|-------------|--------------|
| 2           | 0            |

This step:

  • Groups by keys such as country_code and date.

  • Pivots metric (cases, deaths) into separate columns.

✅ Makes the data easier to analyze and visualize downstream.
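
A PySpark sketch of the same idea, assuming a hypothetical long-format input with metric and value columns as in the example above:

```python
# long_df is a hypothetical long-format DataFrame with columns:
# country_code, reported_date, metric ("cases" / "deaths"), value
pivoted_df = (
    long_df.groupBy("country_code", "reported_date")
    .pivot("metric", ["cases", "deaths"])  # pivot values are finite and known in advance
    .sum("value")
    .withColumnRenamed("cases", "cases_count")
    .withColumnRenamed("deaths", "deaths_count")
)
```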


5. LookupCountry

  • Performs a lookup join using the ds_country_lookup dataset.

🔹 This enriches the dataset with metadata, like:

  • _entry_code_3_digit (likely ISO alpha-3 code)

  • Region, subregion, or demographic grouping

  • Possibly economic or geographic information

🔹 Join key: country_code

🔹 Result:

| country | country_code | _entry_code_3_digit | population | cases | deaths | date |
|---------|--------------|---------------------|------------|-------|--------|------|
| Italy   | IT           | ITA                 | 60M        | 2     | 0      | ...  |

✅ Adds contextual intelligence to raw data, enhancing its utility.
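
A PySpark sketch of the equivalent lookup join, assuming a hypothetical lookup file behind the ds_country_lookup dataset:

```python
# Hypothetical lookup DataFrame; in ADF this comes from the ds_country_lookup dataset.
lookup_df = spark.read.csv(
    "abfss://lookup@mydatalake.dfs.core.windows.net/country_lookup.csv",
    header=True,
)

# Enrich with country metadata via a left join on country_code,
# so rows without a lookup match are kept rather than dropped.
enriched_df = selected_df.join(
    lookup_df.select("country_code", "_entry_code_3_digit"),
    on="country_code",
    how="left",
)
```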


6. SelectForSink

  • Final selection of 8 columns that will be written to the sink.

  • May look like:

    • country

    • country_code

    • population

    • cases_count

    • deaths_count

    • reported_date

    • _entry_code_3_digit

    • source

✅ Ensures the final schema is clean, ready for output.
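
Continuing the PySpark sketch, the final projection might look like this (the column list is assumed from the step above, including a "source" column carried through from the raw data):

```python
# Final projection to the 8-column sink schema.
sink_df = enriched_df.select(
    "country",
    "country_code",
    "population",
    "cases_count",
    "deaths_count",
    "reported_date",
    "_entry_code_3_digit",
    "source",
)
```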


7. Data Preview

This lets you preview the data at this transformation stage.

🧪 You can see:

  • Rows grouped by reported_date

  • Repeated population values

  • Cases and deaths changing per date

  • Source is consistent: "Epidemic intelligence, national..."

✅ Useful for debugging and validation before writing to the sink.


8. Sink (Not Yet Shown)

You’ll typically route this data to one of the following (see the write sketch after this list):

  • Azure Data Lake Storage (ADLS) for historical archive

  • SQL Database for Power BI reporting

  • Snowflake or Synapse for analytical modeling

  • Blob Storage in parquet/CSV for further consumption
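
As a Spark-level sketch of one of these options, writing the final DataFrame to the data lake as Parquet (path and partitioning are illustrative):

```python
# Write the curated output as Parquet, partitioned by reported date.
(
    sink_df.write.mode("overwrite")
    .partitionBy("reported_date")
    .parquet("abfss://processed@mydatalake.dfs.core.windows.net/ecdc/cases_and_deaths/")
)
```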


💡 Best Practices & Observations:

| Area             | Recommendation                                              |
|------------------|-------------------------------------------------------------|
| Filter step      | Use isNull() checks to avoid losing records                 |
| Pivot            | Ensure pivot columns are finite and known in advance        |
| Lookup           | Ensure country codes are standardized (2-digit vs. 3-digit) |
| Column names     | Use consistent naming: snake_case or camelCase              |
| Sink schema      | Validate the sink schema before full execution              |
| Parameterization | Use parameters to dynamically load by date or region        |
| Debug mode       | Always test in debug mode before triggering pipeline runs   |

✅ Summary:

This Data Flow:

  • Transforms COVID-19 data

  • Filters for Europe

  • Pivots case/death metrics

  • Enriches with country metadata

  • Prepares clean data for downstream analytics


  • Create a Mapping Data Flow (the design of your data transformations)
  • Create a Pipeline

  • Inside the pipeline, add a Data Flow Activity

  • In the Data Flow Activity’s settings, select the Mapping Data Flow you designed earlier

  • When the pipeline runs, it triggers the Mapping Data Flow via the Data Flow Activity (see the SDK sketch below)
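
This wiring can also be done from code. Below is a hedged sketch using the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, pipeline, and data flow names are placeholders, and exact model names can vary slightly between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource,
    ExecuteDataFlowActivity,
    DataFlowReference,
)

credential = DefaultAzureCredential()
client = DataFactoryManagementClient(credential, "<subscription-id>")  # placeholder subscription

# A pipeline with a single Data Flow activity that points at the Mapping Data Flow
# designed in the UI (all names below are illustrative).
activity = ExecuteDataFlowActivity(
    name="RunCasesAndDeathsFlow",
    data_flow=DataFlowReference(
        type="DataFlowReference",
        reference_name="df_transform_cases_deaths",
    ),
)
pipeline = PipelineResource(activities=[activity])

client.pipelines.create_or_update(
    "my-resource-group", "my-data-factory", "pl_process_cases_deaths", pipeline
)
```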


  • Enable Data Flow Debug:

    • Click "Data Flow Debug" in the Data Flow designer (top bar).

    • This spins up a temporary Spark cluster.

    • You can preview data, test transformations, and check outputs quickly.

    • TTL (Time to Live) controls how long the debug cluster stays up to reduce spin-up time and cost.

    Avoid pipeline-level debug when not necessary:

    • Pipeline Debug runs the Data Flow on full production IR, which is more expensive.

    • Use only when you need to test integration with other activities, like triggers, lookups, or parameters.



After creating a data flow in Azure Data Factory or Synapse Pipelines, you can manually trigger the pipeline by selecting "Trigger Now". This allows you to test the pipeline without having to create a scheduled trigger.

When running a pipeline that includes a Data Flow activity, a Spark compute cluster is automatically allocated in the background to execute the job during the manual run. This eliminates the need to pre-provision compute resources for testing.

This method is useful for testing and debugging purposes before setting up automated or scheduled triggers.
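
The same manual trigger can also be issued programmatically. A small sketch, reusing the placeholder names and client from the SDK example above (the run states shown are examples):

```python
# Kick off an on-demand run of the pipeline (equivalent to "Trigger Now" in the UI).
run_response = client.pipelines.create_run(
    "my-resource-group", "my-data-factory", "pl_process_cases_deaths"
)

# Check the run status; the Data Flow's Spark cluster is provisioned automatically in the background.
pipeline_run = client.pipeline_runs.get(
    "my-resource-group", "my-data-factory", run_response.run_id
)
print(pipeline_run.status)  # e.g. "InProgress", "Succeeded", "Failed"
```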
