Feature from Slide | Explanation |
---|---|
✅ Code-free data transformations | Data Flows in ADF allow you to build transformations using a drag-and-drop visual interface, with no need for writing Spark or SQL code. |
✅ Executed on Data Factory-managed Databricks Spark clusters | Internally, ADF uses Azure Integration Runtimes backed by Apache Spark clusters, managed by ADF, not Databricks itself. While it's similar in concept, this is not the same as your own Databricks workspace. |
✅ Benefits from ADF scheduling and monitoring | Data Flows are fully integrated into ADF pipelines, so you get all the orchestration, parameterization, logging, and alerting features of ADF natively. |
⚠️ Important Clarification
Although it says "executed on Data Factory managed Databricks Spark clusters," this does not mean you're using your own Azure Databricks workspace. Rather:
- ADF Data Flows run on ADF-managed Spark clusters.
- Azure Databricks notebooks (which you trigger via a Databricks Notebook activity) run on your own Databricks workspace.
So, these are two different engines:
- Mapping Data Flows → ADF-managed Spark
- Databricks Notebook activity → Azure Databricks (your own cluster)
✅ When to Use Data Flows vs. Databricks
Use Case | Use Data Flows | Use Databricks |
---|---|---|
No-code / low-code transformation | ✅ Yes | ❌ Less suitable |
Complex ML / Custom PySpark | ❌ Limited | ✅ Ideal |
Fully integrated into ADF | ✅ Native | ✅ Via notebook activity |
Need reusable notebooks / jobs | ❌ No notebooks | ✅ Yes |
🔷 Types of Data Flows in ADF
Type | Description | Purpose |
---|---|---|
Data Flow | Code-free data transformation at scale | Ideal for building ETL/ELT pipelines visually. Runs on ADF-managed Spark clusters. Supports complex joins, aggregations, lookups, pivot, derived columns, etc. |
Wrangling Data Flow (Preview) | Code-free data preparation at scale, designed for data exploration and profiling | Uses Power Query engine (same as Excel/Power BI). Good for data cleaning, shaping, quick transformation in a familiar UI. Limited in complex logic. Still in preview, with fewer features and limited scalability. |
🧩 When to Use Which?
Scenario | Use Data Flow | Use Wrangling Data Flow |
---|---|---|
Large-scale transformations | ✅ Yes | ❌ No |
Data exploration & profiling | ❌ Not ideal | ✅ Yes |
Complex ETL logic (joins, filters, aggregations) | ✅ Yes | ❌ No |
Familiar with Power BI/Excel style | ❌ Not needed | ✅ Yes |
Production-ready pipelines | ✅ Yes | ❌ No (Preview-only) |
🛠️ Backend Execution Engine
- Data Flows use ADF’s Spark-based runtime under the hood.
- Wrangling Data Flows use the Power Query engine, which converts the logic into Spark only at execution time.
📝 Summary
Feature | Data Flow | Wrangling Data Flow |
---|---|---|
Interface | Visual, scalable builder | Power Query (Excel/Power BI style) |
Status | Generally Available | Preview (not for prod) |
Ideal For | ETL at scale | Lightweight wrangling |
Backend | Spark (ADF-managed) | Power Query → Spark |
1. Source: CasesAndDeathsSource
- Dataset used: `ds_raw_cases_and_deaths`
- This step loads the raw COVID-19 data.
🔹 Data Example (before filtering):
country | continent | date | cases | deaths | population |
---|---|---|---|---|---|
Italy | Europe | 2020-02-01 | 2 | 0 | 60M |
Brazil | Americas | 2020-02-01 | 3 | 1 | 212M |
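For intuition, here is a minimal PySpark sketch of what the Source step does logically. ADF generates and runs its own Spark job for Data Flows, so this is illustrative only; the file path is a hypothetical stand-in for the `ds_raw_cases_and_deaths` dataset.

```python
from pyspark.sql import SparkSession

# Illustrative only: ADF builds its own Spark plan for the Source step.
spark = SparkSession.builder.appName("cases_and_deaths_source").getOrCreate()

# Hypothetical path standing in for the ds_raw_cases_and_deaths dataset
raw_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("/mnt/raw/ecdc/cases_deaths.csv")
)

raw_df.show(5)
```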
2. FilterEuropeOnly
- Applies a filter to keep only European countries.
- A likely filter expression is `continent == 'Europe'` (a PySpark equivalent is sketched after the table below).
🔹 Purpose: Remove non-European countries, ensuring you're working with relevant regional data.
🔹 Data After Filtering:
country | continent | date | cases | deaths | country_code |
---|---|---|---|---|---|
Italy | Europe | 2020-02-01 | 2 | 0 | IT |
Spain | Europe | 2020-02-01 | 5 | 1 | ES |
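A minimal sketch of the equivalent filter in PySpark, using inline sample rows that mirror the tables above (the null check reflects the best-practice note later in this post); this is not the code ADF generates:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("filter_europe").getOrCreate()

# Inline sample rows mirroring the "before filtering" example
raw_df = spark.createDataFrame(
    [
        ("Italy", "Europe", "2020-02-01", 2, 0),
        ("Brazil", "Americas", "2020-02-01", 3, 1),
    ],
    ["country", "continent", "date", "cases", "deaths"],
)

# FilterEuropeOnly equivalent: drop null continents, then keep European rows
europe_df = raw_df.filter(col("continent").isNotNull() & (col("continent") == "Europe"))
europe_df.show()
```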
3. SelectOnlyRequiredFields
- Renames and reduces the number of fields to only the essential ones.
🔹 Input fields might include: `continent`, `country`, `country_code`, `population`, `cases`, `deaths`, `source`, `date`.
🔹 After transformation, fields might become:
country | country_code | population | cases_count | deaths_count | reported_date |
---|---|---|---|---|---|
Italy | IT | 60M | 2 | 0 | 2020-02-01 |
✅ This step standardizes column names to prepare for downstream logic like pivoting and lookups.
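A rough PySpark equivalent of this select-and-rename step, using a single illustrative row (the sample values and the exact column mapping are assumptions based on the tables above):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("select_required_fields").getOrCreate()

# One illustrative input row with the full set of raw columns
europe_df = spark.createDataFrame(
    [("Europe", "Italy", "IT", 60_000_000, 2, 0, "ECDC", "2020-02-01")],
    ["continent", "country", "country_code", "population", "cases", "deaths", "source", "date"],
)

# SelectOnlyRequiredFields equivalent: keep essential columns and standardize names
selected_df = europe_df.select(
    col("country"),
    col("country_code"),
    col("population"),
    col("cases").alias("cases_count"),
    col("deaths").alias("deaths_count"),
    col("date").alias("reported_date"),
)
selected_df.show()
```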
4. PivotCounts
- Performs a pivot transformation on case and death counts.
🔹 For example, it converts long-format rows like:
country_code | date | metric | value |
---|---|---|---|
IT | 2020-02-01 | cases | 2 |
IT | 2020-02-01 | deaths | 0 |
Into wide-format rows like:
country_code | date | cases | deaths |
---|---|---|---|
IT | 2020-02-01 | 2 | 0 |
This step:
- Groups by keys such as `country_code` and `date`.
- Pivots `metric` (cases, deaths) into separate columns.
✅ Makes the data easier to analyze and visualize downstream.
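In Spark terms, the pivot is roughly equivalent to the following sketch (inline sample rows; ADF's generated plan and aggregation settings may differ):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.appName("pivot_counts").getOrCreate()

# Long-format rows: one row per (country_code, date, metric)
long_df = spark.createDataFrame(
    [
        ("IT", "2020-02-01", "cases", 2),
        ("IT", "2020-02-01", "deaths", 0),
        ("ES", "2020-02-01", "cases", 5),
        ("ES", "2020-02-01", "deaths", 1),
    ],
    ["country_code", "date", "metric", "value"],
)

# PivotCounts equivalent: group by the keys and turn each metric into its own column
pivoted_df = (
    long_df
    .groupBy("country_code", "date")
    .pivot("metric", ["cases", "deaths"])  # pivot values are finite and known in advance
    .agg(sum_("value"))
)
pivoted_df.show()
```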
5. LookupCountry
- Performs a lookup join using the `ds_country_lookup` dataset.
🔹 This enriches the dataset with metadata, like:
- `_entry_code_3_digit` (likely the ISO alpha-3 code)
- Region, subregion, or demographic grouping
- Possibly economic or geographic information
🔹 Join key: `country_code`
🔹 Result:
country | country_code | _entry_code_3_digit | population | cases | deaths | date |
---|---|---|---|---|---|---|
Italy | IT | ITA | 60M | 2 | 0 | ... |
✅ Adds contextual intelligence to raw data, enhancing its utility.
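A minimal PySpark sketch of the lookup expressed as a left join; the lookup dataset's columns beyond `country_code` and `_entry_code_3_digit` are assumptions for the sake of the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lookup_country").getOrCreate()

cases_df = spark.createDataFrame(
    [("Italy", "IT", 60_000_000, 2, 0, "2020-02-01")],
    ["country", "country_code", "population", "cases", "deaths", "date"],
)

# Stand-in for ds_country_lookup; the subregion column is an assumption
country_lookup_df = spark.createDataFrame(
    [("IT", "ITA", "Southern Europe")],
    ["country_code", "_entry_code_3_digit", "subregion"],
)

# LookupCountry equivalent: enrich the fact rows with country metadata via a left join
enriched_df = cases_df.join(country_lookup_df, on="country_code", how="left")
enriched_df.show()
```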
6. SelectForSink
- Final selection of the 8 columns that will be written to the sink.
- The column list may look like: `country`, `country_code`, `population`, `cases_count`, `deaths_count`, `reported_date`, `_entry_code_3_digit`, `source`.
- ✅ Ensures the final schema is clean and ready for output.
7. Data Preview
This lets you preview the data at this transformation stage.
🧪 You can see:
- Rows grouped by `reported_date`
- Repeated population values
- Cases and deaths changing per date
- A consistent source: "Epidemic intelligence, national..."
✅ Useful for debugging and validation before writing to the sink.
8. Sink (Not Yet Shown)
You’ll typically route this data to:
- Azure Data Lake Storage (ADLS) for a historical archive
- A SQL Database for Power BI reporting
- Snowflake or Synapse for analytical modeling
- Blob Storage in Parquet/CSV for further consumption
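As an illustration of the file-based option, here is a PySpark sketch of writing the curated 8-column output as Parquet. The output path is a hypothetical mount; a real ADF sink is configured through a dataset rather than code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sink_demo").getOrCreate()

# One illustrative row with the final 8-column schema
final_df = spark.createDataFrame(
    [("Italy", "IT", "ITA", 60_000_000, 2, 0, "2020-02-01", "ECDC")],
    ["country", "country_code", "_entry_code_3_digit", "population",
     "cases_count", "deaths_count", "reported_date", "source"],
)

# Sink equivalent: write the curated output as Parquet.
# The path is a hypothetical ADLS/Blob mount used only for illustration.
final_df.write.mode("overwrite").parquet("/mnt/processed/ecdc/cases_deaths_europe")
```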
💡 Best Practices & Observations:
Area | Recommendation |
---|---|
Filter step | Use isNull() checks to avoid losing records |
Pivot | Ensure pivot columns are finite and known in advance |
Lookup | Ensure country codes are standardized (2-digit vs. 3-digit) |
Column names | Use consistent naming: snake_case or camelCase |
Sink schema | Validate sink schema before full execution |
Parameterization | Use parameters to dynamically load by date or region |
Debug mode | Always test in debug mode before triggering pipeline runs |
✅ Summary:
This Data Flow:
- Transforms COVID-19 data
- Filters for Europe
- Pivots case/death metrics
- Enriches with country metadata
- Prepares clean data for downstream analytics
To run it from a pipeline:
1. Create a pipeline.
2. Inside the pipeline, add a Data Flow activity.
3. In the Data Flow activity’s settings, select the Mapping Data Flow you designed earlier.
4. When the pipeline runs, it triggers the Mapping Data Flow via the Data Flow activity.
✅ Enable Data Flow Debug:
- Click "Data Flow Debug" in the Data Flow designer (top bar).
- This spins up a temporary Spark cluster.
- You can preview data, test transformations, and check outputs quickly.
- TTL (Time to Live) controls how long the debug cluster stays up, reducing repeated spin-up time and cost.
✅ Avoid pipeline-level debug when not necessary:
- Pipeline Debug runs the Data Flow on the full production integration runtime, which is more expensive.
- Use it only when you need to test integration with other activities, such as triggers, lookups, or parameters.
After creating a data flow in Azure Data Factory or Synapse Pipelines, you can manually trigger the pipeline by selecting "Trigger Now". This allows you to test the pipeline without having to create a scheduled trigger.
When running a pipeline that includes a Data Flow activity, ADF automatically allocates a Spark compute cluster in the background to execute the job during the manual run. This eliminates the need to pre-provision compute resources for testing.
This method is useful for testing and debugging purposes before setting up automated or scheduled triggers.
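For completeness, here is a minimal sketch of starting the same kind of run programmatically with the Azure SDK for Python, the API equivalent of clicking "Trigger Now". The subscription, resource group, factory, and pipeline names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholders: substitute your own subscription, resource group, factory, and pipeline
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
pipeline_name = "<pipeline-with-data-flow-activity>"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Start a one-off pipeline run (the API equivalent of "Trigger Now")
run = adf_client.pipelines.create_run(
    resource_group, factory_name, pipeline_name, parameters={}
)
print(f"Pipeline run started: {run.run_id}")

# Check the run status (Queued / InProgress / Succeeded / Failed)
run_details = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
print(f"Current status: {run_details.status}")
```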