
Transformation - Section 6 - Data Flow

 



| Feature from Slide | Explanation |
|---|---|
| Code-free data transformations | Data Flows in ADF allow you to build transformations using a drag-and-drop visual interface, with no need to write Spark or SQL code. |
| Executed on Data Factory-managed Databricks Spark clusters | Internally, ADF uses Azure Integration Runtimes backed by Apache Spark clusters managed by ADF, not by Databricks itself. While similar in concept, this is not the same as your own Databricks workspace. |
| Benefits from ADF scheduling and monitoring | Data Flows are fully integrated into ADF pipelines, so you get all the orchestration, parameterization, logging, and alerting features of ADF natively. |

⚠️ Important Clarification

Although it says "executed on Data Factory managed Databricks Spark clusters," this does not mean you're using your own Azure Databricks workspace. Rather:

  • ADF Data Flows run on ADF-managed Spark clusters.

  • Azure Databricks notebooks (which you trigger via an "Execute Notebook" activity) run on your Databricks workspace.

So, these are two different engines:

  • Mapping Data Flows → ADF-managed Spark

  • Databricks Notebook Activity → Azure Databricks (your own cluster)


✅ When to Use Data Flows vs. Databricks

| Use Case | Use Data Flows | Use Databricks |
|---|---|---|
| No-code / low-code transformation | ✅ Yes | ❌ Less suitable |
| Complex ML / custom PySpark | ❌ Limited | ✅ Ideal |
| Fully integrated into ADF | ✅ Native | ✅ Via notebook activity |
| Need reusable notebooks / jobs | ❌ No notebooks | ✅ Yes |


🔷 Types of Data Flows in ADF

| Type | Description | Purpose |
|---|---|---|
| Data Flow | Code-free data transformation at scale | Ideal for building ETL/ELT pipelines visually. Runs on ADF-managed Spark clusters. Supports complex joins, aggregations, lookups, pivots, derived columns, etc. |
| Wrangling Data Flow (Preview) | Code-free data preparation at scale, designed for data exploration and profiling | Uses the Power Query engine (same as Excel/Power BI). Good for data cleaning, shaping, and quick transformations in a familiar UI. Limited for complex logic. Still in preview, with fewer features and limited scalability. |

🧩 When to Use Which?

| Scenario | Use Data Flow | Use Wrangling Data Flow |
|---|---|---|
| Large-scale transformations | ✅ Yes | ❌ No |
| Data exploration & profiling | ❌ Not ideal | ✅ Yes |
| Complex ETL logic (joins, filters, aggregations) | ✅ Yes | ❌ No |
| Familiar with Power BI/Excel style | ❌ Not needed | ✅ Yes |
| Production-ready pipelines | ✅ Yes | ❌ No (preview only) |

🛠️ Backend Execution Engine

  • Data Flows use ADF’s Spark-based runtime under the hood.

  • Wrangling Data Flows use the Power Query engine, and the Power Query logic is converted into Spark only at execution time.


๐Ÿ“ Summary

| Feature | Data Flow | Wrangling Data Flow |
|---|---|---|
| Interface | Visual, scalable builder | Power Query (Excel/Power BI style) |
| Status | Generally Available | Preview (not for prod) |
| Ideal For | ETL at scale | Lightweight wrangling |
| Backend | Spark (ADF-managed) | Power Query → Spark |

| Phase | Mode | Purpose | Cost | When to Use |
|---|---|---|---|---|
| Development | Data Flow Debug | Test & preview transformations | Debug cluster cost | While building or troubleshooting |
| Production | Pipeline Execution | Execute full pipeline logic | Based on activity runtime | In deployed, automated pipelines |

Each data flow needs a source and a sink transformation, but we can add as many sources as we like.

Create the source transformation first.

After that, click Test connection and confirm that the connection is successful.


1. Source: CasesAndDeathsSource

  • Dataset used: ds_raw_cases_and_deaths

  • This step loads raw COVID-19 data.

🔹 Data Example (before filtering):

| country | continent | date       | cases | deaths | population |
|---------|-----------|------------|-------|--------|------------|
| Italy   | Europe    | 2020-02-01 | 2     | 0      | 60M        |
| Brazil  | Americas  | 2020-02-01 | 3     | 1      | 212M       |
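
Under the hood, the source stage is essentially a Spark read. A minimal PySpark sketch of the equivalent step, assuming the raw data is a CSV file in the data lake (the path and schema here are illustrative; in ADF they come from the ds_raw_cases_and_deaths dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cases_and_deaths_flow").getOrCreate()

# Hypothetical path; in ADF this location is defined by the ds_raw_cases_and_deaths dataset.
raw_df = spark.read.csv(
    "abfss://raw@mydatalake.dfs.core.windows.net/ecdc/cases_deaths.csv",
    header=True,
    inferSchema=True,
)
raw_df.show(5)
```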

2. FilterEuropeOnly

  • Applies a filter to keep only European countries.

  • Likely filter expression (ADF Data Flow expression language, not SQL):

    continent == 'Europe' && isNotNull(country_code)

🔹 Purpose: Remove non-European countries, ensuring you're working with relevant regional data.

🔹 Data After Filtering:

| country | continent | date       | cases | deaths | country_code |
|---------|-----------|------------|-------|--------|--------------|
| Italy   | Europe    | 2020-02-01 | 2     | 0      | IT           |
| Spain   | Europe    | 2020-02-01 | 5     | 1      | ES           |
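
The same filter expressed in PySpark, continuing the sketch above (column names are assumed from the examples):

```python
from pyspark.sql import functions as F

# Keep only European rows that also carry a country code.
europe_df = raw_df.filter(
    (F.col("continent") == "Europe") & F.col("country_code").isNotNull()
)
```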

3. SelectOnlyRequiredFields

  • Renames and reduces the number of fields to only essential ones.

🔹 Input fields might include: continent, country, country_code, population, cases, deaths, source, date.

🔹 After transformation, fields might become:

| country | country_code | population | cases_count | deaths_count | reported_date |
|---------|--------------|------------|-------------|--------------|---------------|
| Italy   | IT           | 60M        | 2           | 0            | 2020-02-01    |

✅ This step standardizes column names to prepare for downstream logic like pivoting and lookups.
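
A rough PySpark equivalent of this select-and-rename step, continuing the sketch (the exact column list is assumed from the example above):

```python
# Project and rename to the standardized column names.
selected_df = europe_df.select(
    F.col("country"),
    F.col("country_code"),
    F.col("population"),
    F.col("cases").alias("cases_count"),
    F.col("deaths").alias("deaths_count"),
    F.col("date").alias("reported_date"),
)
```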


4. PivotCounts

  • Performs a pivot transformation on case and death counts.

🔹 For example, converts multiple rows like:

| metric | value |
|--------|-------|
| cases  | 2     |
| deaths | 0     |

Into:

| cases_count | deaths_count |
|-------------|--------------|
| 2           | 0            |

This step:

  • Groups by keys such as country_code and date.

  • Pivots metric (cases, deaths) into separate columns.

✅ Makes the data easier to analyze and visualize downstream.
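
A PySpark sketch of the same idea, assuming a hypothetical long-format input with metric and value columns as in the example above:

```python
# long_df is a hypothetical long-format DataFrame with columns:
# country_code, reported_date, metric ("cases" / "deaths"), value
pivoted_df = (
    long_df.groupBy("country_code", "reported_date")
    .pivot("metric", ["cases", "deaths"])  # pivot values are finite and known in advance
    .sum("value")
    .withColumnRenamed("cases", "cases_count")
    .withColumnRenamed("deaths", "deaths_count")
)
```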


5. LookupCountry

  • Performs a lookup join using the ds_country_lookup dataset.

🔹 This enriches the dataset with metadata, like:

  • _entry_code_3_digit (likely ISO alpha-3 code)

  • Region, subregion, or demographic grouping

  • Possibly economic or geographic information

🔹 Join key: country_code

🔹 Result:

| country | country_code | _entry_code_3_digit | population | cases | deaths | date |
|---------|--------------|---------------------|------------|-------|--------|------|
| Italy   | IT           | ITA                 | 60M        | 2     | 0      | ...  |

✅ Adds contextual intelligence to raw data, enhancing its utility.
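
A PySpark sketch of the equivalent lookup join, assuming a hypothetical lookup file behind the ds_country_lookup dataset:

```python
# Hypothetical lookup DataFrame; in ADF this comes from the ds_country_lookup dataset.
lookup_df = spark.read.csv(
    "abfss://lookup@mydatalake.dfs.core.windows.net/country_lookup.csv",
    header=True,
)

# Enrich with country metadata via a left join on country_code,
# so rows without a lookup match are kept rather than dropped.
enriched_df = selected_df.join(
    lookup_df.select("country_code", "_entry_code_3_digit"),
    on="country_code",
    how="left",
)
```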


6. SelectForSink

  • Final selection of 8 columns that will be written to the sink.

  • May look like:

    • country

    • country_code

    • population

    • cases_count

    • deaths_count

    • reported_date

    • _entry_code_3_digit

    • source

✅ Ensures the final schema is clean, ready for output.
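
Continuing the PySpark sketch, the final projection might look like this (the column list is assumed from the step above, including a "source" column carried through from the raw data):

```python
# Final projection to the 8-column sink schema.
sink_df = enriched_df.select(
    "country",
    "country_code",
    "population",
    "cases_count",
    "deaths_count",
    "reported_date",
    "_entry_code_3_digit",
    "source",
)
```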


7. Data Preview

This lets you preview the data at this transformation stage.

🧪 You can see:

  • Rows grouped by reported_date

  • Repeated population values

  • Cases and deaths changing per date

  • Source is consistent: "Epidemic intelligence, national..."

✅ Useful for debugging and validation before writing to the sink.


8. Sink (Not Yet Shown)

You’ll typically route this data to one of the following (see the write sketch after this list):

  • Azure Data Lake Storage (ADLS) for historical archive

  • SQL Database for Power BI reporting

  • Snowflake or Synapse for analytical modeling

  • Blob Storage in parquet/CSV for further consumption
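
As a Spark-level sketch of one of these options, writing the final DataFrame to the data lake as Parquet (path and partitioning are illustrative):

```python
# Write the curated output as Parquet, partitioned by reported date.
(
    sink_df.write.mode("overwrite")
    .partitionBy("reported_date")
    .parquet("abfss://processed@mydatalake.dfs.core.windows.net/ecdc/cases_and_deaths/")
)
```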


💡 Best Practices & Observations:

| Area             | Recommendation                                              |
|------------------|-------------------------------------------------------------|
| Filter step      | Use isNull() checks to avoid losing records                 |
| Pivot            | Ensure pivot columns are finite and known in advance        |
| Lookup           | Ensure country codes are standardized (2-digit vs. 3-digit) |
| Column names     | Use consistent naming: snake_case or camelCase              |
| Sink schema      | Validate the sink schema before full execution              |
| Parameterization | Use parameters to dynamically load by date or region        |
| Debug mode       | Always test in debug mode before triggering pipeline runs   |

✅ Summary:

This Data Flow:

  • Transforms COVID-19 data

  • Filters for Europe

  • Pivots case/death metrics

  • Enriches with country metadata

  • Prepares clean data for downstream analytics


  • Create a Mapping Data Flow (the design of your data transformations)
  • Create a Pipeline

  • Inside the pipeline, add a Data Flow Activity

  • In the Data Flow Activity’s settings, select the Mapping Data Flow you designed earlier

  • When the pipeline runs, it triggers the Mapping Data Flow via the Data Flow Activity (see the SDK sketch below)
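
This wiring can also be done from code. Below is a hedged sketch using the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, pipeline, and data flow names are placeholders, and exact model names can vary slightly between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource,
    ExecuteDataFlowActivity,
    DataFlowReference,
)

credential = DefaultAzureCredential()
client = DataFactoryManagementClient(credential, "<subscription-id>")  # placeholder subscription

# A pipeline with a single Data Flow activity that points at the Mapping Data Flow
# designed in the UI (all names below are illustrative).
activity = ExecuteDataFlowActivity(
    name="RunCasesAndDeathsFlow",
    data_flow=DataFlowReference(
        type="DataFlowReference",
        reference_name="df_transform_cases_deaths",
    ),
)
pipeline = PipelineResource(activities=[activity])

client.pipelines.create_or_update(
    "my-resource-group", "my-data-factory", "pl_process_cases_deaths", pipeline
)
```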


  • Enable Data Flow Debug:

    • Click "Data Flow Debug" in the Data Flow designer (top bar).

    • This spins up a temporary Spark cluster.

    • You can preview data, test transformations, and check outputs quickly.

    • TTL (Time to Live) controls how long the debug cluster stays up to reduce spin-up time and cost.

    Avoid pipeline-level debug when not necessary:

    • Pipeline Debug runs the Data Flow on full production IR, which is more expensive.

    • Use only when you need to test integration with other activities, like triggers, lookups, or parameters.



After creating a data flow in Azure Data Factory or Synapse Pipelines, you can manually trigger the pipeline by selecting "Trigger Now". This allows you to test the pipeline without having to create a scheduled trigger.

When running a pipeline that includes a Data Flow activity, a Spark compute cluster is automatically allocated in the background to execute the job during the manual run. This eliminates the need to pre-provision compute resources for testing.

This method is useful for testing and debugging purposes before setting up automated or scheduled triggers.
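
The same manual trigger can also be issued programmatically. A small sketch, reusing the placeholder names and client from the SDK example above (the run states shown are examples):

```python
# Kick off an on-demand run of the pipeline (equivalent to "Trigger Now" in the UI).
run_response = client.pipelines.create_run(
    "my-resource-group", "my-data-factory", "pl_process_cases_deaths"
)

# Check the run status; the Data Flow's Spark cluster is provisioned automatically in the background.
pipeline_run = client.pipeline_runs.get(
    "my-resource-group", "my-data-factory", run_response.run_id
)
print(pipeline_run.status)  # e.g. "InProgress", "Succeeded", "Failed"
```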
