Incremental strategy

 What is an Incremental Strategy?

An incremental strategy is a method of loading data where, instead of reloading the entire dataset every time, you only load the new or changed data since the last successful load.

The primary goals of an incremental strategy are to:

  1. Improve Performance: By processing only a subset of the data, the load time is significantly reduced compared to a full load.

  2. Reduce Resource Consumption: Less data means less compute, storage, and network bandwidth are used.

  3. Optimize Cost: Directly related to reduced resource consumption, leading to lower operational costs.

  4. Minimize Impact on Source Systems: Querying smaller datasets places less strain on transactional source systems.

Common Techniques for Implementing an Incremental Strategy:

There are several common techniques used to identify and extract incremental data:

  1. Watermark Column (Timestamp or ID-based):

    • How it works: This is the most common and robust method. You identify a column in your source table that indicates when a record was last modified (e.g., LastModifiedDate, UpdateTimeStamp) or a monotonically increasing ID (e.g., AutoIncrementID).

    • Process:

      • During the initial load, you get all historical data.

      • After each successful incremental load, you store the highest value (the "watermark") of that column processed so far in a control table (e.g., in your data warehouse).

      • For subsequent loads, you query the source system for records where the watermark column's value is greater than the last stored watermark.

      • You then update the watermark in your control table with the new highest value from the current batch.

    • Pros: Relatively simple to implement, efficient, handles updates.

    • Cons: Doesn't directly handle deleted records from the source unless combined with other methods. Requires the source table to have a suitable watermark column.
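
    • Example (illustrative sketch): a minimal Python version of the watermark pattern, assuming a pyodbc connection to the source, a hypothetical control table etl_watermark, and a source table orders with a LastModifiedDate column (all names are illustrative):

      # Minimal watermark-based incremental load (table/column names are illustrative).
      import pyodbc

      conn = pyodbc.connect("DSN=SourceDB")  # assumed ODBC DSN for the source system
      cur = conn.cursor()

      # 1. Read the last stored watermark from the control table.
      cur.execute("SELECT last_watermark FROM etl_watermark WHERE table_name = 'orders'")
      last_watermark = cur.fetchone()[0]

      # 2. Extract only rows modified after that watermark.
      cur.execute("SELECT * FROM orders WHERE LastModifiedDate > ?", last_watermark)
      changed_rows = cur.fetchall()

      # ... upsert changed_rows into the target here ...

      # 3. Advance the watermark to the highest value seen in this batch.
      if changed_rows:
          new_watermark = max(row.LastModifiedDate for row in changed_rows)
          cur.execute(
              "UPDATE etl_watermark SET last_watermark = ? WHERE table_name = 'orders'",
              new_watermark,
          )
          conn.commit()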

  2. Change Data Capture (CDC):

    • How it works: CDC is a database feature (e.g., SQL Server CDC, Oracle CDC) that captures and records DML (Data Manipulation Language) changes (inserts, updates, deletes) made to a database table. It typically stores these changes in separate change tables or log files.

    • Process: Your ETL process reads directly from these CDC change tables or logs to identify all types of changes.

    • Pros: Captures inserts, updates, and deletes. Very accurate.

    • Cons: Can be complex to set up and manage. Adds overhead to the source database. Requires specific database features.
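
    • Example (illustrative sketch): reading SQL Server CDC changes from Python, assuming CDC is enabled with a capture instance named dbo_orders and a hypothetical state table etl_cdc_state holding the last processed LSN:

      # Read SQL Server CDC changes for an assumed capture instance 'dbo_orders'.
      import pyodbc

      conn = pyodbc.connect("DSN=SourceDB")
      cur = conn.cursor()

      # LSN range to read: from just after the last processed LSN up to the current maximum.
      cur.execute("SELECT last_lsn FROM etl_cdc_state WHERE capture_instance = 'dbo_orders'")
      last_lsn = cur.fetchone()[0]
      cur.execute("SELECT sys.fn_cdc_increment_lsn(?)", last_lsn)
      from_lsn = cur.fetchone()[0]
      cur.execute("SELECT sys.fn_cdc_get_max_lsn()")
      to_lsn = cur.fetchone()[0]

      # __$operation column: 1 = delete, 2 = insert, 4 = update (after image) with N'all'.
      cur.execute(
          "SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_orders(?, ?, N'all')",
          from_lsn, to_lsn,
      )
      for change in cur.fetchall():
          pass  # route the row to insert/update/delete logic based on its __$operation value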

  3. Change Tracking:

    • How it works: Similar to CDC but lighter-weight. It tracks which rows have been changed and provides information about the change (insert/update/delete) but not the actual data values that were changed (only the primary key of the changed row).

    • Process: Your ETL reads the change tracking information and then queries the source table using the primary keys to get the current state of changed rows. For deletes, it provides the primary key of the deleted row.

    • Pros: Less overhead than full CDC. Handles inserts, updates, and deletes.

    • Cons: Doesn't provide old/new values for updates, requiring an additional lookup.
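
    • Example (illustrative sketch): using SQL Server Change Tracking from Python, assuming an orders table keyed by OrderID and a hypothetical state table etl_ct_state:

      # Fetch the keys of rows changed since the last sync version, then join back
      # to the source table for their current values.
      import pyodbc

      conn = pyodbc.connect("DSN=SourceDB")
      cur = conn.cursor()

      cur.execute("SELECT last_sync_version FROM etl_ct_state WHERE table_name = 'orders'")
      last_version = cur.fetchone()[0]

      # CHANGETABLE returns the primary key and operation (I/U/D), not the changed values.
      cur.execute(
          """
          SELECT ct.OrderID, ct.SYS_CHANGE_OPERATION, o.*
          FROM CHANGETABLE(CHANGES dbo.orders, ?) AS ct
          LEFT JOIN dbo.orders AS o ON o.OrderID = ct.OrderID
          """,
          last_version,
      )
      for row in cur.fetchall():
          if row.SYS_CHANGE_OPERATION == "D":
              pass  # delete the row with this OrderID from the target
          else:
              pass  # upsert the current source values into the target

      # Remember where to resume from on the next run.
      cur.execute("SELECT CHANGE_TRACKING_CURRENT_VERSION()")
      next_version = cur.fetchone()[0]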

  4. Last Loaded Timestamp/Flag (Source-side):

    • How it works: The source system itself marks records as 'processed' after they are loaded (e.g., setting a LoadedFlag to true, or updating a LastLoadedDateTime).

    • Pros: Controlled by the source system, potentially simpler for specific applications.

    • Cons: Requires modifying the source system, which might not always be feasible or desirable. Can introduce coupling.
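
    • Example (illustrative sketch): a simplified flag-based extract, assuming the source table exposes a LoadedFlag column that the ETL process is allowed to update:

      # Extract unprocessed rows, load them, then mark them as processed on the source.
      import pyodbc

      conn = pyodbc.connect("DSN=SourceDB")
      cur = conn.cursor()

      # 1. Pull only rows the source has not yet flagged as loaded.
      cur.execute("SELECT * FROM orders WHERE LoadedFlag = 0")
      rows = cur.fetchall()
      loaded_keys = [row.OrderID for row in rows]

      # ... load rows into the target ...

      # 2. Flag exactly the rows that were extracted (keying on their IDs avoids
      #    accidentally flagging rows inserted after the SELECT).
      for key in loaded_keys:
          cur.execute("UPDATE orders SET LoadedFlag = 1 WHERE OrderID = ?", key)
      conn.commit()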

  5. Checksum/Hash Comparison:

    • How it works: For each record, a checksum or hash value is calculated based on the relevant columns. During the incremental load, you compare the checksum of records in the source with the checksum of corresponding records already in the target. If the checksums differ, the record has changed.

    • Pros: Can detect changes even if a LastModifiedDate isn't updated for all column changes.

    • Cons: Computationally intensive, especially for large datasets. Requires comparing every record, which can be slow.
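
    • Example (illustrative sketch): computing a per-row hash over the business columns in Python; a row whose hash differs from the stored hash is treated as changed (column names are illustrative):

      # Detect new or changed rows by comparing a hash of their business columns.
      import hashlib

      BUSINESS_COLUMNS = ("CustomerName", "Amount", "Status")  # illustrative column list

      def row_hash(row: dict) -> str:
          # Concatenate the tracked columns and hash them; changing any column changes the hash.
          payload = "|".join(str(row[c]) for c in BUSINESS_COLUMNS)
          return hashlib.sha256(payload.encode("utf-8")).hexdigest()

      def changed_keys(source_rows: dict, target_hashes: dict) -> list:
          # source_rows: {primary_key: row dict}; target_hashes: {primary_key: stored hash}
          return [key for key, row in source_rows.items()
                  if target_hashes.get(key) != row_hash(row)]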

  6. Full Table Comparison (Less common for large datasets):

    • How it works: A full extract of the source table is compared row-by-row (or using primary keys and checksums) against the entire target table to identify differences.

    • Pros: Guarantees data consistency.

    • Cons: Very inefficient and resource-intensive for large tables. Often used only for small lookup tables or during initial setup.
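
    • Example (illustrative sketch): for a small lookup table, a full comparison can be a simple diff of {primary key: row hash} maps built from both sides:

      # Full comparison of source vs. target by primary key and row hash (small tables only).
      def full_diff(source: dict, target: dict):
          # source/target: {primary_key: row_hash}
          to_insert = source.keys() - target.keys()               # keys only in the source
          to_delete = target.keys() - source.keys()               # keys only in the target
          to_update = {k for k in source.keys() & target.keys()
                       if source[k] != target[k]}                 # same key, different content
          return to_insert, to_delete, to_update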

Choosing the Right Strategy:

The best incremental strategy depends on several factors:

  • Source System Capabilities: Does it support CDC, change tracking, or have a suitable watermark column?

  • Data Volume and Velocity: High volumes often demand more efficient methods like CDC or a watermark column.

  • Latency Requirements: How quickly does the data need to be synchronized?

  • Handling Deletions: Is it critical to replicate deletions, or is it acceptable to only replicate inserts/updates?

  • Impact on Source System: How much overhead can the source system tolerate?

  • Complexity vs. Robustness: Simpler solutions might be quicker to implement but less comprehensive.

In Azure Data Factory (ADF), the watermark column approach is very commonly implemented using Lookup and Copy activities, as described in the previous scenario answer. For more complex scenarios, ADF can integrate with CDC features (e.g., SQL Server CDC) or use Data Flows for more sophisticated change detection logic.
