
Session 8 - Silver Layer Transformation

To ingest data from an API into a Databricks environment (especially using the Lakehouse architecture), you typically follow a structured pipeline where you manage the data using various layers (e.g., Bronze, Silver, and Gold). The Silver layer is typically where you cleanse and enrich the data before it's fully prepared for downstream analytics, business intelligence, or machine learning tasks.

1. Parquet File

Parquet is a columnar file format that is commonly used for storing large datasets in a compressed and optimized manner. It is widely supported by data processing systems such as Apache Spark, Hive, and BigQuery.

Features of Parquet Files:

  • Columnar storage: Parquet stores data in columns rather than rows, making it highly efficient for analytical queries (e.g., aggregate functions).

  • Efficient compression: It offers high compression rates, which reduces storage costs.

  • Schema support: Parquet files include schema information (e.g., data types, column names), so they are self-describing.

  • Interoperability: Parquet is an open-source format, and it can be used by multiple data engines (Spark, Hive, AWS Redshift, etc.).

Parquet as Raw Data:

In the context of the Lakehouse architecture or data pipelines, Parquet files are often used to store raw or semi-processed data in the Bronze layer of a Lakehouse, or even in the Silver layer once cleaned and transformed.
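For example, a raw Parquet file can be queried in place from Databricks SQL without defining a table first. The storage path below is hypothetical, purely for illustration:

SELECT *
FROM parquet.`abfss://bronze@yourstorageaccount.dfs.core.windows.net/daily_pricing/`
LIMIT 10;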


2. Delta Lake Table

Delta Lake is an open-source storage layer built on top of existing data lakes. It enhances the Parquet file format by adding ACID transactions, schema enforcement, and time travel features.

Delta Lake is essentially a transactional layer on top of Parquet files. It combines the benefits of Apache Parquet with ACID transaction support and other advanced features.

Key Features of Delta Lake:

  • ACID Transactions: Delta Lake brings atomicity, consistency, isolation, and durability (ACID) properties to data lakes. This allows you to perform updates, inserts, and deletes on data without worrying about corrupting the data.

  • Schema Evolution: Delta Lake supports schema evolution, meaning you can change the schema of your table (add/remove columns) as your data model evolves. It can also enforce schema validation.

  • Time Travel: Delta Lake supports time travel, allowing you to query previous versions of the data. This is useful for debugging, audit trails, or rolling back changes.

  • Optimized Reads and Writes: Delta Lake leverages data skipping and optimized file formats for faster reads and writes, especially for large datasets.

  • Concurrency: Delta Lake allows multiple concurrent reads and writes, enabling collaborative data processing.
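To make the ACID point concrete, here is a minimal sketch (not part of this session's pipeline) of running an update and a delete directly against the Silver Delta table we build later in this session; each statement is committed atomically as its own table version:

-- update rows in place; Delta records this as a new committed version
UPDATE pricing_analytics.silver.daily_pricing_silver
SET MODAL_PRICE = MAXIMUM_PRICE
WHERE MODAL_PRICE IS NULL;

-- delete bad rows; no risk of leaving the table half-written
DELETE FROM pricing_analytics.silver.daily_pricing_silver
WHERE ROW_ID IS NULL;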

If we want to read the data that is present inside Azure Blob Storage, we can query the corresponding Unity Catalog table with SQL:

SELECT * 
FROM unity_catalog_name.schema_name.table_name
LIMIT 10;

As in Hive, always run the command USE CATALOG unity_catalog_name; first to select the catalog.



Create the Silver layer table, explicitly defining the data types for the columns coming from the Bronze table, because this gives better clarity for the transformations and improves data quality.

We also add two extra columns, for inserted time and updated time, since the data will be loaded incrementally.
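One possible shape of this DDL is sketched below. The column names are taken from the INSERT statement later in this session; the exact data types and the audit column names inserted_timestamp and updated_timestamp are assumptions for illustration, not the course's verbatim definition.

USE CATALOG pricing_analytics;

CREATE TABLE IF NOT EXISTS silver.daily_pricing_silver (
  DATE_OF_PRICING        DATE,
  ROW_ID                 BIGINT,
  STATE_NAME             STRING,
  MARKET_NAME            STRING,
  PRODUCTGROUP_NAME      STRING,
  PRODUCT_NAME           STRING,
  VARIETY                STRING,
  ORIGIN                 STRING,
  ARRIVAL_IN_TONNES      DECIMAL(18,2),
  MINIMUM_PRICE          DECIMAL(36,2),
  MAXIMUM_PRICE          DECIMAL(36,2),
  MODAL_PRICE            DECIMAL(36,2),
  source_file_load_date  STRING,     -- assumed type, carried over from the Bronze table
  inserted_timestamp     TIMESTAMP,  -- audit column: when the row was first inserted
  updated_timestamp      TIMESTAMP   -- audit column: when the row was last updated
);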




After creating the table, we can see that the table type is Managed and that the format shown in the properties is Delta.



In the Overview tab, we can see the schema of the table.
Now that the tables are created, we are going to load them.




Here we convert each column into the desired data type using the SQL cast keyword (conceptually the same as Python casting, e.g. int("2") == 2) and the to_date function:

cast(ROW_ID as bigint)
to_date(DATE_OF_PRICING, 'dd/MM/yyyy')

USE CATALOG pricing_analytics;

INSERT INTO silver.daily_pricing_silver
SELECT
  to_date(DATE_OF_PRICING, 'dd/MM/yyyy'),
  cast(ROW_ID as bigint),
  STATE_NAME,
  MARKET_NAME,
  PRODUCTGROUP_NAME,
  PRODUCT_NAME,
  VARIETY,
  ORIGIN,
  cast(ARRIVAL_IN_TONNES as decimal(18,2)),
  cast(MINIMUM_PRICE as decimal(36,2)),
  cast(MAXIMUM_PRICE as decimal(36,2)),
  cast(MODAL_PRICE as decimal(36,2)),
  source_file_load_date,
  current_timestamp(),
  current_timestamp()
FROM pricing_analytics.bronze.daily_pricing
LIMIT 10;



Now insert the result of the SELECT statement into the Silver table we created. Just run

use catalog catalog_name;

followed by the INSERT INTO ... SELECT statement shown above.



The Silver table is now populated with the desired transformations applied.



Now we have loaded the data into the Silver table, but we also need to capture newly ingested data in the Silver layer.
Instead of changing the code again and again, we need to identify which data has been newly ingested each time the process runs. This incremental loading approach is known as CDC (Change Data Capture).

Delta Lake, which we use for the Silver and Gold layers, supports CDC out of the box.


We already created the table while ingesting the data from HTTP.




Each table is uniquely identified in Unity Catalog (catalog.schema.table).

We just add the few lines shown below to load only the newly ingested (incremental) data.
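One common pattern for this (a sketch of an assumed approach, not necessarily the exact lines used in the course) is to filter the Bronze source on the source_file_load_date column, so that only rows newer than the latest value already present in Silver are inserted:

USE CATALOG pricing_analytics;

-- insert only Bronze rows that arrived after the latest load already in Silver
-- (note: on an empty Silver table the MAX() subquery returns NULL, so the very
--  first load would need a full insert or a default watermark value instead)
INSERT INTO silver.daily_pricing_silver
SELECT
  to_date(DATE_OF_PRICING, 'dd/MM/yyyy'),
  cast(ROW_ID as bigint),
  STATE_NAME,
  MARKET_NAME,
  PRODUCTGROUP_NAME,
  PRODUCT_NAME,
  VARIETY,
  ORIGIN,
  cast(ARRIVAL_IN_TONNES as decimal(18,2)),
  cast(MINIMUM_PRICE as decimal(36,2)),
  cast(MAXIMUM_PRICE as decimal(36,2)),
  cast(MODAL_PRICE as decimal(36,2)),
  source_file_load_date,
  current_timestamp(),
  current_timestamp()
FROM pricing_analytics.bronze.daily_pricing
WHERE source_file_load_date >
      (SELECT MAX(source_file_load_date) FROM silver.daily_pricing_silver);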


Remember to refactor (clean up) the notebook so that it stays clean and readable.

After cleaning the notebook, go to Workflows and create a task to orchestrate the pipeline.
Remember to create the task with a job cluster so that the cluster automatically stops once the job finishes.
Decide the schedule timing, whether it has to run once a day or every 15 minutes, depending on the load.


Go to the advanced options to schedule it to run every day.

Module Summary

Databricks Functionalities:

  1.  What types of transformations are performed in the Silver layer?

    Please never mention only very basic transformations like concatenating columns or data type conversions. Always start with some complex transformations and use any of the ones given below. (Some of these transformations we will be doing in the coming chapters, but I have consolidated them here for complete reference.)


    • Standardising date/time values from different source systems

    • Integrating (joining) data from multiple source tables, or data coming from multiple source systems. We will learn this in Section 10.

    • Populating intermediate tables for loading reporting tables (covered in Section 10) and data lake tables (covered in Section 15)

    • Deriving business mapping values for the source values (e.g. converting different currency values into one single currency)

    • Converting complex nested JSON data into flattened tables (Covered in section 14)

  2.  Advantages of Using Delta Lake Tables

    • Delta Lake tables support ACID-compliant transactions

    • ACID compliance enables data integrity and consistency

    • Delta tables support time versioning; each commit on the table creates a new version of the table data

    • This allows you to query the different versions of the data (see the sketch below)
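As a quick illustration of the time-travel points above, here is a sketch against the Silver table from this session (the available version numbers depend on your table's actual history):

-- list the versions (commits) of the table
DESCRIBE HISTORY pricing_analytics.silver.daily_pricing_silver;

-- query the table as it looked at an earlier version
SELECT *
FROM pricing_analytics.silver.daily_pricing_silver VERSION AS OF 1
LIMIT 10;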
