
Session 8 - Silver Layer Transformation

To ingest data from an API into a Databricks environment (especially using the Lakehouse architecture), you typically follow a structured pipeline where you manage the data using various layers (e.g., Bronze, Silver, and Gold). The Silver layer is typically where you cleanse and enrich the data before it's fully prepared for downstream analytics, business intelligence, or machine learning tasks.

1. Parquet File

Parquet is a columnar file format that is commonly used for storing large datasets in a compressed and optimized manner. It is widely supported by data processing systems such as Apache Spark, Hive, and BigQuery.

Features of Parquet Files:

  • Columnar storage: Parquet stores data in columns rather than rows, making it highly efficient for analytical queries (e.g., aggregate functions).

  • Efficient compression: It offers high compression rates, which reduces storage costs.

  • Schema support: Parquet files include schema information (e.g., data types, column names), so they are self-describing.

  • Interoperability: Parquet is an open-source format, and it can be used by multiple data engines (Spark, Hive, AWS Redshift, etc.).

Parquet as Raw Data:

In the context of the Lakehouse architecture or data pipelines, Parquet files are often used to store raw or semi-processed data in the Bronze layer of a Lakehouse, or even in the Silver layer once cleaned and transformed.
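For example, a raw Parquet file can be queried in place from Databricks SQL without defining a table first. The storage path below is hypothetical, purely for illustration:

SELECT *
FROM parquet.`abfss://bronze@yourstorageaccount.dfs.core.windows.net/daily_pricing/`
LIMIT 10;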


2. Delta Lake Table

Delta Lake is an open-source storage layer built on top of existing data lakes. It enhances the Parquet file format by adding ACID transactions, schema enforcement, and time travel features.

Delta Lake is essentially a transactional layer on top of Parquet files. It combines the benefits of Apache Parquet with ACID transaction support and other advanced features.

Key Features of Delta Lake:

  • ACID Transactions: Delta Lake brings atomicity, consistency, isolation, and durability (ACID) properties to data lakes. This allows you to perform updates, inserts, and deletes on data without worrying about corrupting the data.

  • Schema Evolution: Delta Lake supports schema evolution, meaning you can change the schema of your table (add/remove columns) as your data model evolves. It can also enforce schema validation.

  • Time Travel: Delta Lake supports time travel, allowing you to query previous versions of the data. This is useful for debugging, audit trails, or rolling back changes.

  • Optimized Reads and Writes: Delta Lake leverages data skipping and optimized file formats for faster reads and writes, especially for large datasets.

  • Concurrency: Delta Lake allows multiple concurrent reads and writes, enabling collaborative data processing.
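To make the ACID point concrete, here is a minimal sketch (not part of this session's pipeline) of running an update and a delete directly against the Silver Delta table we build later in this session; each statement is committed atomically as its own table version:

-- update rows in place; Delta records this as a new committed version
UPDATE pricing_analytics.silver.daily_pricing_silver
SET MODAL_PRICE = MAXIMUM_PRICE
WHERE MODAL_PRICE IS NULL;

-- delete bad rows; no risk of leaving the table half-written
DELETE FROM pricing_analytics.silver.daily_pricing_silver
WHERE ROW_ID IS NULL;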

If we want to read the data that is present inside Azure Blob Storage, we can query the corresponding Unity Catalog table with SQL:

SELECT * 
FROM unity_catalog_name.schema_name.table_name
LIMIT 10;

As in Hive, always run the command USE CATALOG unity_catalog_name; first to select the catalog.



Create the Silver layer table, explicitly defining the data types for the columns coming from the Bronze table, because this gives better clarity for the transformations and improves data quality.

We also add two extra columns, for inserted time and updated time, since the data will be loaded incrementally.
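One possible shape of this DDL is sketched below. The column names are taken from the INSERT statement later in this session; the exact data types and the audit column names inserted_timestamp and updated_timestamp are assumptions for illustration, not the course's verbatim definition.

USE CATALOG pricing_analytics;

CREATE TABLE IF NOT EXISTS silver.daily_pricing_silver (
  DATE_OF_PRICING        DATE,
  ROW_ID                 BIGINT,
  STATE_NAME             STRING,
  MARKET_NAME            STRING,
  PRODUCTGROUP_NAME      STRING,
  PRODUCT_NAME           STRING,
  VARIETY                STRING,
  ORIGIN                 STRING,
  ARRIVAL_IN_TONNES      DECIMAL(18,2),
  MINIMUM_PRICE          DECIMAL(36,2),
  MAXIMUM_PRICE          DECIMAL(36,2),
  MODAL_PRICE            DECIMAL(36,2),
  source_file_load_date  STRING,     -- assumed type, carried over from the Bronze table
  inserted_timestamp     TIMESTAMP,  -- audit column: when the row was first inserted
  updated_timestamp      TIMESTAMP   -- audit column: when the row was last updated
);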




After creating the table, we can see that the table type is Managed and that the format shown in the properties is Delta.



In the Overview tab, we can see the schema of the table.
Now that the tables are created, we are going to load them.




Here we convert each column into the desired data type using the SQL cast keyword (conceptually the same as Python casting, e.g. int("2") == 2) and the to_date function:

cast(ROW_ID as bigint)
to_date(DATE_OF_PRICING, 'dd/MM/yyyy')

USE CATALOG pricing_analytics;

INSERT INTO silver.daily_pricing_silver
SELECT
  to_date(DATE_OF_PRICING, 'dd/MM/yyyy'),
  cast(ROW_ID as bigint),
  STATE_NAME,
  MARKET_NAME,
  PRODUCTGROUP_NAME,
  PRODUCT_NAME,
  VARIETY,
  ORIGIN,
  cast(ARRIVAL_IN_TONNES as decimal(18,2)),
  cast(MINIMUM_PRICE as decimal(36,2)),
  cast(MAXIMUM_PRICE as decimal(36,2)),
  cast(MODAL_PRICE as decimal(36,2)),
  source_file_load_date,
  current_timestamp(),
  current_timestamp()
FROM pricing_analytics.bronze.daily_pricing
LIMIT 10;



Now insert the result of the SELECT statement into the Silver table we created. Just run

use catalog catalog_name;

followed by the INSERT INTO ... SELECT statement shown above.



The Silver table is now populated with the desired transformations applied.



Now we have loaded the data into the Silver table, but we also need to capture newly ingested data in the Silver layer.
Instead of changing the code again and again, we need to identify which data has been newly ingested each time the process runs. This incremental loading approach is known as CDC (Change Data Capture).

Delta Lake, which we use for the Silver and Gold layers, supports CDC out of the box.


We already created the table while ingesting the data from HTTP.




Each table is uniquely identified in Unity Catalog (catalog.schema.table).

We just add the few lines shown below to load only the newly ingested (incremental) data.
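One common pattern for this (a sketch of an assumed approach, not necessarily the exact lines used in the course) is to filter the Bronze source on the source_file_load_date column, so that only rows newer than the latest value already present in Silver are inserted:

USE CATALOG pricing_analytics;

-- insert only Bronze rows that arrived after the latest load already in Silver
-- (note: on an empty Silver table the MAX() subquery returns NULL, so the very
--  first load would need a full insert or a default watermark value instead)
INSERT INTO silver.daily_pricing_silver
SELECT
  to_date(DATE_OF_PRICING, 'dd/MM/yyyy'),
  cast(ROW_ID as bigint),
  STATE_NAME,
  MARKET_NAME,
  PRODUCTGROUP_NAME,
  PRODUCT_NAME,
  VARIETY,
  ORIGIN,
  cast(ARRIVAL_IN_TONNES as decimal(18,2)),
  cast(MINIMUM_PRICE as decimal(36,2)),
  cast(MAXIMUM_PRICE as decimal(36,2)),
  cast(MODAL_PRICE as decimal(36,2)),
  source_file_load_date,
  current_timestamp(),
  current_timestamp()
FROM pricing_analytics.bronze.daily_pricing
WHERE source_file_load_date >
      (SELECT MAX(source_file_load_date) FROM silver.daily_pricing_silver);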


Remember to refactor (clean up) the notebook so that it stays clean and readable.

After cleaning the notebook, go to Workflows and create a task to orchestrate the pipeline.
Remember to create the task with a job cluster so that the cluster automatically stops once the job finishes.
Decide the schedule timing, whether it has to run once a day or every 15 minutes, depending on the load.


Go to the advanced options to schedule it to run every day.

Module Summary

Databricks Functionalities:

  1.  What types of transformations are performed in the Silver layer?

    Please never mention only very basic transformations like concatenating columns or data type conversions. Always start with some complex transformations and use any of the ones given below. (Some of these transformations we will be doing in the coming chapters, but I have consolidated them here for complete reference.)


    • Standardising date/time values from different source systems

    • Integrating (joining) data from multiple source tables, or data coming from multiple source systems. We will learn this in Section 10.

    • Populating intermediate tables for loading reporting tables (covered in Section 10) and data lake tables (covered in Section 15)

    • Deriving business mapping values for the source values (e.g. converting different currency values into one single currency)

    • Converting complex nested JSON data into flattened tables (Covered in section 14)

  2.  Advantages of Using Delta Lake Tables

    • Delta Lake tables support ACID-compliant transactions

    • ACID compliance enables data integrity and consistency

    • Delta tables support time versioning; each commit on the table creates a new version of the table data

    • This allows you to query the different versions of the data (see the sketch below)
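As a quick illustration of the time-travel points above, here is a sketch against the Silver table from this session (the available version numbers depend on your table's actual history):

-- list the versions (commits) of the table
DESCRIBE HISTORY pricing_analytics.silver.daily_pricing_silver;

-- query the table as it looked at an earlier version
SELECT *
FROM pricing_analytics.silver.daily_pricing_silver VERSION AS OF 1
LIMIT 10;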
