Spark SQL is an ideal choice when the data is in a structured format. Go to the SQL editor and save the file.
We create this to populate the first dimension table.
If we have to apply multiple transformations, we can do so by chaining them; but if we want to do it for different source tables, we have to do it by staging, starting with a Stage 1 table.
Now we add the surrogate key to the dimension table. Here we generate the surrogate keys in the State ID column for the unique values obtained by the CTAS above. The ROW_NUMBER() window function creates the surrogate keys (identity column).
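As a rough sketch, the surrogate-key step could look like the following; the table and column names (stage1_dim_state, state_name, silver_sales) are illustrative, not the course's exact ones:

-- Deduplicate the source values, then number them to form the surrogate key.
CREATE OR REPLACE TABLE stage1_dim_state AS
SELECT
  ROW_NUMBER() OVER (ORDER BY state_name) AS state_id,  -- surrogate key
  state_name
FROM (SELECT DISTINCT state_name FROM silver_sales) AS src;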
Then we add the timestamp, and the query is ready for reporting: insert into the table Reporting_dim_state_gold (the reporting table) using an INSERT ... SELECT statement.
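Continuing the sketch above, with the same assumed names:

INSERT INTO Reporting_dim_state_gold
SELECT
  state_id,
  state_name,
  current_timestamp() AS load_timestamp  -- audit column added at load time
FROM stage1_dim_state;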
Check that the data has been loaded into the reporting table and that all the values are unique, because dimension-table rows must be unique on the surrogate key (identity column), which is then linked to the fact table. Here the data got inserted twice, which is why we got duplicate rows; so always review before loading into the table.
source table --> Silver layer table
reporting table --> Gold layer table.
Now we compare both tables using a join query to check that all the data has landed properly in the reporting table. If we rerun this comparison query, we can identify the newly added values (CDC) that need to be added to the reporting table.
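One way to express that comparison is a left anti join, which returns only the source rows that have not yet landed in the reporting table; names here are assumptions:

-- Rows present in the silver source but missing from the gold dimension (CDC).
SELECT DISTINCT s.state_name
FROM silver_sales AS s
LEFT ANTI JOIN Reporting_dim_state_gold AS g
  ON s.state_name = g.state_name;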
The state ID has to be unique, but the new rows would again be numbered 1, 2, and so on, and those values are already taken by existing rows. So we need to know the maximum value already assigned in the state ID column; for this we create one extra table on top of the existing one, named the Stage 2 table. Always remember that during the first run there will be no records in the table, because the data is loaded in batches.
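A sketch of the Stage 2 step, offsetting the new keys by the current maximum so they never collide; the COALESCE covers the first run, when the reporting table is still empty. All names are illustrative:

CREATE OR REPLACE TABLE stage2_dim_state AS
SELECT
  ROW_NUMBER() OVER (ORDER BY state_name)
    + (SELECT COALESCE(MAX(state_id), 0)
       FROM Reporting_dim_state_gold) AS state_id,  -- continue the sequence
  state_name
FROM new_states;  -- assumed to hold only the new (CDC) rows found above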
Now we are completely ready to load the data into the reporting table.
88. Dimension table load - Load Reporting_DIM_state_load (reporting table).
We are going to create the Stage 3 (intermediate) table from a join query with the existing Stage 2 table, adding the current and previous timestamps.
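One plausible shape for the Stage 3 table, keeping the existing rows alongside the newly keyed ones; the exact combine logic and every name here are assumptions rather than the course's code:

CREATE OR REPLACE TABLE stage3_dim_state AS
SELECT state_id, state_name, load_timestamp        -- existing rows keep their previous timestamp
FROM Reporting_dim_state_gold
UNION ALL
SELECT state_id, state_name, current_timestamp()   -- new rows get the current timestamp
FROM stage2_dim_state;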
Remember: always use a proper, descriptive name for an alias.
Truncate the gold table (reporting table) that needs to be updated, because the data is now completely available in the Stage 3 table, and insert the values using an INSERT INTO ... SELECT statement.
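A sketch of that truncate-and-reload, again with assumed names:

TRUNCATE TABLE Reporting_dim_state_gold;

INSERT INTO Reporting_dim_state_gold
SELECT state_id, state_name, load_timestamp
FROM stage3_dim_state;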
Remember: now we have created all the staging tables and loaded the data into the reporting table, which is the dimension table. After this is complete, clean up the notebook by deleting the unnecessary commands or queries that we used to check the results, and run all cells to ensure that the data flows seamlessly.
89. Dimensional Table load - populate all the tables in the gold layer.
Now we create the next dimension table in the same way as the previous one, but remember to change the table and column names properly.
The most common, almost mandatory, table among the reporting tables in a gold layer is the time table; sometimes it is keyed on a date, sometimes on a timestamp, but either way it is related to time. Here, the data is based on the DATE.
Go to the Workspace, create a new notebook and save it, then connect to and start the cluster.
Import pandas, use pandas.date_range(start='...', end='...') and assign the result to a DataFrame, convert the pandas DataFrame into a Spark DataFrame, and use the display() function to display it.
Create the column with the calendar date values and convert the DataFrame into a temp view, then into a table. Now cast the datatype from string to date, add the current timestamp, and the table is ready to be loaded into the gold layer as a reporting (dimension) table.
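The lesson builds the calendar via pandas; an equivalent pure Spark SQL sketch, using sequence() and explode() with an illustrative table name and date range, would be:

CREATE OR REPLACE TABLE dim_date AS
SELECT
  calendar_date,
  current_timestamp() AS load_timestamp
FROM (
  -- One row per day between the two (illustrative) boundary dates.
  SELECT explode(sequence(DATE'2020-01-01', DATE'2024-12-31',
                          INTERVAL 1 DAY)) AS calendar_date
) AS d;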
Step 1: map the source table in the silver layer to create the fact table.
Step 2: link the surrogate keys to the fact table.
Likewise, just add the table name to the join query and pull the required columns from each dimension table. Now we have all the columns required for the fact table, and we can insert them into the fact table we created in the gold layer.
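A sketch of that fact-table load, joining the silver source to each dimension to pick up the surrogate keys; every name here is an assumption:

INSERT INTO reporting_fact_sales_gold
SELECT
  d.state_id,                        -- surrogate key from the state dimension
  t.calendar_date,                   -- key from the date dimension
  s.sales_amount,                    -- measure from the silver source
  current_timestamp() AS load_timestamp
FROM silver_sales AS s
JOIN Reporting_dim_state_gold AS d ON s.state_name = d.state_name
JOIN dim_date AS t ON s.sale_date = t.calendar_date;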
So far we have only built the load for new data, but existing data also changes. Mostly, the data in dimension tables does not change often, because it is descriptive (for example, product attributes); but when it does change over time, we need to use the SCD (Slowly Changing Dimension) concept.
93. SCD TYPE 1 processing - part 1
Very important interview questions:
Cloning one table into another
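In Databricks / Delta Lake this can be done with the CLONE command; a minimal sketch with illustrative table names:

-- DEEP CLONE copies the data; SHALLOW CLONE copies only the metadata.
CREATE TABLE dim_product_clone DEEP CLONE dim_product;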
Here, we just replace the existing product type with a new product type, which is onion. Then we need to update the table names in all the staging-table creation steps, all the way through to the reporting and fact tables.
Other interview questions: loading a table using the MERGE statement.
In SQL, the MERGE statement is a powerful tool that allows you to perform INSERT, UPDATE, or DELETE operations on a target table based on the results of a join with a source table. It essentially combines these three data manipulation language (DML) operations into a single statement.

MERGE INTO target_table [AS target_alias]
USING source_table [AS source_alias]
ON (merge_condition)
WHEN MATCHED THEN
    merge_matched_action
[WHEN MATCHED AND condition THEN
    merge_matched_action]
WHEN NOT MATCHED [BY TARGET] THEN
    merge_not_matched_action
[WHEN NOT MATCHED [BY TARGET] AND condition THEN
    merge_not_matched_action]
[WHEN NOT MATCHED BY SOURCE THEN
    merge_not_matched_by_source_action
[WHEN NOT MATCHED BY SOURCE AND condition THEN
    merge_not_matched_by_source_action]];
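As a concrete SCD Type 1 sketch (overwrite the changed attribute in place), matching on an assumed business key; the table and column names are illustrative, not the course's exact ones:

MERGE INTO Reporting_dim_product_gold AS tgt
USING stage_dim_product AS src
ON tgt.product_name = src.product_name             -- business key match
WHEN MATCHED THEN UPDATE SET
  tgt.product_type   = src.product_type,           -- Type 1: overwrite in place
  tgt.load_timestamp = current_timestamp()
WHEN NOT MATCHED THEN INSERT
  (product_id, product_name, product_type, load_timestamp)
  VALUES (src.product_id, src.product_name, src.product_type,
          current_timestamp());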
94. SCD TYPE 1 - part 2
Run the query against the reporting table; now we can see that the product type has been changed.
We must keep both the original product row and the changed row, not just overwrite the column. But if we simply add rows like that, confusion will arise, because the product ID is the identity column (surrogate key) that has to be linked to the fact table; so we add start-date and end-date columns for clarification. In addition, some projects also keep a record flag, set to Y for the current row.
To create the table quickly, we can use SHOW CREATE TABLE on an existing table and just copy, edit, and paste the output.
Always add the timestamp at the end of the create query.
CREATE THE TABLE FOR SCD TYPE 2
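A sketch of what such an SCD Type 2 table might look like, with validity columns and a current-record flag; the names and types are assumptions:

CREATE TABLE reporting_dim_product_scd2 (
  product_id     BIGINT,      -- surrogate key linked to the fact table
  product_name   STRING,      -- business key
  product_type   STRING,      -- tracked attribute
  start_date     TIMESTAMP,   -- when this version became effective
  end_date       TIMESTAMP,   -- NULL while the version is current
  record_flag    STRING,      -- 'Y' for the current row
  load_timestamp TIMESTAMP    -- timestamp added at the end, per the note above
);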
96. SCD TYPE 2 processing - Take 1.
Keep the tables that were created during the SCD Type 1 implementation as they are. Change the process name from Type 1 to Type 2.
Module Summary
Spark SQL Functionalities:
Create a new table from an existing table (CTAS) while removing duplicate rows.
Apply incremental processing and deduplication to process data from bronze to silver.
Implement Change Data Capture (CDC) using business key columns between source and target tables.
Use CREATE OR REPLACE TABLE to create intermediate temp tables.
Identify how to query the three-level namespace (catalog.schema.table) using Unity Catalog.