
Session 10 - Reporting the dimension and fact tables

 


Spark SQL is an ideal choice when the data is in a structured format. Go to the SQL editor and save the file.


We are going to create the reporting tables mentioned above.




We create this query to populate the first dimension table.



If we have to apply multiple transformations, we can do it by chaining the transformations, but if we want to do it across different source tables, we have to do it through staging, starting with a stage 1 table.
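As a sketch (the schema and table names here are placeholders, not the course's exact ones), the stage 1 step can be a simple CTAS that captures the distinct values from the silver source:

```sql
-- Hypothetical names: silver.daily_pricing_silver is the source, gold.stage1_state the stage 1 table.
-- Stage 1: capture the distinct state values from the silver source via CTAS.
CREATE OR REPLACE TABLE gold.stage1_state AS
SELECT DISTINCT state
FROM silver.daily_pricing_silver;
```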



Now we are adding the surrogate key to the dimension table.



86. Dimension table load - generate surrogate keys.


Here we generate the surrogate keys in the State ID column for the unique values obtained by the CTAS above. The ROW_NUMBER() window function creates the surrogate key (identity column).


Then we add the timestamp, and the query is ready for reporting: we insert into the table Reporting_dim_state_gold (the reporting table) using an INSERT ... SELECT statement.
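A minimal sketch of this query, assuming the stage 1 table from above; the reporting table's exact columns may differ:

```sql
-- Sketch: generate surrogate keys with ROW_NUMBER(), add a timestamp,
-- and load the dimension via INSERT ... SELECT. Column names are assumed.
INSERT INTO gold.reporting_dim_state_gold (state_id, state, load_timestamp)
SELECT
  ROW_NUMBER() OVER (ORDER BY state) AS state_id,   -- surrogate key (identity column)
  state,
  current_timestamp()                AS load_timestamp
FROM gold.stage1_state;
```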



Check that the data has been loaded into the reporting table and that all the values are unique, because the surrogate keys (identity column) of a dimension table have to be unique so they can be linked to the fact table. Here the data got inserted twice, which is why we ended up with duplicate rows, so always review before loading into the table.


87. Dimension Table Load - Change Data Capture (CDC).

source table --> Silver layer table
reporting table --> Gold layer table

Now we compare both tables using a join query to check that all the data has landed properly in the reporting table.




If we rerun the comparison above, we can identify the newly added values (CDC) that need to be added to the reporting table.
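One way to write this comparison, as a sketch with assumed names, is a LEFT ANTI JOIN that returns only the silver values not yet present in the reporting table:

```sql
-- Sketch: rows returned here are the newly arrived (CDC) values
-- that still need surrogate keys and a load into the reporting table.
SELECT s.state
FROM (SELECT DISTINCT state FROM silver.daily_pricing_silver) s
LEFT ANTI JOIN gold.reporting_dim_state_gold r
  ON s.state = r.state;
```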


Here the state ID has to be unique, but the new rows would again get the values 1, 2, ..., and 1 and 2 are already assigned to other states. So we need to know the maximum value already given to the state ID in the reporting table; for that we create one extra table on top of the existing one and name it the stage 2 table.


We extract the max value and cross join it to the existing stage 2 table, so the new surrogate keys automatically continue from it (for example 24, 25).

Always remember that during the first run there will be no records in the reporting table, because the data is loaded in batches.
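A hedged sketch of the stage 2 step, with assumed names: the current maximum surrogate key is cross joined onto the new records so their keys continue from it, and COALESCE covers the empty first run.

```sql
-- Sketch: offset the new surrogate keys by the current maximum in the reporting table.
-- COALESCE(MAX(...), 0) handles the very first batch run, when the table is still empty.
SELECT
  n.state,
  m.max_state_id + ROW_NUMBER() OVER (ORDER BY n.state) AS state_id
FROM gold.stage2_new_state n
CROSS JOIN (
  SELECT COALESCE(MAX(state_id), 0) AS max_state_id
  FROM gold.reporting_dim_state_gold
) m;
```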

Now we are completely ready to load the data into the reporting table.

88. Dimension table load - Load Reporting_DIM_state_load (reporting table).
 
We are going to create the stage 3 (intermediate) table from the join query with the existing stage 2 table, with the current and previous timestamps added.



Remember: Always use the proper name for an alias.

Truncate the gold table (the reporting table), which needs to be refreshed because the data is now fully available in the stage 3 table, and insert the values using an INSERT INTO ... SELECT statement.
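As a sketch (stage 3 table and columns assumed), the refresh is just a truncate followed by an INSERT ... SELECT:

```sql
-- Sketch: full refresh of the gold reporting table from the stage 3 table.
TRUNCATE TABLE gold.reporting_dim_state_gold;

INSERT INTO gold.reporting_dim_state_gold
SELECT state_id, state, load_timestamp
FROM gold.stage3_state;
```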


Then check that all the data got loaded into the reporting table.

Remember: now we have created all the staging tables and loaded the data into the reporting table, which is the dimension table. After this is complete, clean up the notebook by deleting the unnecessary commands or queries that we used to check the results, and run all cells to ensure that the data flows seamlessly.

89. Dimensional Table load - populate all the tables in the gold layer.

Now we create the other dimension tables in the same way as the previous one, but remember to change the table names and columns properly.




After creating all the tables, we write the query below to check how the data loaded in an earlier run gets updated after one more run.


90. Dimensional DATE table processing:

The most common, almost mandatory, reporting table in the gold layer is the time table; sometimes it is a date or a timestamp, but one way or another it is related to time.

Here, the data is based on the DATE.



We use the pandas library to build the date table.

Go to the Workspace, create a new notebook, save it, then attach it to the cluster and start the cluster.

Import pandas, use pandas.date_range(start=' ', end=' ') and assign it to a DataFrame, convert the pandas DataFrame into a Spark DataFrame, and use the display() function to display it.




Create the column with the calendar date values and convert the DataFrame into a temp view so it can be turned into a table.

Then change the datatype from string to date.


Add the current timestamp, and the table is ready to be loaded into the gold layer as a reporting (dimension) table.
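A minimal SQL sketch of these last steps, assuming the Spark DataFrame was registered as a temp view named calendar_vw with a string column calendar_date (names are placeholders):

```sql
-- Sketch: cast the string column to DATE, add the load timestamp,
-- and materialise the date dimension in the gold layer.
CREATE OR REPLACE TABLE gold.reporting_dim_date_gold AS
SELECT
  CAST(calendar_date AS DATE) AS calendar_date,
  current_timestamp()         AS load_timestamp
FROM calendar_vw;
```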


91. Fact table - daily pricing table (Gold layer)

Step 1: map the source table in the silver layer to create the fact table.
Step 2: link the surrogate keys to the fact table.





These are the fact table columns that hold the metrics information.



We pick up the surrogate key (identity column) using a SELECT with a join query.


We use the lakehouse process_run table to get only the up-to-date data.



Likewise, for each dimension just add its table name to the join query and pick the surrogate key column from that dimension table.


Now we add all the metrics columns from the source table and the current timestamp in the same query.


Now we have all the columns required for the fact table, and we can insert them into the fact table we created in the gold layer.



This INSERT command is used for the incremental load.
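Putting the pieces together, a sketch of the fact load might look like the following; all table and column names are assumptions, not the course's exact ones, and the filter against the process_run control table (for the incremental part) is only indicated in a comment:

```sql
-- Sketch: pick up surrogate keys from the dimension tables via joins,
-- carry the metric columns from the silver source, and stamp the load time.
-- For the incremental load, a WHERE clause driven by the process_run control
-- table would restrict this SELECT to newly arrived rows.
INSERT INTO gold.reporting_fact_daily_pricing_gold
SELECT
  ds.state_id,
  dp.product_id,
  dd.calendar_date,
  s.arrival_quantity,
  s.modal_price,
  current_timestamp() AS load_timestamp
FROM silver.daily_pricing_silver s
JOIN gold.reporting_dim_state_gold   ds ON s.state        = ds.state
JOIN gold.reporting_dim_product_gold dp ON s.product_name = dp.product_name
JOIN gold.reporting_dim_date_gold    dd ON s.price_date   = dd.calendar_date;
```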


92. Slowly Changing Dimensions table load - SCD Type 1 and Type 2:

What we have created so far only loads the data; we also have to handle changes to existing data.




Mostly, the data in dimension tables does not change often because it is descriptive (for example, product attributes). But when it does change over time, we need to use the SCD concept.

93. SCD Type 1 processing - part 1

Very important interview question:

  Cloning one table into another 


CREATE TABLE catalog.schema.table_name LIKE catalog.schema.tablename
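As a quick hedged example with placeholder names (CREATE TABLE ... LIKE copies only the schema, not the data):

```sql
-- Sketch: clone the structure of an existing dimension table into a new, empty table.
CREATE TABLE catalog.schema.dim_product_clone
LIKE catalog.schema.dim_product;
```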



Here, we just replace the existing product type with a new product type, which is onion. Then we need to update the table names in all the staging table creation steps, all the way through to loading the reporting and fact tables.



Another interview question: loading the table using the MERGE statement.

In SQL, the MERGE statement is a powerful tool that allows you to perform INSERT, UPDATE, or DELETE operations on a target table based on the results of a join with a source table. It essentially combines these three data manipulation language (DML) operations into a single statement.

MERGE INTO target_table [AS target_alias]
USING source_table [AS source_alias]
ON (merge_condition)
WHEN MATCHED THEN merge_matched_action
[WHEN MATCHED AND condition THEN merge_matched_action]
WHEN NOT MATCHED [BY TARGET] THEN merge_not_matched_action
[WHEN NOT MATCHED [BY TARGET] AND condition THEN merge_not_matched_action]
[WHEN NOT MATCHED BY SOURCE THEN merge_not_matched_by_source_action
  [WHEN NOT MATCHED BY SOURCE AND condition THEN merge_not_matched_by_source_action]];
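For example, here is a hedged sketch of an SCD Type 1 load with MERGE (table and column names are assumptions): matched product rows are updated in place, unmatched ones are inserted.

```sql
-- Sketch: SCD Type 1 via MERGE - overwrite changed attributes, insert new products.
MERGE INTO gold.reporting_dim_product_gold AS target
USING gold.stage_product AS source
ON target.product_name = source.product_name
WHEN MATCHED THEN
  UPDATE SET target.product_type   = source.product_type,
             target.load_timestamp = current_timestamp()
WHEN NOT MATCHED THEN
  INSERT (product_id, product_name, product_type, load_timestamp)
  VALUES (source.product_id, source.product_name, source.product_type, current_timestamp());
```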


94. SCD TYPE 1 - part 2 

Run the query on the reporting table; now we can see that the product type has been changed.


95. SCD Type 2 overview:

We must keep both the change and the original product record, rather than just overwriting the changed column.


Here we have both the changed record as well as the previous record.

If we add rows like that, confusion will arise because the product ID is the identity column (surrogate key) that needs to be linked to the fact table, so we add start date and end date columns to make the versions clear.



Additionally, some projects might also keep a current-record flag set to Y.

To create the table, we can use SHOW CREATE TABLE. To create the table quickly, just copy and paste the output to create a table.

Always add the timestamp at the end of the create query.


CREATE THE TABLE FOR SCD TYPE 2 
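A sketch of such a table, with assumed names and types, including the start/end dates, the optional record flag, and the timestamp added at the end:

```sql
-- Sketch: SCD Type 2 dimension table with versioning columns.
CREATE TABLE IF NOT EXISTS gold.reporting_dim_product_scd2_gold (
  product_id     BIGINT,      -- surrogate key linked to the fact table
  product_name   STRING,
  product_type   STRING,
  start_date     DATE,        -- when this version of the record became effective
  end_date       DATE,        -- NULL (or a far-future date) while the version is current
  record_flag    STRING,      -- optional 'Y'/'N' current-record indicator
  load_timestamp TIMESTAMP    -- timestamp column added at the end of the CREATE
);
```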

96. SCD TYPE 2 processing - Take 1.




Keep the tables that were created during the SCD Type 1 implementation as they are.



Change the process name from type 1 to type 2.


Remove the existing dimension table and add the new product ID column to implement SCD Type 2.


The CASE WHEN statement is important for interview questions.
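As a small hedged example tied to the SCD Type 2 columns sketched above, CASE WHEN can derive the current-record flag from the end date:

```sql
-- Sketch: flag the current version of each product ('Y') versus expired versions ('N').
SELECT
  product_id,
  product_name,
  product_type,
  start_date,
  end_date,
  CASE WHEN end_date IS NULL THEN 'Y' ELSE 'N' END AS record_flag
FROM gold.reporting_dim_product_scd2_gold;
```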


99. Scheduling the reporting, dimension, and fact table loads.


